sba [2024/10/27 18:11] (current) – pedroortega
The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls the representation quality of the target distribution) to obtain a satisfactory training time.
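One possible reading of this update can be sketched as follows. This is a minimal illustration, not the article's exact algorithm: the trajectory set, the utilities, the inverse temperature ''beta'', and the exponentially-tilted target $\propto P_t(\tau)\exp(\beta U(\tau))$ are all assumptions made for the sake of the example. It shows the roles of $\alpha$ (how aggressively the empirical target is mixed into the policy) and $N$ (how many samples represent that target).

```python
import math
import random

random.seed(0)

# Hypothetical toy setup (assumed, not from the article): a tiny discrete
# set of trajectories (strings) with utilities.
trajectories = ["aa", "ab", "ba", "bb"]
utility = {"aa": 1.0, "ab": 0.2, "ba": 0.2, "bb": -1.0}

beta = 2.0    # inverse temperature of the assumed tilted target (illustrative)
alpha = 0.1   # step size: how far each iteration moves toward the target
N = 500       # number of samples representing the target distribution

# Start from a uniform prior P_0(tau).
P = {tau: 1.0 / len(trajectories) for tau in trajectories}

def sample(dist):
    """Draw one trajectory from a dict mapping trajectory -> probability."""
    r, acc = random.random(), 0.0
    for tau, p in dist.items():
        acc += p
        if r < acc:
            return tau
    return tau  # guard against floating-point round-off

for t in range(200):
    # Draw N trajectories from the current policy and weight each draw by
    # exp(beta * U(tau)).  Self-normalizing these weights gives an empirical
    # (importance-sampling) estimate of the tilted target distribution.
    weights = {tau: 0.0 for tau in trajectories}
    for _ in range(N):
        tau = sample(P)
        weights[tau] += math.exp(beta * utility[tau])
    Z = sum(weights.values())
    target_hat = {tau: w / Z for tau, w in weights.items()}

    # Mix the empirical target into the policy with step size alpha.
    P = {tau: (1 - alpha) * P[tau] + alpha * target_hat[tau]
         for tau in trajectories}

# After training, high-utility strings carry most of the probability mass.
```

Smaller $\alpha$ gives a smoother but slower drift of the policy, while larger $N$ reduces the Monte Carlo noise in each step's estimate of the target; the two jointly determine how long training takes to settle.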
===== Adding the context =====
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, we often want policies to respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from $P(\tau|c)$,