The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls the representation quality of the target distribution) to obtain a satisfactory training time.
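The interplay of $\alpha$ and $N$ can be sketched as follows. This is a minimal illustration, not the page's actual algorithm: it assumes the target distribution is approximated by $N$ sampled strings reweighted by exponentiated utility, with $\alpha$ interpolating between the prior and that target; the names `utilities` and `beta` are illustrative assumptions.

```python
import numpy as np

def update_policy_weights(log_prior, utilities, alpha, beta=1.0):
    """One illustrative update step: move the prior toward a target
    distribution built from N sampled strings.

    log_prior : (N,) log-probabilities of the N samples under P_t
    utilities : (N,) utilities U(tau) of the samples (assumed given)
    alpha     : step size in [0, 1]
    beta      : inverse temperature (illustrative assumption)
    """
    # Target weights proportional to P_t(tau) * exp(beta * U(tau)).
    log_target = log_prior + beta * utilities
    log_target -= np.logaddexp.reduce(log_target)  # normalize in log space
    # Interpolate between prior and target with step size alpha.
    prior = np.exp(log_prior - np.logaddexp.reduce(log_prior))
    target = np.exp(log_target)
    return (1 - alpha) * prior + alpha * target

# N = 4 samples, uniform prior; mass shifts toward high-utility strings.
weights = update_policy_weights(
    np.log(np.ones(4) / 4),
    np.array([1.0, 0.5, 0.0, -1.0]),
    alpha=0.3,
)
```

Larger $\alpha$ moves faster toward the target at the cost of stability, while larger $N$ gives a finer-grained approximation of the target at the cost of more samples per step.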
===== Adding the context =====
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, often we want policies to respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from $P(\tau|c)$.
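For an autoregressive model, sampling from $P(\tau|c)$ amounts to seeding generation with the context $c$ and extending it token by token. The sketch below assumes a hypothetical interface `step_probs(prefix)` returning next-token probabilities; the toy model, the `"</s>"` end token, and `max_len` are illustrative assumptions, not part of the page's method.

```python
import random

def sample_given_context(step_probs, context, max_len=20, end="</s>"):
    """Sample a string from P(tau | c) by conditioning on the context.

    step_probs(prefix) -> dict mapping next token to probability
    (hypothetical interface assumed for this sketch).
    """
    tau = list(context)  # the sampled string starts as the context c
    while len(tau) < max_len:
        probs = step_probs(tau)
        tokens, ps = zip(*probs.items())
        tok = random.choices(tokens, weights=ps)[0]
        if tok == end:
            break
        tau.append(tok)
    return tau

# Toy model for illustration: emit "x" until the string has 4 tokens, then stop.
def toy_step_probs(prefix):
    return {"</s>": 1.0} if len(prefix) >= 4 else {"x": 1.0}

sampled = sample_given_context(toy_step_probs, ["c"])
```

The same conditioning idea carries over to the bounded-rational policy: the context fixes the prefix, and the policy only redistributes probability over the continuations.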
==== Enter memory-constrained agents ====
\[