We need the following ingredients (see the sketch after this list):
* an alphabet $X$ over which we define the set of strings $X^\ast$;
* a prior distribution $P(\tau)$ over the strings $X^\ast$ to generate samples;
* a reward model $R(\tau) \in \mathbb{R}$ over strings.
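For concreteness, here is a minimal Python sketch of what these three ingredients might look like; the alphabet, the toy prior, and the toy reward model below are illustrative assumptions rather than the choices used on this page.

<code python>
import random

# Illustrative stand-ins for the three ingredients (names and choices are hypothetical).

ALPHABET = ["a", "b", "c"]  # the alphabet X

def sample_prior(max_len=10):
    """Draw a string tau from a toy prior P(tau) over X*:
    geometric length, symbols drawn i.i.d. uniformly."""
    tau = []
    while len(tau) < max_len and random.random() < 0.9:
        tau.append(random.choice(ALPHABET))
    return "".join(tau)

def reward(tau):
    """A toy reward model R(tau): here, reward strings that contain 'ab'."""
    return 1.0 if "ab" in tau else 0.0
</code>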
Because the optimal policy is a distribution, we can draw samples from it with the following rejection-sampling procedure (sketched in code after the list):
* Generate a string $\tau \sim P(\tau)$.
* Generate a uniform random variate $u \sim U(0, 1)$.
* If $u \leq \exp(\beta R(\tau) - \beta R^\ast)$ return $\tau$.
* Else, repeat the procedure.
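Below is a minimal Python sketch of this rejection-sampling step. It assumes the `sample_prior` and `reward` stand-ins from the earlier sketch and that `r_star` ($R^\ast$) upper-bounds the reward, so the acceptance probability never exceeds one.

<code python>
import math
import random

def sample_bounded_rational(sample_prior, reward, beta, r_star, max_tries=100_000):
    """Rejection-sample tau from the target proportional to P(tau) * exp(beta * R(tau)).
    r_star is assumed to be an upper bound on R(tau)."""
    for _ in range(max_tries):
        tau = sample_prior()                              # tau ~ P(tau)
        u = random.random()                               # u ~ U(0, 1)
        if u <= math.exp(beta * (reward(tau) - r_star)):  # accept with prob exp(beta (R - R*))
            return tau
        # else: reject and repeat
    raise RuntimeError("no sample accepted; check beta and r_star")
</code>

Note that the acceptance rate shrinks as $\beta$ grows, which is the usual price of rejection sampling from a sharper target.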
- **Repeat**: Set $t \leftarrow t + 1$ and repeat from step 2.
The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls the representation quality of the target distribution) to obtain a satisfactory training time.
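Only the final step of the loop is reproduced above, so the following Python sketch is just one possible instantiation under explicit assumptions: the prior is a categorical distribution over a fixed finite candidate set, each iteration represents the target with $N$ accepted samples, and the prior is mixed toward their empirical distribution with step size $\alpha$. All names are hypothetical.

<code python>
import math
import random
from collections import Counter

def train(candidates, init_probs, reward, beta, r_star, alpha=0.1, N=100, iters=50):
    """Toy loop over a finite candidate set: at each iteration, represent the
    target with N rejection-sampled strings, then move the prior toward their
    empirical distribution with step size alpha."""
    probs = dict(init_probs)  # current prior P_t(tau) for tau in candidates
    for _ in range(iters):
        accepted = []
        while len(accepted) < N:
            tau = random.choices(candidates, weights=[probs[c] for c in candidates])[0]
            if random.random() <= math.exp(beta * (reward(tau) - r_star)):
                accepted.append(tau)
        counts = Counter(accepted)
        for c in candidates:  # P_{t+1} = (1 - alpha) * P_t + alpha * empirical target
            probs[c] = (1 - alpha) * probs[c] + alpha * counts[c] / N
    return probs
</code>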
===== Adding the context =====
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, often we want policies to respond to a user-provided context $c$, i.e. we want to sample strings from $P(\tau|c)$.
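One way to illustrate this (an assumption, not necessarily the construction intended here) is to condition both the prior and the reward on $c$ and reuse the same rejection step; `sample_prior_given` and `reward_given` are hypothetical names.

<code python>
import math
import random

def sample_given_context(c, sample_prior_given, reward_given, beta, r_star, max_tries=100_000):
    """Illustrative conditional variant: rejection-sample tau from a target
    proportional to P(tau | c) * exp(beta * R(tau, c))."""
    for _ in range(max_tries):
        tau = sample_prior_given(c)                            # tau ~ P(tau | c)
        if random.random() <= math.exp(beta * (reward_given(tau, c) - r_star)):
            return tau
    raise RuntimeError("no sample accepted for this context")
</code>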
==== Enter memory-constrained agents ====
\[