We need the following ingredients:
  * an alphabet $X$ over which we define the set of strings $X^\ast$;
  * a prior distribution $P(\tau)$ over the strings $X^\ast$ to generate samples;
  * a reward model $R(\tau) \in \mathbb{R}$ over strings.
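
To make these ingredients concrete, here is a minimal, purely hypothetical instantiation in Python: the toy alphabet, the prior sampler ''sample_prior'' and the reward ''reward'' are illustrative stand-ins, not part of the text above (in practice the prior would typically be a pretrained language model).

<code python>
import random

# Toy, purely illustrative instantiation of the three ingredients.
ALPHABET = list("abc")   # the alphabet X
MAX_LEN = 8              # cap on string length for this toy example

def sample_prior() -> str:
    """Draw a string tau ~ P(tau): here, i.i.d. characters of random length."""
    length = random.randint(1, MAX_LEN)
    return "".join(random.choice(ALPHABET) for _ in range(length))

def reward(tau: str) -> float:
    """A toy reward model R(tau): count occurrences of the substring 'ab'."""
    return float(tau.count("ab"))
</code>
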
Because the optimal policy is a distribution, we can draw samples from it with the following rejection-sampling procedure (sketched in code after the list):
  * Generate a string $\tau \sim P(\tau)$.
  * Generate a uniform random variate $u \sim U(0, 1)$.
  * If $u \leq \exp(\beta R(\tau) - \beta R^\ast)$, return $\tau$.
  * Else, repeat the procedure.
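
A minimal sketch of this procedure, assuming the toy ''sample_prior'' and ''reward'' from above and a known upper bound ''r_max'' on the reward (playing the role of $R^\ast$):

<code python>
import math
import random

def sample_bounded_rational(sample_prior, reward, beta: float, r_max: float) -> str:
    """Rejection-sample tau from the policy proportional to P(tau) * exp(beta * R(tau)).

    r_max must upper-bound the reward so that the acceptance probability
    exp(beta * (R(tau) - r_max)) never exceeds 1.
    """
    while True:
        tau = sample_prior()      # generate tau ~ P(tau)
        u = random.random()       # generate u ~ U(0, 1)
        if u <= math.exp(beta * (reward(tau) - r_max)):
            return tau            # accept
        # otherwise reject and repeat the procedure

# Example usage with the toy ingredients above (beta and r_max chosen arbitrarily):
# tau = sample_bounded_rational(sample_prior, reward, beta=2.0, r_max=4.0)
</code>
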
The inverse temperature parameter controls how many candidates you'll have to generate before accepting a sample. If $\beta$ is close enough to zero, then almost every sample gets accepted. But if $\beta$ is very large, it might take forever to obtain a sample, especially if your reward function is structured like a **needle-in-a-haystack** problem. How can we mitigate this?
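
To quantify this: a candidate drawn from the prior is accepted with probability $\exp(\beta (R(\tau) - R^\ast))$, so the expected number of candidates per accepted sample is
\[
\frac{1}{\mathbb{E}_{\tau \sim P}\left[ e^{\beta (R(\tau) - R^\ast)} \right]}
= \frac{e^{\beta R^\ast}}{\mathbb{E}_{\tau \sim P}\left[ e^{\beta R(\tau)} \right]},
\]
which blows up when strings with near-maximal reward are exponentially rare under $P$.
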
==== Follow the geodesic ====
Rather than sampling directly from the optimal policy, we can approach it by taking smaller, intermediate steps, **following a geodesic** in information geometry.
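
One natural reading of this path (the exponential-family geodesic connecting the prior at inverse temperature $0$ to the optimal policy at inverse temperature $\beta$) is that the intermediate distributions have the form
\[
P_t(\tau) \propto P(\tau)\, e^{t \alpha R(\tau)}, \qquad t = 0, 1, 2, \ldots,
\]
so that each step raises the effective inverse temperature by a small increment $\alpha$ until the target $\beta$ is reached. This is the reading assumed in the sketches below.
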
  - **Initialization**: Set $t = 0$ and let $P_0(\tau) = P(\tau)$ be the prior model with parameters $\theta_0$.
  - **Generate better trajectories**: Generate $N$ samples $\tau^1, \ldots, \tau^N$ from the intermediate policy proportional to $P_t(\tau) \exp(\alpha R(\tau))$, using the rejection-sampling procedure above with the small inverse temperature $\alpha$ in place of $\beta$.
  - **Stopping condition**: If the accumulated inverse temperature $t \alpha$ has reached the target $\beta$, stop and return the current model $P_t(\tau)$.
  - **Fine-tune prior model**: Fine-tune the model $P_t(\tau)$ on the generated samples $\tau^1, \ldots, \tau^N$. This yields a new model $P_{t+1}(\tau)$ with parameters $\theta_{t+1}$.
  - **Repeat**: Set $t \leftarrow t + 1$ and repeat from step 2.

The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls how faithfully the samples represent the target distribution) to obtain a satisfactory training time.
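
The loop below is a schematic sketch of these steps. It reuses ''sample_bounded_rational'' from above; the helper ''finetune'' (which fits a new sampler to the accepted strings, e.g. by maximum-likelihood fine-tuning of the underlying model) is hypothetical and left unspecified, and the stopping rule $t\alpha \geq \beta$ follows the geodesic reading given earlier.

<code python>
def geodesic_finetuning(sample_prior, reward, finetune,
                        alpha: float, beta: float, n_samples: int, r_max: float):
    """Approach the inverse-temperature-beta policy in small steps of size alpha.

    finetune(samples) is assumed to return a new sampler (a callable like
    sample_prior) fitted to the accepted strings; how that is done is left open.
    """
    sampler = sample_prior                 # Initialization: P_0 = P, t = 0
    t = 0
    while t * alpha < beta:                # Stopping condition (geodesic reading)
        # Generate better trajectories from P_t(tau) * exp(alpha * R(tau))
        # with the rejection sampler above, run at inverse temperature alpha.
        samples = [
            sample_bounded_rational(sampler, reward, beta=alpha, r_max=r_max)
            for _ in range(n_samples)
        ]
        sampler = finetune(samples)        # Fine-tune prior model: yields P_{t+1}
        t += 1                             # Repeat with t <- t + 1
    return sampler                         # the bounded-rational policy P_t
</code>
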
===== Adding a user-provided context =====

The above algorithm generates a new prior $P_t(\tau)$ which places more weight on desirable strings. However, we often want policies that respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from the conditional distribution $P(\tau|c)$.

==== Enter memory-constrained agents ====

\[