====== sba ======

We need the following ingredients (a minimal code sketch follows the list):
  * an alphabet $X$ over which we define the set of strings $X^\ast$;
  * a prior distribution $P(\tau)$ over the strings $X^\ast$ to generate samples, e.g. implemented using an LLM;
  * a reward model $R(\tau) \in \mathbb{R}$ over strings.
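Here is a minimal sketch of these ingredients, with a toy alphabet, a placeholder ''sample_prior'' standing in for an LLM-backed prior $P(\tau)$, and a toy ''reward'' model. All names and bodies below are illustrative placeholders, not fixed by the setup above.

<code python>
import random

# Alphabet X; strings are finite sequences over X, i.e. elements of X*.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz ")

def sample_prior(max_len: int = 20) -> str:
    """Placeholder for the prior P(tau) over strings, e.g. an LLM sampler.
    Here: uniformly random strings, purely for illustration."""
    length = random.randint(1, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(length))

def reward(tau: str) -> float:
    """Placeholder reward model R(tau) in the reals; here it counts vowels."""
    return float(sum(tau.count(v) for v in "aeiou"))
</code>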
  
First we'll briefly review bounded-rational policies motivated by information theory. Such policies have many properties. An incomplete list is:
  * they have a controllable policy search **complexity**;
  * they are **robust**, in the sense that they protect against adversarial reward perturbations.
  
Rather than optimizing the expected reward, bounded-rational policies optimize the **free energy objective**.
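Writing $\pi$ for the policy over strings, one standard way to state this objective and the policy that maximizes it is

\[
F[\pi] \;=\; \sum_{\tau} \pi(\tau)\, R(\tau) \;-\; \frac{1}{\beta} \sum_{\tau} \pi(\tau) \log \frac{\pi(\tau)}{P(\tau)},
\qquad
\pi^\ast(\tau) \;=\; \frac{P(\tau)\, e^{\beta R(\tau)}}{\sum_{\tau'} P(\tau')\, e^{\beta R(\tau')}},
\]

where $\beta > 0$ is the inverse temperature. The optimal policy is the prior reweighted exponentially by the reward, which is exactly the distribution that the rejection sampling procedure below targets.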
Because the optimal policy is a distribution, acting optimally means obtaining a sample. There are many ways to do this, but the most straightforward is **rejection sampling** (a code sketch follows the list below). This works as follows:
  
  * Generate a string $\tau \sim P(\tau)$.
  * Generate a uniform random variate $u \sim U(0, 1)$.
  * If $u \leq \exp(\beta R(\tau) - \beta R^\ast)$, where $R^\ast$ is an upper bound on the reward, return $\tau$.
  * Else, repeat the procedure.
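Here is the promised sketch, reusing the placeholder ''sample_prior'' and ''reward'' from the ingredients sketch above; ''R_star'' denotes the reward upper bound $R^\ast$.

<code python>
import math
import random

def sample_bounded_rational(beta: float, R_star: float) -> str:
    """Rejection sampling from pi(tau) proportional to P(tau) * exp(beta * R(tau)).

    R_star must upper-bound R(tau), so the acceptance ratio is at most 1.
    The per-draw acceptance probability is E_P[exp(beta * (R(tau) - R_star))];
    its reciprocal is the expected number of draws, which grows with beta.
    """
    while True:
        tau = sample_prior()                              # tau ~ P(tau)
        u = random.random()                               # u ~ U(0, 1)
        if u <= math.exp(beta * (reward(tau) - R_star)):
            return tau                                    # accept
        # otherwise reject and repeat the procedure
</code>

For the toy reward above, a call such as ''sample_bounded_rational(beta=0.1, R_star=20.0)'' accepts after a modest number of draws; larger values of ''beta'' sharpen the policy but lower the acceptance rate.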
The inverse temperature parameter $\beta$ controls how many candidates you'll have to generate before accepting a sample. If $\beta$ is close enough to zero, then almost every sample gets accepted. But if $\beta$ is very large, it might take forever to obtain a sample, especially if your reward function is structured like a **needle-in-a-haystack** problem. How can we mitigate this?
  
==== Follow the geodesic ====
  
Rather than sampling directly from the optimal policy, we can approach it by taking smaller, intermediate steps, **following a geodesic** in information geometry.
  - **Initialization**: Let $P_0(\tau)$ be the starting model with parameters $\theta_0$.
  - **Generate better trajectories**: generate $N$ samples $\tau^1, \ldots, \tau^N$ from the bounded-rational optimal policy using $\alpha$ as the inverse temperature parameter and $P_t(\tau)$ as the prior over strings.
  - **Stopping condition**: If, in addition, all the samples $\tau^1, \ldots, \tau^N$ pass the acceptance test using the target inverse temperature $\beta$ instead of $\alpha$, stop.
  - **Fine-tune prior model**: Fine-tune the prior model $P_t(\tau)$ with the generated samples $\tau^1, \ldots, \tau^N$. This yields a new model $P_{t+1}(\tau)$ with parameters $\theta_{t+1}$.
  - **Repeat**: Set $t \leftarrow t + 1$ and repeat from step 2.
  
The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls how well the samples represent the target distribution) to obtain a satisfactory training time.
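A schematic sketch of this loop follows. The ''model.sample()'' interface stands in for drawing from $P_t(\tau)$, ''fine_tune(model, samples)'' is a caller-supplied placeholder for the actual fine-tuning step, and ''R_star'' is again the reward upper bound $R^\ast$.

<code python>
import math
import random

def geodesic_fine_tuning(model, reward, fine_tune, alpha, beta, R_star, N,
                         max_iters=100):
    """Iteratively fine-tune the prior toward the target bounded-rational policy.

    model     : object with a .sample() method, playing the role of P_t(tau)
    reward    : callable R(tau) -> float
    fine_tune : callable (model, samples) -> updated model  (placeholder)
    alpha     : step-size inverse temperature (milder than beta)
    beta      : target inverse temperature
    R_star    : upper bound on the reward
    N         : number of samples generated per round
    """
    for t in range(max_iters):
        # Step 2: generate N samples from the alpha-bounded-rational policy,
        # using the current model P_t(tau) as the prior over strings.
        samples = []
        for _ in range(N):
            while True:
                tau = model.sample()
                if random.random() <= math.exp(alpha * (reward(tau) - R_star)):
                    samples.append(tau)
                    break

        # Step 3: stopping condition -- do all N samples also pass the
        # acceptance test at the target inverse temperature beta?
        if all(random.random() <= math.exp(beta * (reward(tau) - R_star))
               for tau in samples):
            return model

        # Step 4: fine-tune P_t(tau) on the samples, yielding P_{t+1}(tau).
        model = fine_tune(model, samples)

        # Step 5: t <- t + 1 and repeat from step 2 (handled by the loop).
    return model
</code>
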
===== Adding a user-provided context =====
  
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, we often want policies that respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from $P(\tau|c)$, not $P(\tau)$. The problem is that the contexts $c$ are not generated in a way that conforms to the reward function, so the training procedure above will bias the model.
  
  
  
==== Enter memory-constrained agents ====
  
\[