sba

We need the following ingredients:
  * an alphabet $X$ over which we define the set of strings $X^\ast$;
  * a prior distribution $P(\tau)$ over the strings $X^\ast$ to generate samples, e.g. implemented using an LLM;
  * a reward model $R(\tau) \in \mathbb{R}$ over strings.
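
To make the setup concrete, here is a minimal Python sketch of the three ingredients. The toy alphabet, the uniform prior sampler, and the reward function below are illustrative placeholders only; in practice the prior would be an LLM and the reward model a learned or hand-crafted scorer.

<code python>
import random

# Toy alphabet X (placeholder; in practice the LLM's token vocabulary).
ALPHABET = list("abc")

def sample_prior(max_len=10):
    """Draw a string tau ~ P(tau) over X*.

    Stand-in for an LLM: uniform length, uniform characters.
    """
    length = random.randint(1, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(length))

def reward(tau):
    """Toy reward model R(tau) over strings (any real-valued scorer works)."""
    return tau.count("a") - 0.1 * len(tau)
</code>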
  
Because the optimal policy is a distribution, acting optimally means obtaining a sample. There are many ways to do this, but the most straightforward is **rejection sampling**. This works as follows (a code sketch is given after the list):
  
  * Generate a string $\tau \sim P(\tau)$.
  * Generate a uniform random variate $u \sim U(0, 1)$.
  * If $u \leq \exp(\beta R(\tau) - \beta R^\ast)$ return $\tau$.
  * Else, repeat the procedure.
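
A minimal Python sketch of this loop, reusing the ''sample_prior'' and ''reward'' placeholders above and assuming that ''r_star'' is an upper bound on $R(\tau)$, so that the acceptance probability $\exp(\beta R(\tau) - \beta R^\ast)$ lies in $(0, 1]$:

<code python>
import math
import random

def sample_bounded_rational(sample_prior, reward, beta, r_star):
    """Rejection sampling from the distribution proportional to
    P(tau) * exp(beta * R(tau)).

    r_star is assumed to upper-bound R(tau), so the acceptance
    probability exp(beta * (R(tau) - r_star)) never exceeds 1.
    """
    while True:
        tau = sample_prior()                     # tau ~ P(tau)
        u = random.random()                      # u ~ U(0, 1)
        if u <= math.exp(beta * (reward(tau) - r_star)):
            return tau                           # accept
        # otherwise reject and draw again

# Example call (illustrative parameter values):
# tau = sample_bounded_rational(sample_prior, reward, beta=2.0, r_star=1.0)
</code>

Note that for large $\beta$ the acceptance probability can become very small, so many prior samples may be needed per accepted string.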
  - **Initialization**: Let $P_0(\tau)$ be the starting model with parameters $\theta_0$.
  - **Generate better trajectories**: Generate $N$ samples $\tau^1, \ldots, \tau^N$ from the bounded-rational optimal policy using $\alpha$ as the inverse temperature parameter and $P_t(\tau)$ as the prior over strings.
  - **Stopping condition**: If in addition all the samples $\tau^1, \ldots, \tau^N$ pass the acceptance test using the target inverse temperature $\beta$ instead of $\alpha$, stop.
  - **Fine-tune prior model**: Fine-tune the prior model $P_t(\tau)$ on the generated samples $\tau^1, \ldots, \tau^N$. This yields a new model $P_{t+1}(\tau)$ with parameters $\theta_{t+1}$.
  - **Repeat**: Set $t \leftarrow t + 1$ and repeat from step 2.
  
The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls how well the samples represent the target distribution) to obtain a satisfactory training time. A code sketch of the loop is given below.
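
The following Python sketch puts the loop together. It assumes a hypothetical interface in which ''model.sample()'' draws a string from the current prior $P_t(\tau)$ and ''fine_tune(model, samples)'' returns the fine-tuned model $P_{t+1}(\tau)$; ''r_star'' is again an upper bound on the reward, ''max_rounds'' is a safety cap not mentioned in the text, and reusing the same uniform draw for the $\beta$-level check is just one reading of the stopping condition.

<code python>
import math
import random

def improve_policy(model, reward, alpha, beta, r_star, n_samples, fine_tune,
                   max_rounds=1000):
    """Iteratively fine-tune the prior towards the bounded-rational target.

    Each round draws n_samples strings at inverse temperature alpha; if all
    of them also pass the acceptance test at the target beta, we stop,
    otherwise we fine-tune the prior on the samples and repeat.
    """
    for _ in range(max_rounds):
        samples, all_pass_beta = [], True
        for _ in range(n_samples):
            # Rejection-sample one string at the easier temperature alpha.
            while True:
                tau = model.sample()              # tau ~ P_t(tau)
                u = random.random()
                if u <= math.exp(alpha * (reward(tau) - r_star)):
                    break
            samples.append(tau)
            # Stopping check: does the same draw pass at the target beta?
            if u > math.exp(beta * (reward(tau) - r_star)):
                all_pass_beta = False
        if all_pass_beta:
            return model                          # target distribution reached
        model = fine_tune(model, samples)         # yields P_{t+1}(tau)
    return model
</code>

Presumably $\alpha < \beta$, so each round's acceptance test is easier than the target one and fine-tuning moves the prior a small step towards the $\beta$-optimal policy.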
===== Adding a user-provided context =====
  
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, we often want policies that respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from $P(\tau|c)$, not from $P(\tau)$. The problem is that the contexts $c$ are not generated in a way that conforms to the reward function, so the training procedure above will bias the model.
  
  
  
==== Enter memory-constrained agents ====
  
\[