Because the optimal policy is a distribution, acting optimally means obtaining a sample. There are many ways to do this, but the most straightforward is **rejection sampling**, which works as follows (see the sketch after the list):
  
  * Generate a string $\tau \sim P(\tau)$.
  * Generate a uniform random variate $u \sim U(0, 1)$.
  * If $u \leq \exp(\beta R(\tau) - \beta R^\ast)$ return $\tau$; otherwise reject $\tau$ and start over from the first step.
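
For concreteness, here is a minimal Python sketch of this sampler. It assumes user-supplied callables ''sample_prior'' (drawing a string from $P(\tau)$) and ''reward'' (computing $R(\tau)$), and it uses ''r_max'' as an upper bound on the reward in place of $R^\ast$; these names are placeholders for illustration, not part of the text above.

<code python>
import math
import random

def sample_bounded_rational(sample_prior, reward, beta, r_max, max_tries=100_000):
    """Draw one string from the bounded-rational policy via rejection sampling.

    sample_prior -- callable returning a string tau ~ P(tau) from the base model
    reward       -- callable returning the reward R(tau)
    beta         -- inverse temperature of the bounded-rational policy
    r_max        -- upper bound on R(tau), so that the acceptance probability
                    exp(beta * (R(tau) - r_max)) never exceeds 1
    """
    for _ in range(max_tries):
        tau = sample_prior()                        # tau ~ P(tau)
        u = random.random()                         # u ~ U(0, 1)
        if u <= math.exp(beta * (reward(tau) - r_max)):
            return tau                              # accept
        # otherwise reject tau and draw again
    raise RuntimeError("no sample accepted; beta may be too large")
</code>

Note that small acceptance probabilities (large $\beta$, or high-reward strings being rare under $P(\tau)$) make this sampler slow, which is why the sketch caps the number of tries.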
  - **Repeat**: Set $t \leftarrow t + 1$ and repeat from step 2.
  
The resulting distribution $P_t(\tau)$ is our bounded-rational policy. You will have to experiment with the choices of $\alpha$ (which controls the step size) and $N$ (which controls how faithfully the samples represent the target distribution) to obtain a satisfactory training time.
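
As a rough illustration of where $\alpha$ and $N$ enter the procedure, here is a hypothetical Python sketch of such a training loop: each iteration collects $N$ strings accepted by the rejection test and performs one update of size $\alpha$ that increases their likelihood under the model. The ''model.sample()'' and ''model.fit_step()'' interfaces, and the specific update rule, are assumptions for illustration rather than the exact procedure above.

<code python>
import math
import random

def train_policy(model, reward, beta, r_max, alpha=1e-4, N=256, steps=1000):
    """Hypothetical sketch of an iterative fit to the bounded-rational target.

    model -- assumed to expose sample() (draws a string from the current
             distribution) and fit_step(strings, lr) (one likelihood-raising
             update of the model parameters); both are placeholder interfaces.
    """
    for t in range(steps):                          # iteration counter t
        batch = []
        while len(batch) < N:                       # N accepted samples represent the target
            tau = model.sample()
            if random.random() <= math.exp(beta * (reward(tau) - r_max)):
                batch.append(tau)                   # accepted by the rejection test
        model.fit_step(batch, lr=alpha)             # alpha plays the role of the step size
    return model
</code>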
===== Adding a user-provided context =====
  
The above algorithm generates a new prior $P(\tau)$ which places more weight on desirable strings. However, we often want policies to respond to a user-provided context string $c \in X^\ast$, i.e. we want to sample strings from $P(\tau|c)$, not $P(\tau)$. The problem is that the contexts $c$ are not generated in a way that conforms to the reward function, so the training procedure above will bias the model.
  
  
  
  
==== Enter memory-constrained agents ====
  
\[