belief_flows — created 2023/11/19, last modified 2025/04/08 15:20 by pedroortega
- **Prior:** We place a Gaussian distribution $P(w)$ over the parameters to represent our uncertainty. To simplify the exposition, we assume that the covariance matrix is diagonal, so that \[ P(w) = N(w; \mu, \Sigma) = \prod_n N(w_n; \mu_n, \sigma^2_n), \] where $\mu_n$ and $\sigma^2_n$ are the mean and variance of the $n$-th parameter.
- **Parameter choice:** The learning algorithm now has to choose model parameters to minimize the prediction error. It does so using Thompson sampling, that is, by sampling a parameter vector $\bar{w}$ from the prior distribution: \[ \bar{w} \sim P(w). \]
- **Evaluation of Loss and Local Update:** Once the parameter is chosen, the learning algorithm is given a supervised pair $(x, y)$ that it can use to evaluate the loss $\ell(y, \hat{y})$, where $\hat{y} = F_{\bar{w}}(x)$ is the predicted output. Based on this loss, the learning algorithm can calculate an updated parameter $\bar{w}'$ using SGD: \[ \bar{w}' = \bar{w} - \eta \nabla_{\bar{w}} \ell(y, \hat{y}), \] where $\eta > 0$ is the learning rate.
- **Global Update:** Now, the algorithm has to change its prior beliefs $P(w)$ into posterior beliefs $P'(w)$ that are consistent with the local update it has just witnessed: the whole distribution is moved so that the sampled parameter $\bar{w}$ flows to its updated value $\bar{w}'$.
- If we assume a quadratic error function with uncorrelated coordinates, the global update decouples across coordinates, so that each mean $\mu_n$ and variance $\sigma^2_n$ can be updated independently.
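The loop above can be sketched in a few lines of Python. This is a minimal toy sketch, not the page's exact update: the linear model, the learning rate `eta`, the synthetic target `w_true`, and the simplified global update (shift each mean by the sampled point's SGD step, shrink each variance geometrically) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal Gaussian belief over the weights of a toy linear model y = w . x
mu = np.zeros(2)    # per-coordinate means mu_n
var = np.ones(2)    # per-coordinate variances sigma_n^2
eta = 0.1           # SGD learning rate (illustrative choice)

w_true = np.array([1.5, -0.5])  # synthetic ground truth for the demo

for _ in range(500):
    # Supervised pair (x, y) from a toy data source
    x = rng.normal(size=2)
    y = w_true @ x

    # Parameter choice: Thompson-sample w_bar from the current belief P(w)
    w_bar = rng.normal(mu, np.sqrt(var))

    # Local update: one SGD step on the squared loss l = (y - y_hat)^2 / 2
    y_hat = w_bar @ x
    grad = -(y - y_hat) * x          # gradient of the loss w.r.t. w_bar
    w_bar_new = w_bar - eta * grad

    # Global update (simplified stand-in for the belief-flow rule):
    # translate each mean by the step the sampled point took, and shrink
    # the variance slightly so the belief concentrates over time.
    mu = mu + (w_bar_new - w_bar)
    var = 0.99 * var

print(mu)  # the means should drift toward w_true
```

Because the loss is quadratic and the covariance is diagonal, each coordinate of `mu` and `var` is indeed updated independently here, mirroring the decoupling noted above.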