  
====== Why does every choice come with an entropy tax? ======

> I present a very general derivation that shows how every choice carries an unavoidable "entropy tax," reflecting the hidden cost of shifting from old beliefs to new choices.

//Cite as: Ortega, P.A. “Why does every choice come with a tax?”, Tech Note 3, DAIOS, 2024.//
  
Imagine every choice you make —whether trivial or life-changing— comes with a hidden "tax". This isn't a financial toll but an entropy tax, an inherent cost tied to the mental effort of shifting from what you initially believe (the prior) to what you choose after deliberation (the posterior).
  
This concept lies at the heart of [[https://royalsocietypublishing.org/doi/10.1098/rspa.2012.0683|information-theoretic bounded rationality]], and the tax formula looks like this:
\[
   \text{Choice Tax} \propto \sum_x P(x|d) \log \frac{ P(x|d) }{ P(x) },
\]
where $P(x)$ is the prior probability of choosing $x$ and $P(x|d)$ is its posterior probability after the deliberation $d$.
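To make this concrete, here is a minimal Python sketch (the three options and their probabilities are made up for illustration) that evaluates the tax for a uniform prior and a posterior that strongly favors one option:

<code python>
import math

# Hypothetical example: three options with a uniform prior.
prior     = [1/3, 1/3, 1/3]
# After deliberation, the agent strongly favors the first option.
posterior = [0.8, 0.1, 0.1]

# Choice tax (up to the proportionality constant): KL(posterior || prior).
tax = sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)
print(f"Entropy tax: {tax:.4f} nats")  # grows the further the posterior moves from the prior
</code>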
===== Assumption 1: Temporal progress as conditioning =====
  
First we need to model temporal progress of any kind. We'll go with a "spacetime" representation that is standard in measure theory. This works as follows. We assume that we have a collection of all the possible realizations of a process of interest. This is our sample space $\Omega$. To make things simple, let's assume this set is finite (but potentially huge). We also place a probability distribution $P$ over all the realizations $\omega \in \Omega$.
  
Now, any event –be it a choice, an observation, a thought, etc.– is a subset of $\Omega$. Whenever an event $e \subset \Omega$ occurs, we condition our sample space by $e$. This means that we restrict our focus only on the elements $\omega \in e$ inside the event, and then renormalize our probabilities:
\[
   P(\omega|e) = \frac{ P(\omega) }{ P(e) } \quad \text{for } \omega \in e, \qquad P(\omega|e) = 0 \text{ otherwise.}
\]
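As a small illustration, here is a Python sketch of this conditioning step on a toy sample space (the realizations and probabilities are invented for the example):

<code python>
# Toy sample space Omega with a probability for each realization omega.
P = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

def condition(P, event):
    """Restrict the sample space to the event and renormalize."""
    mass = sum(P[w] for w in event)          # P(e)
    return {w: P[w] / mass for w in event}   # P(omega | e) for omega in e

# An event is just a subset of Omega, e.g. "the process ended in w2 or w3".
e = {"w2", "w3"}
print(condition(P, e))   # w2 -> 0.4, w3 -> 0.6
</code>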
===== Assumption 2: Restrictions on the cost function =====
  
Next, we'll impose constraints on the cost function. We want our cost function to capture efforts that are structurally consistent with the underlying probability space. (Later, we'll see how to relax these assumptions without compromising these structural constraints.) The following requirements are natural:
  
{{ ::cost-axioms.png?nolink |}}
  
  - **Continuity:** The cost function should be a continuous function of the conditional probabilities. Formally, for every pair $a, b$ of events such that $b \subset a$, the conditional cost $C(b|a)$ of bringing about the event $b$ given the event $a$ is a continuous function of the conditional probability $P(b|a)$.
  - **Transitivity:** If there is a sequence of three events, bringing about the last from the first costs as much as doing it in two steps. Formally, for every triplet of events $a, b, c$ such that $c \subset b \subset a$, the cost is additive: $C(c|a) = C(b|a) + C(c|b)$.
  - **Monotonicity:** Events of higher probability are easier to bring about than those of lower probability. Formally, we have for every $a, b, c, d$ such that $b \subset a$ and $d \subset c$, $P(b|a) > P(d|c)$ iff $C(b|a) < C(d|c)$.
These requirements are essentially equivalent to Shannon's axioms for entropy restated in terms of events. As a result, we get that the only cost function that obeys these requirements is the information content:
\[
   C(a|b) = -\beta \log P(a|b),
\]
where $\beta > 0$ is a factor that determines the units of the cost.
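A quick numerical sketch (with made-up conditional probabilities) can confirm that this cost behaves as required: transitivity holds exactly because the logarithm turns the chain rule for nested events into a sum, and higher probabilities yield lower costs:

<code python>
import math

beta = 1.0  # choice of units

def cost(p_conditional):
    """Information content: C = -beta * log P(b|a)."""
    return -beta * math.log(p_conditional)

# Nested events c ⊂ b ⊂ a with some made-up conditional probabilities.
P_b_given_a = 0.5
P_c_given_b = 0.2
P_c_given_a = P_b_given_a * P_c_given_b   # chain rule for nested events

# Transitivity: C(c|a) = C(b|a) + C(c|b).
assert math.isclose(cost(P_c_given_a), cost(P_b_given_a) + cost(P_c_given_b))

# Monotonicity: more probable transitions are cheaper.
assert cost(0.9) < cost(0.1)
</code>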
===== Cost of deliberation =====
  
Now, based on our sketch above, let's calculate the cost of transforming the prior choice probabilities into posterior choice probabilities:
\[
  \begin{align}
    C(d|c)
    &= \sum_x P(x|d) \, C(x \cap d \,|\, x \cap c)
     + \beta \sum_x P(x|d) \log \frac{ P(x|d) }{ P(x|c) }.
  \end{align}
\]
We've obtained two expectation terms. The second is proportional to the Kullback-Leibler divergence of the posterior from the prior choice probabilities. What is the first expectation?
  
The first expectation represents the expected cost of each individual choice (if each choice were to occur deterministically). This is because each term $C(x \cap d|x \cap c)$ measures the cost of transforming the relative probability of a specific choice.
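Here is a numerical sketch of this decomposition on a made-up sample space. Assuming the cost function $C(a|b) = -\beta \log P(a|b)$ derived above, it checks that the expected individual-choice costs plus $\beta$ times the Kullback-Leibler divergence add up to the total cost $C(d|c)$ of the deliberation step:

<code python>
import math

beta = 1.0

# Made-up finite sample space: six realizations with their probabilities.
P = {1: 0.05, 2: 0.15, 3: 0.10, 4: 0.20, 5: 0.30, 6: 0.20}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return sum(P[w] for w in event)

def cost(b, a):
    """Cost of bringing about b ⊂ a: C(b|a) = -beta * log P(b|a)."""
    return -beta * math.log(prob(b) / prob(a))

choices = [{1, 2}, {3, 4}, {5, 6}]   # choices as a partition of the sample space
c = {1, 2, 3, 4, 5, 6}               # prior event (before deliberation)
d = {2, 3, 5}                        # posterior event (after deliberation), d ⊂ c

post  = {i: prob(x & d) / prob(d) for i, x in enumerate(choices)}   # P(x|d)
prior = {i: prob(x & c) / prob(c) for i, x in enumerate(choices)}   # P(x|c)

# First term: expected cost of each individual choice.
term1 = sum(post[i] * cost(x & d, x & c) for i, x in enumerate(choices))
# Second term: beta times KL(posterior || prior) over the choices.
term2 = beta * sum(post[i] * math.log(post[i] / prior[i]) for i in post)

# Together they recover the total cost of the deliberation step, C(d|c).
assert math.isclose(term1 + term2, cost(d, c))
</code>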
  
===== Connecting to the free energy objective =====
We can transform the above equality into a variational principle by replacing the individual choice costs $C(x \cap d|x \cap c)$ with arbitrary numbers. The resulting expression is convex in the posterior choice probabilities $P(x|d)$, so we get a nice and clean objective function with a unique minimum.
  
We can even go a step further: noticing that the variational problem is translationally invariant in the costs, and multiplying the expression by $-1$, we can treat the resulting "negative costs plus a constant" as utilities, obtaining
\[
   \sum_x P(x|d) U(x) - \frac{1}{\beta} \sum_x P(x|d) \log \frac{ P(x|d) }{ P(x|c) }.
\]
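As an illustrative sketch (with made-up utilities, prior, and inverse temperature $\beta$), the following code evaluates this objective and checks numerically that no randomly drawn posterior beats the Gibbs-like solution $P(x|d) \propto P(x|c)\, e^{\beta U(x)}$, which is the standard maximizer of a free energy of this form:

<code python>
import math
import random

beta  = 2.0                      # trade-off between utility and deliberation cost
prior = [0.5, 0.3, 0.2]          # made-up prior choice probabilities P(x|c)
U     = [1.0, 2.0, 0.5]          # made-up utilities U(x)

def objective(post):
    """Free energy: expected utility minus (1/beta) * KL(posterior || prior)."""
    return sum(p * u for p, u in zip(post, U)) - (1 / beta) * sum(
        p * math.log(p / q) for p, q in zip(post, prior) if p > 0
    )

# Candidate solution: the Gibbs posterior P(x|d) ∝ P(x|c) * exp(beta * U(x)).
weights = [q * math.exp(beta * u) for q, u in zip(prior, U)]
gibbs = [w / sum(weights) for w in weights]

# Sanity check: no randomly drawn posterior should score higher than the Gibbs posterior.
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in prior]
    candidate = [r / sum(raw) for r in raw]
    assert objective(candidate) <= objective(gibbs) + 1e-12

print("Gibbs posterior:", [round(p, 3) for p in gibbs])
</code>

Note how the sketch makes the trade-off visible: for large $\beta$ the deliberation term is cheap and the posterior concentrates on the highest-utility choice, while for small $\beta$ it stays close to the prior.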
  
Next time you’re weighing options, remember: even the act of choosing comes with its own price.

===== References =====

  - Our first derivation of the free energy difference is in "Information, Utility, and Bounded Rationality" by Ortega & Braun, Proceedings of the Conference on Artificial General Intelligence, 2011 ([[https://adaptiveagents.org/_media/papers/utilityinfoboundedrationality.pdf|AGI 2011]]).
  - The derivation with probability measures comes from "Information-Theoretic Bounded Rationality" by Ortega et al., 2015, [[https://arxiv.org/pdf/1512.06789|arXiv:1512.06789]].
  