
Fano's inequality

(Thank you to Alex May for catching an error in the first version of this post!)

I had to learn Fano's inequality for a quantum information project, but found the proof as it currently exists on Wikipedia extremely confusing, largely because it uses a lot of jargon about "messages" and "estimates" that, in my opinion, obscures the underlying mathematics. So here's a simple proof. I'll probably try to edit the Wikipedia proof after I post this.

Heuristically speaking, Fano's inequality is a bound on how successfully I can guess the output of a random process $X$ if I can't access $X$ directly but can access a different random process $Y$ that is correlated with $X.$ To state the bound precisely, we'll need a bit of notation.

A finite random variable, henceforth just a "random variable," is a finite set $\mathcal{X}$ together with a map $p_X : \mathcal{X} \rightarrow [0, 1]$ satisfying
$$\sum_{x \in \mathcal{X}} p_X(x) = 1.$$
The function $p_X$ is called the probability distribution of the random variable. We think of the elements of the set as corresponding to different possible outcomes of some process, with $p_X(x)$ denoting the probability of outcome $x.$
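As a concrete sketch (with the outcome labels and probabilities chosen arbitrarily for illustration), a finite random variable is just a finite set of outcomes with a normalized weight attached to each:

```python
# A finite random variable: outcomes together with probabilities summing to one.
# The specific labels and numbers here are arbitrary.
p_X = {"a": 0.5, "b": 0.25, "c": 0.25}

# Check the normalization condition sum_x p_X(x) = 1.
assert abs(sum(p_X.values()) - 1.0) < 1e-12
```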

To discuss correlated events, we need to introduce the idea of a joint distribution over two random variables. We are given two finite sets $\mathcal{X}$ and $\mathcal{Y}$ that index outcomes of a process, together with a map
$$p_{X,Y} : \mathcal{X} \times \mathcal{Y} \rightarrow [0, 1]$$
satisfying
$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y} (x, y) = 1.$$
The marginal distributions are defined by
$$p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)$$
and
$$p_Y(y) = \sum_{x \in \mathcal{X}} p_{X, Y}(x, y).$$
One can check that these are probability distributions on $\mathcal{X}$ and $\mathcal{Y},$ respectively.
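As a quick computational sketch (the joint distribution below is invented purely for this example), the marginals are obtained by summing the joint probabilities over the other variable:

```python
from collections import defaultdict

# An arbitrary joint distribution on X = {0, 1, 2} and Y = {0, 1}, for illustration.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}

# Marginals: sum the joint probabilities over the other variable.
p_X, p_Y = defaultdict(float), defaultdict(float)
for (x, y), p in p_XY.items():
    p_X[x] += p
    p_Y[y] += p

# Each marginal is itself a probability distribution.
assert abs(sum(p_X.values()) - 1.0) < 1e-12
assert abs(sum(p_Y.values()) - 1.0) < 1e-12
```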

Now, suppose we want to guess the outcome of some process whose statistics are governed by a known probability distribution $p_X.$ The system producing the random outcomes is behind some curtain; we aren't actually allowed to look at the outcomes. But every time an outcome is produced, our friend Bob on the other side of the curtain asks us to guess what we think the outcome was. We'd like to find a guessing strategy that results in us being correct as often as possible.

The most naive thing we could do is to set up our own random number generator that produces outcomes according to the probability distribution $p_X$, and every time Bob asks us to guess the outcome of his machine we just tell him what outcome came out of our machine. But this strategy is pretty bad. The joint distribution of our two systems is
$$p_{X_1, X_2}(x_1, x_2) = p_{X}(x_1) p_{X}(x_2),$$
and the probability that we guess Bob's outcome correctly is the diagonal sum
$$\sum_{x_1 = x_2} p_{X_1, X_2}(x_1, x_2) = \sum_{x} p_{X}(x)^2.$$
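Here is a quick sketch checking this diagonal-sum formula against a Monte Carlo simulation; the distribution, random seed, and number of trials are all arbitrary choices made for the example:

```python
import random

# Arbitrary distribution for Bob's machine (and for our independent copy of it).
p_X = {"a": 0.5, "b": 0.25, "c": 0.25}
outcomes, weights = zip(*p_X.items())

# Exact probability that two independent samples agree: sum_x p(x)^2.
exact = sum(p ** 2 for p in p_X.values())

# Monte Carlo estimate of the same quantity.
random.seed(0)
trials = 100_000
hits = sum(
    random.choices(outcomes, weights)[0] == random.choices(outcomes, weights)[0]
    for _ in range(trials)
)
print(exact, hits / trials)  # the two numbers should be close
```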

We can do better if we share some randomness with Bob's system. What this means is that the probability distribution Bob is sampling, $p_{X}(x),$ is actually the marginal of some larger probability distribution $p_{X, Y}(x, y),$ and we're allowed to see the outcomes of system $Y.$ Maybe Bob's system $X$ is connected by a wire to a second system $Y$ on our side of the curtain. We'll come up with some guessing strategy, which is a function
$$f : \mathcal{Y} \rightarrow \mathcal{X},$$
and whenever we see our system produce outcome $y$, we'll guess that Bob's system produced outcome $f(y).$ Fano's inequality is a lower bound on the probability that our guess is wrong, in terms of information-theoretic quantities related to the probability distribution $p_{X, Y}.$
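The post doesn't fix a particular strategy $f$, but for illustration here is one natural choice, guessing the most likely $x$ given the observed $y$, together with its failure probability (the quantity the inequality will constrain); the joint distribution is the same arbitrary example as above:

```python
# Same arbitrary joint distribution on X = {0, 1, 2} and Y = {0, 1} as before.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}
X_vals = {x for x, _ in p_XY}
Y_vals = {y for _, y in p_XY}

# One possible strategy f : Y -> X, guessing the x that maximizes p(x, y).
f = {y: max(X_vals, key=lambda x: p_XY.get((x, y), 0.0)) for y in Y_vals}

# Probability that the guess is wrong: sum of p(x, y) over outcomes with x != f(y).
P_F = sum(p for (x, y), p in p_XY.items() if x != f[y])
print(f, P_F)  # here f maps y=0 to 0 and y=1 to 1, and P_F = 0.35
```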

The specific information-theoretic quantity we will need is the Shannon entropy of a probability distribution, which is defined by
$$H(X) = - \sum_{x \in \mathcal{X}} p_{X}(x) \log p_{X}(x).$$
The entropy of our joint distribution is
$$H(X, Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X, Y} (x, y) \log p_{X, Y} (x, y).$$
We will also need the conditional entropy, which is defined by
$$H(X|Y) = H(X,Y) - H(Y). \tag{1}$$
If you look at an information theory textbook, you'll see a different definition of conditional entropy; it is usually defined by
$$H(X|Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_{Y}(y)},$$
and then prove that this satisfies equation (1), which they will call the "chain rule of entropy." I find starting with equation (1) more intuitive, because that's the version of conditional entropy that generalizes to quantum information theory. Plus, the proof of the "chain rule of entropy" is completely trivial; you just use the identity $\log(a/b) = \log(a) - \log(b).$
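Here is a small numerical sanity check of equation (1) against the textbook definition, using the arbitrary example distribution from the earlier sketches; logarithms are taken base 2 purely as a convention, since the post leaves the base unspecified:

```python
from math import log2
from collections import defaultdict

# Same arbitrary joint distribution as in the earlier sketches.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}

def entropy(dist):
    """Shannon entropy -sum p log p (base 2) of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal on Y.
p_Y = defaultdict(float)
for (x, y), p in p_XY.items():
    p_Y[y] += p

# Conditional entropy via equation (1): H(X|Y) = H(X, Y) - H(Y).
H_via_chain_rule = entropy(p_XY) - entropy(p_Y)

# The textbook definition: -sum_{x,y} p(x, y) log( p(x, y) / p(y) ).
H_textbook = -sum(p * log2(p / p_Y[y]) for (x, y), p in p_XY.items() if p > 0)

# The "chain rule of entropy" says these agree.
assert abs(H_via_chain_rule - H_textbook) < 1e-12
```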

So, here's the proof of Fano's inequality. For bookkeeping purposes, we introduce the set $\mathcal{B} = \{0, 1\},$ and define a probability distribution on $\mathcal{X} \times \mathcal{Y} \times \mathcal{B}$ by
$$p_{X, Y, B}(x, y, b) = \delta_{b, \Pi(x=f(y))} p_{X, Y}(x, y).$$
Here $\Pi(x= f(y))$ is the indicator function that returns one if $x = f(y)$ and zero if $x \neq f(y).$ On the surface, this probability distribution isn't particularly interesting; for fixed $x, y,$ it just slots the value $p_{X,Y}(x,y)$ into either $b=1$ or $b=0$ depending on whether or not we have $x = f(y).$ But thinking about the entropic quantities of this extended distribution turns out to be very useful.

From now on, we'll drop subscripts on all probability distributions, since the arguments should make it clear which distribution we're talking about. This is okay, because you can see by inspection that $p_{X,Y}(x,y)$ is in fact the marginal distribution of $p_{X, Y, B}(x, y, b),$ so there's no ambiguity in what we mean by $p(x, y).$ It will also be helpful to have the explicit expression for the marginal distribution on $\mathcal{Y} \times \mathcal{B},$ which is
$$p(y, b) = \delta_{b, 1} p(f(y), y) + \delta_{b, 0} \sum_{x \neq f(y)} p(x, y). \tag{2}$$
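To make the bookkeeping concrete, here is a sketch that constructs this extended distribution for the same arbitrary example distribution and strategy as before, and checks the marginal in equation (2) against a brute-force marginalization:

```python
from collections import defaultdict

# Same arbitrary joint distribution and guessing strategy as in the earlier sketches.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}
f = {0: 0, 1: 1}  # a guessing strategy f : Y -> X

# The extended distribution p(x, y, b) = delta_{b, [x == f(y)]} p(x, y).
p_XYB = {(x, y, int(x == f[y])): p for (x, y), p in p_XY.items()}

# Marginal on (Y, B), computed by brute force ...
p_YB = defaultdict(float)
for (x, y, b), p in p_XYB.items():
    p_YB[(y, b)] += p

# ... and via equation (2): p(y, 1) = p(f(y), y), p(y, 0) = sum_{x != f(y)} p(x, y).
for y in f:
    assert abs(p_YB[(y, 1)] - p_XY[(f[y], y)]) < 1e-12
    wrong = sum(p for (x, yy), p in p_XY.items() if yy == y and x != f[y])
    assert abs(p_YB[(y, 0)] - wrong) < 1e-12
```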

We start with the conditional entropy
$$H(X|Y) = H(X, Y) - H(Y),$$
which we rewrite as
$$H(X|Y) = [H(X, Y) - H(Y, B)] + [H(Y, B) - H(Y)]. \tag{3}$$
The first term in brackets can be computed explicitly using equation (2) as
$$\begin{split}& - \sum_{x, y} p(x, y) \log p(x, y) \\ & + \sum_{y} p(f(y), y)  \log p(f(y), y) \\ & + \sum_{y} \sum_{x \neq f(y)} p(x, y) \log\left( \sum_{x' \neq f(y)} p(x', y)\right). \end{split}$$
The diagonal piece of the first sum (meaning, the set of terms with $x = f(y)$) cancels with the second sum, leaving us with
$$\sum_{y} \sum_{x \neq f(y)} p(x,y) \log\left( \frac{\sum_{x' \neq f(y)} p(x', y)}{p(x, y)} \right).$$
Because the logarithm is a concave function, we can apply Jensen's inequality, with the weights $p(x, y)$ normalized by their total $\sum_y \sum_{x \neq f(y)} p(x, y),$ to upper bound the above by
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left( \frac{\sum_{y} \sum_{x, x' \neq f(y)} p(x', y)}{\sum_y \sum_{x \neq f(y)} p(x, y)} \right).$$
Inside the logarithm, the numerator's summand has no $x$-dependence, so we can do the $x$ sum directly to obtain
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left(\frac{(|\mathcal{X}| - 1) \sum_{y} \sum_{x' \neq f(y)} p(x', y)}{\sum_y \sum_{x \neq f(y)} p(x, y)} \right).$$
At this point we simply notice that the sum multiplying $|\mathcal{X}| - 1$ in the numerator is identical to the sum in the denominator, so the two cancel and we obtain the upper bound
$$H(X, Y) - H(Y, B) \leq \left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log(|\mathcal{X}| - 1).$$
The coefficient $\left(\sum_{y} \sum_{x \neq f(y)} p(x, y) \right)$ has a useful physical interpretation — it is the probability our guessing strategy fails! It's the sum over $p(x, y)$ for all outcomes where our guess is incorrect. So we'll call this quantity $P_{F},$ and write our upper bound as
$$H(X, Y) - H(Y, B) \leq P_F \log(|\mathcal{X}| - 1).$$
Plugging this back into equation (3) gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + [H(Y, B) - H(Y)].$$
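As a numerical spot check of the bound on the first bracketed term, here is a sketch evaluating both sides on the arbitrary example distribution and strategy from the earlier sketches (the particular numbers don't matter):

```python
from math import log2
from collections import defaultdict

# Same arbitrary joint distribution and strategy as in the earlier sketches.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}
f = {0: 0, 1: 1}
X_size = len({x for x, _ in p_XY})  # |X| = 3

def entropy(dist):
    """Shannon entropy (base 2) of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Build the marginal p(y, b) of the extended distribution.
p_YB = defaultdict(float)
for (x, y), p in p_XY.items():
    p_YB[(y, int(x == f[y]))] += p

# Failure probability P_F.
P_F = sum(p for (x, y), p in p_XY.items() if x != f[y])

# Check H(X, Y) - H(Y, B) <= P_F log(|X| - 1).
assert entropy(p_XY) - entropy(p_YB) <= P_F * log2(X_size - 1) + 1e-12
```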

To bound the second term, we use the positivity of mutual information, which states
$$H(B) + H(Y) - H(Y, B) \geq 0.$$
This gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H(B).$$
The marginal distribution on $B$ is given by
$$p(0) = \sum_{y} \sum_{x \neq f(y)} p(x, y),$$
$$p(1) = \sum_{y} p(f(y), y).$$
In other words, the marginal distribution on $B$ is $\{P_F, 1 - P_F\}.$ We will call the entropy of this distribution the binary entropy $H_{b}(P_F).$ Putting all this together gives us Fano's inequality,
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H_{b}(P_F).$$
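Finally, here is a sketch that evaluates both sides of Fano's inequality on the arbitrary example distribution and strategy from the earlier sketches; any joint distribution and any guessing strategy should satisfy the bound:

```python
from math import log2
from collections import defaultdict

# Same arbitrary joint distribution and strategy as in the earlier sketches.
p_XY = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}
f = {0: 0, 1: 1}
X_size = len({x for x, _ in p_XY})  # |X| = 3

def entropy(dist):
    """Shannon entropy (base 2) of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# H(X|Y) = H(X, Y) - H(Y).
p_Y = defaultdict(float)
for (x, y), p in p_XY.items():
    p_Y[y] += p
H_X_given_Y = entropy(p_XY) - entropy(p_Y)

# Failure probability P_F and the binary entropy H_b(P_F).
P_F = sum(p for (x, y), p in p_XY.items() if x != f[y])
H_b = entropy({0: P_F, 1: 1 - P_F})

# Fano's inequality: H(X|Y) <= P_F log(|X| - 1) + H_b(P_F).
assert H_X_given_Y <= P_F * log2(X_size - 1) + H_b + 1e-12
print(H_X_given_Y, P_F * log2(X_size - 1) + H_b)
```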
