Fano's inequality

(Thank you to Alex May for catching an error in the first version of this post!)

I had to learn Fano's inequality for a quantum information project, but found the proof as it currently exists on Wikipedia extremely confusing, largely because it uses a lot of jargon about "messages" and "estimates" that, in my opinion, obscures the underlying mathematics. So here's a simple proof. I'll probably try to edit the Wikipedia proof after I post this.

Heuristically speaking, Fano's inequality is a bound on how successfully I can guess the output of a random process $X$ if I can't access $X$ directly but can access a different random process $Y$ that is correlated with $X.$ To state the bound precisely, we'll need a bit of notation.

A finite random variable, henceforth just a "random variable," is a finite set $\mathcal{X}$ together with a map $p_X : \mathcal{X} \rightarrow [0, 1]$ satisfying
$$\sum_{x \in \mathcal{X}} p_X(x) = 1.$$
The function $p_X$ is called the probability distribution of the random variable. We think of the elements of the set as corresponding to different possible outcomes of some process, with $p_X(x)$ denoting the probability of outcome $x.$

To discuss correlated events, we need to introduce the idea of a joint distribution over two random variables. We are given two finite sets $\mathcal{X}$ and $\mathcal{Y}$ that index outcomes of a process, together with a map
$$p_{X,Y} : \mathcal{X} \times \mathcal{Y} \rightarrow [0, 1]$$
satisfying
$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y} (x, y) = 1.$$
The marginal distributions are defined by
$$p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)$$
and
$$p_Y(y) = \sum_{x \in \mathcal{X}} p_{X, Y}(x, y).$$
One can check that these are probability distributions on $\mathcal{X}$ and $\mathcal{Y},$ respectively.
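
To make these definitions concrete, here is a minimal Python sketch. The joint distribution below is a made-up example of mine, not anything special; the point is just to show the marginals being computed and checked.

```python
# A toy joint distribution p_{X,Y} on X = {'a', 'b'} and Y = {0, 1}.
# The specific numbers are arbitrary; they just need to sum to 1.
p_XY = {('a', 0): 0.3, ('a', 1): 0.2,
        ('b', 0): 0.1, ('b', 1): 0.4}

# Marginal distributions, obtained by summing over the other variable.
p_X, p_Y = {}, {}
for (x, y), q in p_XY.items():
    p_X[x] = p_X.get(x, 0.0) + q
    p_Y[y] = p_Y.get(y, 0.0) + q

# Both marginals are genuine probability distributions: they sum to 1.
assert abs(sum(p_X.values()) - 1.0) < 1e-12
assert abs(sum(p_Y.values()) - 1.0) < 1e-12
print(p_X, p_Y)  # roughly {'a': 0.5, 'b': 0.5} and {0: 0.4, 1: 0.6}
```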

Now, suppose we want to guess the outcome of some process whose statistics are governed by a known probability distribution $p_X.$ The system producing the random outcomes is behind some curtain; we aren't actually allowed to look at the outcomes. But every time an outcome is produced, our friend Bob on the other side of the curtain asks us to guess what we think the outcome was. We'd like to find a guessing strategy that results in us being correct as often as possible.

The most naive thing we could do is to set up our own random number generator that produces outcomes according to the probability distribution $p_X$, and every time Bob asks us to guess the outcome of his machine we just tell him what outcome came out of our machine. But this strategy is pretty bad. The joint distribution of our two systems is
$$p_{X_1, X_2}(x_1, x_2) = p_{X}(x_1) p_{X}(x_2),$$
and the probability that we guess Bob's outcome correctly is the diagonal sum
$$\sum_{x_1 = x_2} p_{X_1, X_2}(x_1, x_2) = \sum_{x} p_{X}(x)^2.$$
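
As a quick sanity check on this formula, here is a small sketch (with an arbitrarily chosen distribution) comparing $\sum_x p_X(x)^2$ against a Monte Carlo estimate of how often two independent samples agree.

```python
import random

# An arbitrary example distribution p_X on four outcomes.
outcomes = ['a', 'b', 'c', 'd']
probs = [0.4, 0.3, 0.2, 0.1]

# Exact probability that two independent copies agree: sum_x p(x)^2.
exact = sum(q * q for q in probs)

# Monte Carlo estimate: sample Bob's machine and ours independently and compare.
trials = 200_000
hits = sum(random.choices(outcomes, probs)[0] == random.choices(outcomes, probs)[0]
           for _ in range(trials))
print(exact, hits / trials)  # both close to 0.30
```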

We can do better if we share some randomness with Bob's system. What this means is the following: imagine that the probability distribution Bob is sampling, $p_{X}(x),$ is actually the marginal of some larger probability distribution $p_{X, Y}(x, y),$ and that we're allowed to see the outcomes of system $Y.$ Maybe Bob's system $X$ is connected by a wire to a second system $Y$ on our side of the curtain. We'll come up with some guessing strategy, which is a function
$$f : \mathcal{Y} \rightarrow \mathcal{X},$$
and whenever we see our system produce outcome $y$, we'll guess that Bob's system produced outcome $f(y).$ Fano's inequality is a lower bound on the probability that our guess is wrong, in terms of information-theoretic quantities related to the probability distribution $p_{X, Y}.$

The specific information-theoretic quantity we will need is the Shannon entropy of a probability distribution, which is defined by
$$H(X) = - \sum_{x \in \mathcal{X}} p_{X}(x) \log p_{X}(x).$$
The entropy of our joint distribution is
$$H(X, Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X, Y} (x, y) \log p_{X, Y} (x, y).$$
We will also need the conditional entropy, which is defined by
$$H(X|Y) = H(X,Y) - H(Y). \tag{1}$$
If you look at an information theory textbook, their definition of conditional entropy will be different; they will define it by
$$H(X|Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_{Y}(y)},$$
and then prove that this satisfies equation (1), which they will call the "chain rule of entropy." I find starting with equation (1) more intuitive, because that's the version of conditional entropy that generalizes to quantum information theory. Plus, the proof of the "chain rule of entropy" is completely trivial; you just use the identity $\log(a/b) = \log(a) - \log(b).$
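
Since the rest of the argument leans on equation (1), here is a small numerical sketch (with the same kind of made-up joint distribution as above, and logs taken base 2 for concreteness) that checks the chain rule against the textbook sum.

```python
from math import log2

# An arbitrary example joint distribution p_{X,Y} and its marginal p_Y.
p_XY = {('a', 0): 0.3, ('a', 1): 0.2,
        ('b', 0): 0.1, ('b', 1): 0.4}
p_Y = {}
for (x, y), q in p_XY.items():
    p_Y[y] = p_Y.get(y, 0.0) + q

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# Conditional entropy via equation (1): H(X|Y) = H(X,Y) - H(Y).
H_cond_chain = H(p_XY) - H(p_Y)

# Conditional entropy via the textbook definition.
H_cond_direct = -sum(q * log2(q / p_Y[y]) for (x, y), q in p_XY.items() if q > 0)

assert abs(H_cond_chain - H_cond_direct) < 1e-12
print(H_cond_chain)
```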

So, here's the proof of Fano's inequality. For bookkeeping purposes, we introduce the set $\mathcal{B} = \{0, 1\},$ and define a probability distribution on $\mathcal{X} \times \mathcal{Y} \times \mathcal{B}$ by
$$p_{X, Y, B}(x, y, b) = \delta_{b, \Pi(x=f(y))} p_{X, Y}(x, y).$$
Here $\Pi(x= f(y))$ is the indicator function that returns one if $x = f(y)$ and zero if $x \neq f(y).$ On the surface, this probability distribution isn't particularly interesting; for fixed $x, y,$ it just slots the value $p_{X,Y}(x,y)$ into either $b=1$ or $b=0$ depending on whether or not we have $x = f(y).$ But thinking about the entropic quantities of this extended distribution turns out to be very useful.

From now on, we'll drop subscripts on all probability distributions, since the arguments should make it clear which distribution we're talking about. This is okay, because you can see by inspection that $p_{X,Y}(x,y)$ is in fact the marginal distribution of $p_{X, Y, B}(x, y, b),$ so there's no ambiguity in what we mean by $p(x, y).$ It will also be helpful to have the explicit expression for the marginal distribution on $\mathcal{Y} \times \mathcal{B},$ which is
$$p(y, b) = \delta_{b, 1} p(f(y), y) + \delta_{b, 0} \sum_{x \neq f(y)} p(x, y). \tag{2}$$
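
If it helps to see this bookkeeping explicitly, here is a sketch (reusing the same made-up joint distribution, with an arbitrary guessing function $f$ of my choosing) that builds $p_{X,Y,B}$, confirms that $p_{X,Y}$ is its marginal, and checks equation (2).

```python
# An arbitrary joint distribution and guessing function f: Y -> X.
p_XY = {('a', 0): 0.3, ('a', 1): 0.2,
        ('b', 0): 0.1, ('b', 1): 0.4}
f = {0: 'a', 1: 'b'}

# The extended distribution p_{X,Y,B}: the bit b records whether the guess is right.
p_XYB = {(x, y, int(x == f[y])): q for (x, y), q in p_XY.items()}

# Marginalizing over b recovers p_{X,Y}.
marginal_XY = {}
for (x, y, b), q in p_XYB.items():
    marginal_XY[(x, y)] = marginal_XY.get((x, y), 0.0) + q
assert all(abs(marginal_XY[xy] - q) < 1e-12 for xy, q in p_XY.items())

# The marginal on (y, b) matches equation (2).
for y0 in f:
    b1 = sum(q for (x, y, b), q in p_XYB.items() if y == y0 and b == 1)
    b0 = sum(q for (x, y, b), q in p_XYB.items() if y == y0 and b == 0)
    assert abs(b1 - p_XY[(f[y0], y0)]) < 1e-12
    assert abs(b0 - sum(q for (x, y), q in p_XY.items() if y == y0 and x != f[y0])) < 1e-12
```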

We start with the conditional entropy
$$H(X|Y) = H(X, Y) - H(Y),$$
which we rewrite as
$$H(X|Y) = [H(X, Y) - H(Y, B)] + [H(Y, B) - H(Y)]. \tag{3}$$
The first term in brackets can be computed explicitly using equation (2) as
$$\begin{split}& - \sum_{x, y} p(x, y) \log p(x, y) \\ & + \sum_{y} p(f(y), y)  \log p(f(y), y) \\ & + \sum_{y} \sum_{x \neq f(y)} p(x, y) \log\left( \sum_{x' \neq f(y)} p(x', y)\right). \end{split}$$
The diagonal piece of the first sum, meaning the set of terms with $x = f(y),$ cancels with the second sum, leaving us with
$$\sum_{y} \sum_{x \neq f(y)} p(x,y) \log\left( \frac{\sum_{x' \neq f(y)} p(x', y)}{p(x, y)} \right).$$
Because the logarithm is a concave function, we can apply Jensen's inequality, with the weights $p(x, y)$ normalized by their total $\sum_y \sum_{x \neq f(y)} p(x, y),$ to upper bound the above by
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left( \frac{\sum_{y} \sum_{x, x' \neq f(y)} p(x', y)}{\sum_{y} \sum_{x \neq f(y)} p(x, y)} \right).$$
Inside the logarithm, the numerator's summand has no $x$-dependence, so we can do the $x$ sum directly to obtain
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left( \frac{(|\mathcal{X}| - 1) \sum_{y} \sum_{x' \neq f(y)} p(x', y)}{\sum_{y} \sum_{x \neq f(y)} p(x, y)} \right).$$
The sum multiplying $(|\mathcal{X}| - 1)$ in the numerator is the same as the sum in the denominator, so the two cancel, and we obtain the upper bound
$$H(X, Y) - H(Y, B) \leq \left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log(|\mathcal{X}| - 1).$$
The coefficient $\left(\sum_{y} \sum_{x \neq f(y)} p(x, y) \right)$ has a useful physical interpretation — it is the probability our guessing strategy fails! It's the sum over $p(x, y)$ for all outcomes where our guess is incorrect. So we'll call this quantity $P_{F},$ and write our upper bound as
$$H(X, Y) - H(Y, B) \leq P_F \log(|\mathcal{X}| - 1).$$
Plugging this back into equation (3) gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + [H(Y, B) - H(Y)].$$
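
Before bounding the second bracket, here is a quick numerical sanity check of the bound we just derived on the first bracket, $H(X, Y) - H(Y, B) \leq P_F \log(|\mathcal{X}| - 1).$ The random joint distribution and the maximum-likelihood guessing function below are arbitrary choices of mine.

```python
import random
from math import log2

random.seed(1)

# A random joint distribution on |X| = 4, |Y| = 3, normalized to sum to 1,
# together with the maximum-likelihood guess f(y) = argmax_x p(x, y).
X, Y = range(4), range(3)
w = {(x, y): random.random() for x in X for y in Y}
total = sum(w.values())
p = {xy: v / total for xy, v in w.items()}
f = {y: max(X, key=lambda x: p[(x, y)]) for y in Y}

# Marginal distribution on (y, b), where b = 1 exactly when the guess is correct.
p_YB = {}
for (x, y), q in p.items():
    key = (y, int(x == f[y]))
    p_YB[key] = p_YB.get(key, 0.0) + q

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# The bound on the first bracket: H(X,Y) - H(Y,B) <= P_F log(|X| - 1).
P_F = sum(q for (x, y), q in p.items() if x != f[y])
assert H(p) - H(p_YB) <= P_F * log2(len(X) - 1)
```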

To bound the second term, we use the positivity of mutual information, which states
$$H(B) + H(Y) - H(Y, B) \geq 0.$$
This gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H(B).$$
The marginal distribution on $B$ is given by
$$p(0) = \sum_{y} \sum_{x \neq f(y)} p(x, y),$$
$$p(1) = \sum_{y} p(f(y), y).$$
In other words, the marginal distribution on $B$ is $\{P_F, 1 - P_F\}.$ We will call the entropy of this distribution the binary entropy $H_{b}(P_F).$ Putting all this together gives us Fano's inequality,
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H_{b}(P_F).$$
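
Finally, here is a short end-to-end check of the inequality. The random joint distribution and the "guess the most likely $x$ given $y$" strategy are again arbitrary choices of mine; any choices would work, since the bound holds for every joint distribution and every guessing function.

```python
import random
from math import log2

random.seed(0)

# A random joint distribution on |X| = 5 outcomes and |Y| = 3 outcomes.
X, Y = range(5), range(3)
w = {(x, y): random.random() for x in X for y in Y}
total = sum(w.values())
p = {xy: v / total for xy, v in w.items()}
p_Y = {y: sum(p[(x, y)] for x in X) for y in Y}

# Guessing strategy: given y, guess the most likely x.
f = {y: max(X, key=lambda x: p[(x, y)]) for y in Y}

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# Failure probability, conditional entropy, and binary entropy.
P_F = sum(p[(x, y)] for x in X for y in Y if x != f[y])
H_X_given_Y = H(p) - H(p_Y)
H_b = -P_F * log2(P_F) - (1 - P_F) * log2(1 - P_F)

# Fano's inequality: H(X|Y) <= P_F log(|X| - 1) + H_b(P_F).
assert H_X_given_Y <= P_F * log2(len(X) - 1) + H_b
print(H_X_given_Y, P_F * log2(len(X) - 1) + H_b)
```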
