
Fano's inequality

(Thank you to Alex May for catching an error in the first version of this post!)

I had to learn Fano's inequality for a quantum information project, but found the proof as it currently exists on Wikipedia extremely confusing, largely because it uses a lot of jargon about "messages" and "estimates" that, in my opinion, obscures the underlying mathematics. So here's a simple proof. I'll probably try to edit the Wikipedia proof after I post this.

Heuristically speaking, Fano's inequality is a bound on how successfully I can guess the output of a random process $X$ if I can't access $X$ directly but can access a different random process $Y$ that is correlated with $X$. To state the bound precisely, we'll need a bit of notation.

A finite random variable, henceforth just a "random variable," is a finite set $X$ together with a map $p_X : X \to [0,1]$ satisfying
$$\sum_{x \in X} p_X(x) = 1.$$
The function $p_X$ is called the probability distribution of the random variable. We think of the elements of the set as corresponding to different possible outcomes of some process, with $p_X(x)$ denoting the probability of outcome $x$.
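To make this concrete, here is a minimal Python sketch (the outcome labels and probabilities are made up for illustration) of a random variable as a dictionary mapping outcomes to probabilities:

```python
# A finite random variable: a set of outcomes together with their probabilities.
p_X = {"a": 0.5, "b": 0.3, "c": 0.2}

# The defining condition: the probabilities sum to one.
assert abs(sum(p_X.values()) - 1.0) < 1e-12
```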

To discuss correlated events, we need to introduce the idea of a joint distribution over two random variables. We are given two finite sets $X$ and $Y$ that index outcomes of a process, together with a map
$$p_{X,Y} : X \times Y \to [0,1]$$
satisfying
$$\sum_{x \in X,\, y \in Y} p_{X,Y}(x,y) = 1.$$
The marginal distributions are defined by
$$p_X(x) = \sum_{y \in Y} p_{X,Y}(x,y)$$
and
$$p_Y(y) = \sum_{x \in X} p_{X,Y}(x,y).$$
One can check that these are probability distributions on $X$ and $Y$, respectively.
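In code, taking a marginal just means summing over the argument you want to forget. Here is a small Python sketch, with a joint distribution invented purely for illustration:

```python
# A joint distribution on X x Y, stored as {(x, y): probability}.
p_XY = {
    ("a", 0): 0.3, ("a", 1): 0.1,
    ("b", 0): 0.2, ("b", 1): 0.4,
}

def marginal_X(p_xy):
    """p_X(x) = sum over y of p_{X,Y}(x, y)."""
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x

def marginal_Y(p_xy):
    """p_Y(y) = sum over x of p_{X,Y}(x, y)."""
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

print(marginal_X(p_XY))  # a -> 0.4, b -> 0.6 (up to floating-point rounding)
print(marginal_Y(p_XY))  # 0 -> 0.5, 1 -> 0.5 (up to floating-point rounding)
```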

Now, suppose we want to guess the outcome of some process whose statistics are governed by a known probability distribution $p_X$. The system producing the random outcomes is behind some curtain; we aren't actually allowed to look at the outcomes. But every time an outcome is produced, our friend Bob on the other side of the curtain asks us to guess what we think the outcome was. We'd like to find a guessing strategy that results in us being correct as often as possible.

The most naive thing we could do is to set up our own random number generator that produces outcomes according to the probability distribution $p_X$, and every time Bob asks us to guess the outcome of his machine we just tell him what outcome came out of our machine. But this strategy is pretty bad. The joint distribution of our two systems is
$$p_{X_1, X_2}(x_1, x_2) = p_X(x_1)\, p_X(x_2),$$
and the probability that we guess Bob's outcome correctly is the diagonal sum
$$\sum_{x_1 = x_2} p_{X_1, X_2}(x_1, x_2) = \sum_x p_X(x)^2.$$
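As a quick check of this formula, here is a one-line computation in Python, using a made-up distribution:

```python
# Probability of guessing correctly by sampling our own independent copy of X.
p_X = {"a": 0.5, "b": 0.3, "c": 0.2}

p_correct = sum(p ** 2 for p in p_X.values())
print(p_correct)  # = 0.25 + 0.09 + 0.04 = 0.38 (up to rounding)
```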

We can do better if we share some randomness with Bob's system. What this means is: imagine that the probability distribution Bob is sampling, $p_X(x)$, is actually the marginal of some larger probability distribution $p_{X,Y}(x,y)$, and that we're allowed to see the outcomes of system $Y$. Maybe Bob's system $X$ is connected by a wire to a second system $Y$ on our side of the curtain. We'll come up with some guessing strategy, which is a function
$$f : Y \to X,$$
and whenever we see our system produce outcome $y$, we'll guess that Bob's system produced outcome $f(y)$. Fano's inequality is a lower bound on the probability that our guess is wrong, in terms of information-theoretic quantities related to the probability distribution $p_{X,Y}$.
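To make the setup concrete, here is a small Python sketch that evaluates how often a given strategy fails; the joint distribution and the strategy $f$ are invented purely for illustration:

```python
# A toy joint distribution p_{X,Y} and a guessing strategy f: Y -> X.
p_XY = {
    ("a", 0): 0.40, ("b", 0): 0.10,
    ("a", 1): 0.05, ("b", 1): 0.45,
}
f = {0: "a", 1: "b"}  # when we see y, guess f(y)

# Probability that the guess is wrong: sum of p(x, y) over pairs with x != f(y).
p_wrong = sum(p for (x, y), p in p_XY.items() if x != f[y])
print(p_wrong)  # 0.10 + 0.05 = 0.15 (up to rounding)
```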

The specific information-theoretic quantity we will need is the Shannon entropy of a probability distribution, which is defined by
$$H(X) = -\sum_{x \in X} p_X(x) \log p_X(x).$$
The formula for the entropy of our joint distribution is
$$H(X,Y) = -\sum_{x \in X,\, y \in Y} p_{X,Y}(x,y) \log p_{X,Y}(x,y).$$
We will also need the conditional entropy, which is defined by
$$H(X|Y) = H(X,Y) - H(Y). \tag{1}$$
If you look at an information theory textbook, their definition of conditional entropy will be different; they will define it by
$$H(X|Y) = -\sum_{x \in X,\, y \in Y} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_Y(y)},$$
and then prove that this satisfies equation (1), which they will call the "chain rule of entropy." I find starting with equation (1) more intuitive, because that's the version of conditional entropy that generalizes to quantum information theory. Plus, the proof of the "chain rule of entropy" is completely trivial; you just use the identity $\log(a/b) = \log(a) - \log(b)$.
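Here is a short Python sketch, with an invented toy distribution, that computes the conditional entropy via equation (1) and checks that it agrees with the textbook formula:

```python
from math import log

def H(p):
    """Shannon entropy (in nats) of a distribution {outcome: probability}."""
    return -sum(q * log(q) for q in p.values() if q > 0)

p_XY = {
    ("a", 0): 0.40, ("b", 0): 0.10,
    ("a", 1): 0.05, ("b", 1): 0.45,
}

# Marginal on Y.
p_Y = {}
for (x, y), q in p_XY.items():
    p_Y[y] = p_Y.get(y, 0.0) + q

# Conditional entropy via equation (1): H(X|Y) = H(X,Y) - H(Y).
H_X_given_Y = H(p_XY) - H(p_Y)

# The textbook definition: -sum_{x,y} p(x,y) log( p(x,y) / p(y) ).
H_textbook = -sum(q * log(q / p_Y[y]) for (x, y), q in p_XY.items() if q > 0)

assert abs(H_X_given_Y - H_textbook) < 1e-12
print(H_X_given_Y)
```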

So, here's the proof of Fano's inequality. For bookkeeping purposes, we introduce the set $B = \{0,1\}$, and define a probability distribution on $X \times Y \times B$ by
$$p_{X,Y,B}(x,y,b) = \delta_{b,\,\Pi(x = f(y))}\; p_{X,Y}(x,y).$$
Here $\Pi(x = f(y))$ is the indicator function that returns one if $x = f(y)$ and zero if $x \neq f(y)$. On the surface, this probability distribution isn't particularly interesting; for fixed $x, y$, it just slots the value $p_{X,Y}(x,y)$ into either $b = 1$ or $b = 0$ depending on whether or not we have $x = f(y)$. But thinking about the entropic quantities of this extended distribution turns out to be very useful.

From now on, we'll drop subscripts on all probability distributions, since the arguments should make it clear which distribution we're talking about. This is okay, because you can see by inspection that $p_{X,Y}(x,y)$ is in fact the marginal distribution of $p_{X,Y,B}(x,y,b)$, so there's no ambiguity in what we mean by $p(x,y)$. It will also be helpful to have the explicit expression for the marginal distribution on $Y \times B$, which is
$$p(y,b) = \delta_{b,1}\, p(f(y),y) + \delta_{b,0} \sum_{x \neq f(y)} p(x,y). \tag{2}$$
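Concretely, in a Python sketch (with the same kind of invented toy data as before), building the extended distribution and its marginal on $Y \times B$ looks like this:

```python
# Extend p_{X,Y} to p_{X,Y,B}: b records whether the guess f(y) equals x.
p_XY = {
    ("a", 0): 0.40, ("b", 0): 0.10,
    ("a", 1): 0.05, ("b", 1): 0.45,
}
f = {0: "a", 1: "b"}

p_XYB = {(x, y, 1 if x == f[y] else 0): q for (x, y), q in p_XY.items()}

# Marginal on Y x B, equation (2): the diagonal term p(f(y), y) lands in b = 1,
# everything else lands in b = 0.
p_YB = {}
for (x, y, b), q in p_XYB.items():
    p_YB[(y, b)] = p_YB.get((y, b), 0.0) + q

print(p_YB)  # {(0, 1): 0.4, (0, 0): 0.1, (1, 0): 0.05, (1, 1): 0.45}
```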

We start with the conditional entropy
$$H(X|Y) = H(X,Y) - H(Y),$$
which we rewrite as
$$H(X|Y) = [H(X,Y) - H(Y,B)] + [H(Y,B) - H(Y)]. \tag{3}$$
The first term in brackets can be computed explicitly using equation (2) as
$$-\sum_{x,y} p(x,y) \log p(x,y) + \sum_y p(f(y),y) \log p(f(y),y) + \sum_y \sum_{x \neq f(y)} p(x,y) \log\Big( \sum_{x' \neq f(y)} p(x',y) \Big).$$
The diagonal piece of the first sum (meaning the set of terms with $x = f(y)$) cancels with the second sum, leaving us with
$$\sum_y \sum_{x \neq f(y)} p(x,y) \log\left( \frac{\sum_{x' \neq f(y)} p(x',y)}{p(x,y)} \right).$$
Because the logarithm is a concave function, we can apply Jensen's inequality, using the numbers $p(x,y)$ with $x \neq f(y)$ normalized by their sum as the weights, to upper bound the above by
$$\left( \sum_y \sum_{x \neq f(y)} p(x,y) \right) \log\left( \frac{\sum_y \sum_{x \neq f(y)} \sum_{x' \neq f(y)} p(x',y)}{\sum_y \sum_{x \neq f(y)} p(x,y)} \right).$$
Inside the logarithm, the summands in the numerator have no $x$-dependence, so we can do the $x$ sum directly to obtain
$$\left( \sum_y \sum_{x \neq f(y)} p(x,y) \right) \log\left( \frac{(|X|-1) \sum_y \sum_{x' \neq f(y)} p(x',y)}{\sum_y \sum_{x \neq f(y)} p(x,y)} \right).$$
At this point the sums in the numerator and denominator cancel, and we obtain the upper bound
$$H(X,Y) - H(Y,B) \leq \left( \sum_y \sum_{x \neq f(y)} p(x,y) \right) \log(|X|-1).$$
The coefficient $\sum_y \sum_{x \neq f(y)} p(x,y)$ has a useful physical interpretation: it is the probability that our guessing strategy fails! It's the sum of $p(x,y)$ over all outcomes where our guess is incorrect. So we'll call this quantity $P_F$, and write our upper bound as
$$H(X,Y) - H(Y,B) \leq P_F \log(|X|-1).$$
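As a numerical sanity check on this intermediate bound, here is a Python sketch (the three-outcome distribution and the strategy are invented for illustration) that compares $H(X,Y) - H(Y,B)$ against $P_F \log(|X|-1)$:

```python
from math import log

def H(p):
    """Shannon entropy (in nats) of a distribution {outcome: probability}."""
    return -sum(q * log(q) for q in p.values() if q > 0)

p_XY = {
    ("a", 0): 0.30, ("b", 0): 0.10, ("c", 0): 0.05,
    ("a", 1): 0.05, ("b", 1): 0.40, ("c", 1): 0.10,
}
f = {0: "a", 1: "b"}
X = {x for (x, y) in p_XY}

# Marginal on Y x B.
p_YB = {}
for (x, y), q in p_XY.items():
    b = 1 if x == f[y] else 0
    p_YB[(y, b)] = p_YB.get((y, b), 0.0) + q

P_F = sum(q for (x, y), q in p_XY.items() if x != f[y])

lhs = H(p_XY) - H(p_YB)          # roughly 0.19 nats for these numbers
rhs = P_F * log(len(X) - 1)      # roughly 0.21 nats
assert lhs <= rhs + 1e-12
```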
Plugging this back into equation (3) gives
$$H(X|Y) \leq P_F \log(|X|-1) + [H(Y,B) - H(Y)].$$

To bound the second term, we use the positivity of mutual information, which states
$$H(B) + H(Y) - H(Y,B) \geq 0.$$
This gives
$$H(X|Y) \leq P_F \log(|X|-1) + H(B).$$
The marginal distribution on $B$ is given by
$$p(0) = \sum_y \sum_{x \neq f(y)} p(x,y),$$
$$p(1) = \sum_y p(f(y),y).$$
In other words, the marginal distribution on $B$ is $\{P_F, 1 - P_F\}$. We will call the entropy of this distribution the binary entropy $H_b(P_F)$. Putting all this together gives us Fano's inequality,
$$H(X|Y) \leq P_F \log(|X|-1) + H_b(P_F).$$
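Finally, here is a small Python sketch (again with an invented joint distribution and strategy) that evaluates both sides of Fano's inequality:

```python
from math import log

def H(p):
    """Shannon entropy (in nats) of a distribution {outcome: probability}."""
    return -sum(q * log(q) for q in p.values() if q > 0)

# Toy joint distribution on X = {a, b, c}, Y = {0, 1}, and a strategy f: Y -> X.
p_XY = {
    ("a", 0): 0.30, ("b", 0): 0.10, ("c", 0): 0.05,
    ("a", 1): 0.05, ("b", 1): 0.40, ("c", 1): 0.10,
}
f = {0: "a", 1: "b"}
X = {x for (x, y) in p_XY}

# Marginal on Y and the failure probability P_F.
p_Y = {}
for (x, y), q in p_XY.items():
    p_Y[y] = p_Y.get(y, 0.0) + q
P_F = sum(q for (x, y), q in p_XY.items() if x != f[y])

H_X_given_Y = H(p_XY) - H(p_Y)              # left-hand side, roughly 0.80 nats
H_b = H({0: P_F, 1: 1 - P_F})               # binary entropy of P_F
bound = P_F * log(len(X) - 1) + H_b         # right-hand side, roughly 0.82 nats

assert H_X_given_Y <= bound + 1e-12
```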
