(Thank you to Alex May for catching an error in the first version of this post!)
I had to learn Fano's inequality for a quantum information project, but found the proof as it currently exists on Wikipedia extremely confusing, largely because it uses a lot of jargon about "messages" and "estimates" that, in my opinion, obscures the underlying mathematics. So here's a simple proof. I'll probably try to edit the Wikipedia proof after I post this.
Heuristically speaking, Fano's inequality is a bound on how successfully I can guess the output of a random process when I can't access that process directly, but can access a different random process that is correlated with it. To state the bound precisely, we'll need a bit of notation.
A finite random variable, henceforth just a "random variable," is a finite set $A$ together with a map $p_A : A \to [0, 1]$ satisfying
$$\sum_{a \in A} p_A(a) = 1.$$
The function $p_A$ is called the probability distribution of the random variable. We think of the elements of the set $A$ as corresponding to different possible outcomes of some process, with $p_A(a)$ denoting the probability of outcome $a$.
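For concreteness, a biased coin fits this definition: take the two-element set
$$A = \{\text{heads}, \text{tails}\}, \qquad p_A(\text{heads}) = \tfrac{2}{3}, \qquad p_A(\text{tails}) = \tfrac{1}{3},$$
whose probabilities sum to one as required. (The specific numbers are just an illustration.)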
To discuss correlated events, we need to introduce the idea of a joint distribution over two random variables. We are given two finite sets $A$ and $B$ that index outcomes of a process, together with a map
$$p_{AB} : A \times B \to [0, 1]$$
satisfying
$$\sum_{a \in A} \sum_{b \in B} p_{AB}(a, b) = 1.$$
The marginal distributions are defined by
$$p_A(a) = \sum_{b \in B} p_{AB}(a, b)$$
and
$$p_B(b) = \sum_{a \in A} p_{AB}(a, b).$$
One can check that these are probability distributions on $A$ and $B$, respectively.
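As a toy example, with numbers chosen purely for illustration, take $A = B = \{0, 1\}$ and
$$p_{AB}(0,0) = \tfrac{1}{2}, \qquad p_{AB}(0,1) = \tfrac{1}{8}, \qquad p_{AB}(1,0) = \tfrac{1}{8}, \qquad p_{AB}(1,1) = \tfrac{1}{4}.$$
The marginals are then $p_A(0) = p_B(0) = 5/8$ and $p_A(1) = p_B(1) = 3/8$, and each marginal indeed sums to one. We'll reuse this little distribution at the end of the post as a sanity check.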
Now, suppose we want to guess the outcome of some process whose statistics are governed by a known probability distribution $p_B$. The system producing the random outcomes is behind some curtain; we aren't actually allowed to look at the outcomes. But every time an outcome is produced, our friend Bob on the other side of the curtain asks us to guess what we think the outcome was. We'd like to find a guessing strategy that results in us being correct as often as possible.
The most naive thing we could do is to set up our own random number generator that produces outcomes according to the same probability distribution $p_B$, and every time Bob asks us to guess the outcome of his machine we just tell him what outcome came out of our machine. But this strategy is pretty bad. The joint distribution of our two systems is
$$p(b, b') = p_B(b)\, p_B(b'),$$
and the probability that we guess Bob's outcome correctly is the diagonal sum
$$\sum_{b \in B} p_B(b)^2.$$
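For instance, just to put a number on how bad this is: if Bob's distribution is uniform over the $|B|$ outcomes in $B$, the diagonal sum is
$$\sum_{b \in B} \left( \frac{1}{|B|} \right)^2 = \frac{1}{|B|},$$
so independent guessing succeeds only one time in $|B|$ on average.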
We can do better if we share some randomness with Bob's system. What this means is: imagine that the probability distribution $p_B$ Bob is sampling is actually the marginal of some larger probability distribution $p_{AB}$, and we're allowed to see the outcomes of system $A$. Maybe Bob's system is connected by a wire to a second system on our side of the curtain. We'll come up with some guessing strategy, which is a function
$$f : A \to B,$$
and whenever we see our system produce outcome $a$, we'll guess that Bob's system produced outcome $f(a)$. Fano's inequality is a lower bound on the probability that our guess is wrong, in terms of information-theoretic quantities related to the probability distribution $p_{AB}$.
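One natural choice of strategy, just as an example, is to guess the outcome of Bob's system that is most probable given what we saw,
$$f(a) = \operatorname*{arg\,max}_{b \in B}\, p_{AB}(a, b),$$
but nothing in what follows depends on this choice: the bound holds for any function $f$.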
The specific information-theoretic quantity we will need is the Shannon entropy of a probability distribution $p_A$, which is defined by
$$H(A) = -\sum_{a \in A} p_A(a) \log p_A(a).$$
The formula for the entropy of our joint distribution is
$$H(AB) = -\sum_{a \in A} \sum_{b \in B} p_{AB}(a, b) \log p_{AB}(a, b).$$
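For the biased coin from earlier, for example, this gives
$$H(A) = -\tfrac{2}{3} \log \tfrac{2}{3} - \tfrac{1}{3} \log \tfrac{1}{3} = \log 3 - \tfrac{2}{3} \log 2,$$
which is about $0.92$ bits if we take logarithms base 2.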
We will also need the conditional entropy, which is defined by
$$H(B|A) = H(AB) - H(A). \tag{1}$$
If you look at an information theory textbook, their definition of conditional entropy will be different; they will define it by
$$H(B|A) = -\sum_{a \in A} \sum_{b \in B} p_{AB}(a, b) \log \frac{p_{AB}(a, b)}{p_A(a)},$$
and then prove that this satisfies equation (1), which they will call the "chain rule of entropy." I find starting with equation (1) more intuitive, because that's the version of conditional entropy that generalizes to quantum information theory. Plus, the proof of the "chain rule of entropy" is completely trivial; you just use the identity
$$\log \frac{p_{AB}(a, b)}{p_A(a)} = \log p_{AB}(a, b) - \log p_A(a).$$
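Spelled out, the check takes one line:
$$-\sum_{a, b} p_{AB}(a, b) \log \frac{p_{AB}(a, b)}{p_A(a)} = -\sum_{a, b} p_{AB}(a, b) \log p_{AB}(a, b) + \sum_{a, b} p_{AB}(a, b) \log p_A(a) = H(AB) - H(A),$$
where the last equality uses $\sum_{b} p_{AB}(a, b) = p_A(a)$ to collapse the second sum into $-H(A)$.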
So, here's the proof of Fano's inequality. For bookkeeping purposes, we introduce the set $X = \{0, 1\}$ and define a probability distribution on $X \times A \times B$ by
$$p_{XAB}(x, a, b) = p_{AB}(a, b)\, \delta_{x, \chi(a, b)}.$$
Here $\delta$ is the Kronecker delta, and $\chi(a, b)$ is the indicator function that returns one if $f(a) = b$ and zero if $f(a) \neq b$. On the surface, this probability distribution isn't particularly interesting; for fixed $(a, b)$, it just slots the value $p_{AB}(a, b)$ into either $x = 1$ or $x = 0$, depending on whether or not we have $f(a) = b$. But thinking about the entropic quantities of this extended distribution turns out to be very useful.
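In the toy example from earlier, with $f$ taken to be the identity map (again, purely for illustration), the extended distribution has nonzero entries
$$p_{XAB}(1, 0, 0) = \tfrac{1}{2}, \qquad p_{XAB}(0, 0, 1) = \tfrac{1}{8}, \qquad p_{XAB}(0, 1, 0) = \tfrac{1}{8}, \qquad p_{XAB}(1, 1, 1) = \tfrac{1}{4},$$
and all other entries vanish: the probabilities of outcomes where the guess would be correct get slotted into $x = 1$, and the rest into $x = 0$.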
From now on, we'll drop subscripts on all probability distributions, since the arguments should make it clear which distribution we're talking about. This is okay, because you can see by inspection that $p_{AB}$ is in fact the marginal distribution of $p_{XAB}$ on $A \times B$, so there's no ambiguity in what we mean by $p(a, b)$. It will also be helpful to have the explicit expression for the marginal distribution on $X \times A$, which is
$$p(x, a) = \sum_{b \in B} p(x, a, b) = \begin{cases} p(a, f(a)) & x = 1, \\ \sum_{b \neq f(a)} p(a, b) & x = 0. \end{cases} \tag{2}$$
We start with the conditional entropy
$$H(B|A) = H(AB) - H(A),$$
which we rewrite as
$$H(B|A) = \left[ H(XAB) - H(XA) \right] + \left[ H(XA) - H(A) \right]. \tag{3}$$
Here we have used the fact that $H(XAB) = H(AB)$: the nonzero entries of $p_{XAB}$ are exactly the nonzero entries of $p_{AB}$, just redistributed over the values of $x$, so the two distributions have the same entropy.
The first term in brackets can be computed explicitly using equation (2) as
$$H(XAB) - H(XA) = -\sum_{a \in A} \sum_{b \in B} p(a, b) \log p(a, b) + \sum_{a \in A} p(a, f(a)) \log p(a, f(a)) + \sum_{a \in A} \left( \sum_{b \neq f(a)} p(a, b) \right) \log \left( \sum_{b' \neq f(a)} p(a, b') \right).$$
The diagonal piece of the first sum (meaning, the set of terms with $b = f(a)$) cancels with the second sum, leaving us with
$$H(XAB) - H(XA) = \sum_{a \in A} \sum_{b \neq f(a)} p(a, b) \log \frac{\sum_{b' \neq f(a)} p(a, b')}{p(a, b)}.$$
Because the logarithm is a concave function, we can apply Jensen's inequality in its finite, weighted form, taking as weights the numbers $p(a, b) / \left( \sum_{a' \in A} \sum_{b'' \neq f(a')} p(a', b'') \right)$, which sum to one. This upper bounds the above by
$$H(XAB) - H(XA) \leq \left( \sum_{a \in A} \sum_{b \neq f(a)} p(a, b) \right) \log \left( \frac{\sum_{a \in A} \sum_{b \neq f(a)} \sum_{b' \neq f(a)} p(a, b')}{\sum_{a \in A} \sum_{b \neq f(a)} p(a, b)} \right).$$
Inside the logarithm, the summands of the $b$ sum in the numerator have no $b$-dependence, so we can do that sum directly; it just produces a factor of $|B| - 1$, giving
$$H(XAB) - H(XA) \leq \left( \sum_{a \in A} \sum_{b \neq f(a)} p(a, b) \right) \log \left( (|B| - 1)\, \frac{\sum_{a \in A} \sum_{b' \neq f(a)} p(a, b')}{\sum_{a \in A} \sum_{b \neq f(a)} p(a, b)} \right).$$
At this point the remaining sum in the numerator is identical to the sum in the denominator, so the ratio is just one, and we obtain the upper bound
$$H(XAB) - H(XA) \leq \left( \sum_{a \in A} \sum_{b \neq f(a)} p(a, b) \right) \log (|B| - 1).$$
The coefficient in parentheses has a useful physical interpretation: it is the probability that our guessing strategy fails! It's the sum of $p(a, b)$ over all outcomes for which our guess is incorrect. So we'll call this quantity $P_e$ and write our upper bound as
$$H(XAB) - H(XA) \leq P_e \log(|B| - 1).$$
Plugging this back into equation (3) gives
$$H(B|A) \leq P_e \log(|B| - 1) + \left[ H(XA) - H(A) \right].$$
To bound the second term, we use the positivity of mutual information, which states
$$H(X) + H(A) - H(XA) \geq 0.$$
This gives
$$H(B|A) \leq P_e \log(|B| - 1) + H(X).$$
The marginal distribution on $X$ is given by
$$p(1) = \sum_{a \in A} p(a, f(a)) = 1 - P_e, \qquad p(0) = \sum_{a \in A} \sum_{b \neq f(a)} p(a, b) = P_e.$$
In other words, the marginal distribution on $X$ is $(P_e, 1 - P_e)$. We will call the entropy of this distribution the binary entropy,
$$h(P_e) = -P_e \log P_e - (1 - P_e) \log(1 - P_e).$$
Putting all this together gives us Fano's inequality,
$$H(B|A) \leq h(P_e) + P_e \log(|B| - 1).$$
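As a quick sanity check, let's plug in the toy distribution from earlier, with $f$ the identity map and logarithms taken base 2. There $P_e = p(0,1) + p(1,0) = 1/4$, and since $|B| = 2$ the second term vanishes, so Fano's inequality reads $H(B|A) \leq h(1/4)$. Computing both sides directly,
$$H(B|A) = H(AB) - H(A) = 1.75 - 0.954\ldots \approx 0.80 \text{ bits}, \qquad h(\tfrac{1}{4}) = \tfrac{1}{4}\log_2 4 + \tfrac{3}{4}\log_2 \tfrac{4}{3} \approx 0.81 \text{ bits},$$
so the bound holds, and in this example it is nearly saturated.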