(Thank you to Alex May for catching an error in the first version of this post!)
I had to learn Fano's inequality for a quantum information project, but found the proof as it currently exists on Wikipedia extremely confusing, largely because it uses a lot of jargon about "messages" and "estimates" that, in my opinion, obscures the underlying mathematics. So here's a simple proof. I'll probably try to edit the Wikipedia proof after I post this.
Heuristically speaking, Fano's inequality is a bound on how successfully I can guess the output of a random process $X$ if I can't access $X$ directly but can access a different random process $Y$ that is correlated with $X.$ To state the bound precisely, we'll need a bit of notation.
A finite random variable, henceforth just a "random variable," is a finite set $\mathcal{X}$ together with a map $p_X : \mathcal{X} \rightarrow [0, 1]$ satisfying
$$\sum_{x \in \mathcal{X}} p_X(x) = 1.$$
The function $p_X$ is called the probability distribution of the random variable. We think of the elements of the set as corresponding to different possible outcomes of some process, with $p_X(x)$ denoting the probability of outcome $x.$
To discuss correlated events, we need to introduce the idea of a joint distribution over two random variables. We are given two finite sets $\mathcal{X}$ and $\mathcal{Y}$ that index outcomes of a process, together with a map
$$p_{X,Y} : \mathcal{X} \times \mathcal{Y} \rightarrow [0, 1]$$
satisfying
$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y} (x, y) = 1.$$
The marginal distributions are defined by
$$p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)$$
and
$$p_Y(y) = \sum_{x \in \mathcal{X}} p_{X, Y}(x, y).$$
One can check that these are probability distributions on $\mathcal{X}$ and $\mathcal{Y},$ respectively.
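If it helps to see these definitions in code, here is a minimal Python sketch that stores a joint distribution as a dictionary keyed by pairs $(x, y)$ and computes both marginals; the outcomes and probabilities are arbitrary, chosen only for illustration.

```python
from collections import defaultdict

# A joint distribution p_{X,Y}, stored as {(x, y): probability}.
# The outcomes and numbers are arbitrary; they just need to sum to 1.
p_xy = {
    ("a", 0): 0.3, ("a", 1): 0.1,
    ("b", 0): 0.2, ("b", 1): 0.4,
}
assert abs(sum(p_xy.values()) - 1.0) < 1e-12  # normalization

def marginals(p_xy):
    """Return the marginal distributions p_X and p_Y of a joint p_{X,Y}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)

p_x, p_y = marginals(p_xy)
print(p_x, p_y)  # p_X = {a: 0.4, b: 0.6}, p_Y = {0: 0.5, 1: 0.5}, up to rounding
```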
Now, suppose we want to guess the outcome of some process whose statistics are governed by a known probability distribution $p_X.$ The system producing the random outcomes is behind some curtain; we aren't actually allowed to look at the outcomes. But every time an outcome is produced, our friend Bob on the other side of the curtain asks us to guess what we think the outcome was. We'd like to find a guessing strategy that results in us being correct as often as possible.
The most naive thing we could do is to set up our own random number generator that produces outcomes according to the probability distribution $p_X$, and every time Bob asks us to guess the outcome of his machine we just tell him what outcome came out of our machine. But this strategy is pretty bad. The joint distribution of our two systems is
$$p_{X_1, X_2}(x_1, x_2) = p_{X}(x_1) p_{X}(x_2),$$
and the probability that we guess Bob's outcome correctly is the diagonal sum
$$\sum_{x_1 = x_2} p_{X_1, X_2}(x_1, x_2) = \sum_{x} p_{X}(x)^2.$$
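Here is a quick numerical sanity check of that diagonal-sum formula: a minimal Python sketch with an arbitrary three-outcome distribution, comparing the exact value $\sum_x p_X(x)^2$ to a Monte Carlo estimate in which both machines sample independently from $p_X.$

```python
import random

# An arbitrary three-outcome distribution p_X, for illustration.
p_x = {"red": 0.5, "green": 0.3, "blue": 0.2}
outcomes, weights = zip(*p_x.items())

# Exact success probability of the naive strategy: the diagonal sum.
exact = sum(p**2 for p in p_x.values())  # 0.25 + 0.09 + 0.04 = 0.38

# Monte Carlo estimate: Bob's machine and ours sample independently from p_X.
trials = 200_000
hits = sum(
    random.choices(outcomes, weights)[0] == random.choices(outcomes, weights)[0]
    for _ in range(trials)
)
print(exact, hits / trials)  # the two numbers should be close
```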
We can do better if we share some randomness with Bob's system. What this means is: imagine that the probability distribution Bob is sampling from, $p_{X}(x),$ is actually the marginal of some larger probability distribution $p_{X, Y}(x, y),$ and that we're allowed to see the outcomes of system $Y.$ Maybe Bob's system $X$ is connected by a wire to a second system $Y$ on our side of the curtain. We'll come up with some guessing strategy, which is a function
$$f : \mathcal{Y} \rightarrow \mathcal{X},$$
and whenever we see our system produce outcome $y$, we'll guess that Bob's system produced outcome $f(y).$ Fano's inequality is a lower bound on the probability that our guess is wrong, in terms of information-theoretic quantities related to the probability distribution $p_{X, Y}.$
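To make the setup concrete, here is a minimal Python sketch with an arbitrary joint distribution, using "guess the most likely $x$ given $y$" as one possible choice of $f$; nothing in the proof depends on which $f$ we pick.

```python
# An arbitrary joint distribution p_{X,Y}, stored as {(x, y): probability}.
p_xy = {
    (0, "u"): 0.35, (1, "u"): 0.05, (2, "u"): 0.10,
    (0, "v"): 0.05, (1, "v"): 0.30, (2, "v"): 0.15,
}

def map_guess(p_xy):
    """One possible strategy f: for each y, guess the x that maximizes p(x, y)."""
    best = {}
    for (x, y), p in p_xy.items():
        if y not in best or p > best[y][1]:
            best[y] = (x, p)
    return {y: x for y, (x, _) in best.items()}

f = map_guess(p_xy)  # here f = {u: 0, v: 1}
p_fail = sum(p for (x, y), p in p_xy.items() if f[y] != x)
print(f, p_fail)     # probability of a wrong guess: 0.35, up to rounding
```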
The specific information-theoretic quantity we will need is the Shannon entropy of a probability distribution, which is defined by
$$H(X) = - \sum_{x \in \mathcal{X}} p_{X}(x) \log p_{X}(x).$$
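In code, with the distribution stored as a dictionary, this is a one-liner; the sketch below uses natural logarithms, and the choice of base only fixes the units of entropy.

```python
import math

def shannon_entropy(p):
    """H(X) = -sum_x p(x) log p(x), with 0 log 0 taken to be 0."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

# A fair coin has entropy log 2 (in nats, since math.log is the natural log).
print(shannon_entropy({"heads": 0.5, "tails": 0.5}), math.log(2))
```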
The entropy of our joint distribution is
$$H(X, Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X, Y} (x, y) \log p_{X, Y} (x, y).$$
We will also need the conditional entropy, which is defined by
$$H(X|Y) = H(X,Y) - H(Y). \tag{1}$$
If you look at an information theory textbook, their definition of conditional entropy will be different; they will define it by
$$H(X|Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_{Y}(y)},$$
and then prove that this satisfies equation (1), which they will call the "chain rule of entropy." I find starting with equation (1) more intuitive, because that's the version of conditional entropy that generalizes to quantum information theory. Plus, the proof of the "chain rule of entropy" is completely trivial; you just use the identity $\log(a/b) = \log(a) - \log(b).$
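Written out, using that identity together with $\sum_{x} p_{X,Y}(x, y) = p_Y(y)$:
$$\begin{split} - \sum_{x, y} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_{Y}(y)} &= - \sum_{x, y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) + \sum_{x, y} p_{X,Y}(x, y) \log p_{Y}(y) \\ &= H(X, Y) + \sum_{y} p_{Y}(y) \log p_{Y}(y) \\ &= H(X, Y) - H(Y). \end{split}$$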
So, here's the proof of Fano's inequality. For bookkeeping purposes, we introduce the set $\mathcal{B} = \{0, 1\},$ and define a probability distribution on $\mathcal{X} \times \mathcal{Y} \times \mathcal{B}$ by
$$p_{X, Y, B}(x, y, b) = \delta_{b, \Pi(x=f(y))} p_{X, Y}(x, y).$$
Here $\Pi(x= f(y))$ is the indicator function that returns one if $x = f(y)$ and zero if $x \neq f(y).$ On the surface, this probability distribution isn't particularly interesting; for fixed $x, y,$ it just slots the value $p_{X,Y}(x,y)$ into either $b=1$ or $b=0$ depending on whether or not we have $x = f(y).$ But thinking about the entropic quantities of this extended distribution turns out to be very useful.
From now on, we'll drop subscripts on all probability distributions, since the arguments should make it clear which distribution we're talking about. This is okay, because you can see by inspection that $p_{X,Y}(x,y)$ is in fact the marginal distribution of $p_{X, Y, B}(x, y, b),$ so there's no ambiguity in what we mean by $p(x, y).$ It will also be helpful to have the explicit expression for the marginal distribution on $\mathcal{Y} \times \mathcal{B},$ which is
$$p(y, b) = \delta_{b, 1} p(f(y), y) + \delta_{b, 0} \sum_{x \neq f(y)} p(x, y). \tag{2}$$
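If you like, you can check this bookkeeping in code. The sketch below builds $p_{X, Y, B}$ from a placeholder joint distribution and guessing function, keeping only the nonzero entries, and verifies that summing over $x$ reproduces equation (2).

```python
from collections import defaultdict

# Placeholder joint distribution and guessing function, for illustration.
p_xy = {(0, "u"): 0.35, (1, "u"): 0.05, (2, "u"): 0.10,
        (0, "v"): 0.05, (1, "v"): 0.30, (2, "v"): 0.15}
f = {"u": 0, "v": 1}

# p_{X,Y,B}(x, y, b) = delta_{b, Pi(x = f(y))} p_{X,Y}(x, y);
# only the nonzero entries are stored.
p_xyb = {(x, y, int(x == f[y])): p for (x, y), p in p_xy.items()}

# Marginal on (Y, B), computed by summing over x ...
p_yb = defaultdict(float)
for (x, y, b), p in p_xyb.items():
    p_yb[(y, b)] += p

# ... and compared against the two terms of equation (2).
for y in f:
    assert abs(p_yb[(y, 1)] - p_xy[(f[y], y)]) < 1e-12
    assert abs(p_yb[(y, 0)] - sum(p for (x, yy), p in p_xy.items()
                                  if yy == y and x != f[y])) < 1e-12
print(dict(p_yb))
```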
We start with the conditional entropy
$$H(X|Y) = H(X, Y) - H(Y),$$
which we rewrite as
$$H(X|Y) = [H(X, Y) - H(Y, B)] + [H(Y, B) - H(Y)]. \tag{3}$$
The first term in brackets can be computed explicitly using equation (2) as
$$\begin{split}& - \sum_{x, y} p(x, y) \log p(x, y) \\ & + \sum_{y} p(f(y), y) \log p(f(y), y) \\ & + \sum_{y} \sum_{x \neq f(y)} p(x, y) \log\left( \sum_{x' \neq f(y)} p(x', y)\right). \end{split}$$
The diagonal piece of the first sum — meaning, the set of terms with $x = f(y)$ — cancels with the second sum, leaving us with
$$\sum_{y} \sum_{x \neq f(y)} p(x,y) \log\left( \frac{\sum_{x' \neq f(y)} p(x', y)}{p(x, y)} \right).$$
Because the logarithm is a concave function, we can apply Jensen's inequality (equation 2 in this link), using the normalized weights $p(x, y) \big/ \sum_{y'} \sum_{x'' \neq f(y')} p(x'', y'),$ which are nonnegative and sum to one, to upper bound the above by
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left( \frac{\sum_{y} \sum_{x, x' \neq f(y)} p(x', y)}{\sum_{y} \sum_{x \neq f(y)} p(x, y)} \right).$$
Inside the logarithm, the summand in the numerator has no $x$-dependence, so we can do the $x$ sum directly to obtain
$$\left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log\left( \frac{(|\mathcal{X}| - 1) \sum_{y} \sum_{x' \neq f(y)} p(x', y)}{\sum_{y} \sum_{x \neq f(y)} p(x, y)} \right).$$
At this point we notice that the remaining sum in the numerator is exactly the sum in the denominator, so the ratio inside the logarithm collapses to $|\mathcal{X}| - 1,$ and we obtain the upper bound
$$H(X, Y) - H(Y, B) \leq \left( \sum_y \sum_{x \neq f(y)} p(x, y) \right) \log(|\mathcal{X}| - 1).$$
The coefficient $\left(\sum_{y} \sum_{x \neq f(y)} p(x, y) \right)$ has a useful physical interpretation — it is the probability our guessing strategy fails! It's the sum over $p(x, y)$ for all outcomes where our guess is incorrect. So we'll call this quantity $P_{F},$ and write our upper bound as
$$H(X, Y) - H(Y, B) \leq P_F \log(|\mathcal{X}| - 1).$$
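Here is a minimal numerical check of this intermediate bound, reusing the same placeholder distribution and guessing function as in the earlier sketches (repeated so the block runs on its own).

```python
import math
from collections import defaultdict

# Placeholder joint distribution and guessing function, as before.
p_xy = {(0, "u"): 0.35, (1, "u"): 0.05, (2, "u"): 0.10,
        (0, "v"): 0.05, (1, "v"): 0.30, (2, "v"): 0.15}
f = {"u": 0, "v": 1}
n_x = len({x for x, _ in p_xy})  # |X| = 3

def H(p):
    """Shannon entropy (in nats) of a distribution stored as {outcome: prob}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Marginal on (Y, B), as in equation (2).
p_yb = defaultdict(float)
for (x, y), p in p_xy.items():
    p_yb[(y, int(x == f[y]))] += p

p_fail = sum(p for (x, y), p in p_xy.items() if x != f[y])  # P_F
lhs = H(p_xy) - H(p_yb)                                     # H(X,Y) - H(Y,B)
rhs = p_fail * math.log(n_x - 1)                            # P_F log(|X| - 1)
print(lhs, rhs, lhs <= rhs)                                 # roughly 0.21 0.24 True
```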
Plugging this back into equation (3) gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + [H(Y, B) - H(Y)].$$
To bound the second term, we use the positivity of mutual information, which states
$$H(B) + H(Y) - H(Y, B) \geq 0.$$
This gives
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H(B).$$
The marginal distribution on $B$ is given by
$$p(0) = \sum_{y} \sum_{x \neq f(y)} p(x, y),$$
$$p(1) = \sum_{y} p(f(y), y).$$
In other words, the marginal distribution on $B$ is $\{P_F, 1 - P_F\}.$ We will call the entropy of this distribution the binary entropy $H_{b}(P_F).$ Putting all this together gives us Fano's inequality,
$$H(X|Y) \leq P_F \log(|\mathcal{X}| - 1) + H_{b}(P_F).$$
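As a final sanity check, here is a self-contained Python sketch that evaluates both sides of the inequality for an arbitrary joint distribution, with "guess the most likely $x$ given $y$" as the strategy $f.$

```python
import math
from collections import defaultdict

# An arbitrary joint distribution p_{X,Y}, for illustration.
p_xy = {(0, "u"): 0.35, (1, "u"): 0.05, (2, "u"): 0.10,
        (0, "v"): 0.05, (1, "v"): 0.30, (2, "v"): 0.15}

def H(p):
    """Shannon entropy (in nats) of a distribution stored as {outcome: prob}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# One choice of strategy f: for each y, guess the most likely x.
f = {}
for (x, y), p in p_xy.items():
    if y not in f or p > p_xy[(f[y], y)]:
        f[y] = x

# Marginal p_Y, failure probability P_F, and the binary entropy H_b(P_F).
p_y = defaultdict(float)
for (x, y), p in p_xy.items():
    p_y[y] += p
p_fail = sum(p for (x, y), p in p_xy.items() if x != f[y])
h_binary = H({0: p_fail, 1: 1 - p_fail})

n_x = len({x for x, _ in p_xy})
lhs = H(p_xy) - H(p_y)                        # H(X|Y) = H(X,Y) - H(Y)
rhs = p_fail * math.log(n_x - 1) + h_binary   # Fano bound
print(lhs, rhs, lhs <= rhs)                   # roughly 0.85 0.89 True
```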