Basic Properties of Shannon Entropy and Shannon Information.







Definition of Entropy:

Suppose X is a random variable which takes on a finite set of values according to a probability distribution p(X).  Then the entropy of this probability distribution is defined to be the quantity

H(X) = - Σ_{i=1}^{n} p(xi) lg p(xi),

where X takes the values xi, 1 <= i <= n, and lg denotes the logarithm in base 2.

If X is continuous with density p(x), the entropy is defined analogously as

H(X) = - ∫ p(x) lg p(x) dx.

Note that the sum is taken only over the values x which have non-zero probability (equivalently, one uses the convention 0 lg 0 = 0).
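As a quick illustration of the discrete definition, here is a minimal Python sketch (the function name entropy and the example distributions are ours, not part of the original notes):

    import math

    def entropy(p):
        """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
        # Values with probability 0 are skipped, i.e. the convention 0 * lg(0) = 0.
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([0.5, 0.5]))              # 1.0 bit: a fair coin
    print(entropy([0.9, 0.1]))              # about 0.469 bits: a biased coin is less uncertain
    print(entropy([0.25, 0.25, 0.5, 0.0]))  # the zero-probability value contributes nothing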
 
 
 

Theorem 1:

Suppose X is a random variable with probability distribution p(x1), p(x2), ..., p(xn), where p(xi) > 0 for 1 <= i <= n.  Then H(X) <= lg n, with equality if and only if p(xi) = 1/n for 1 <= i <= n.

Proof:

By definition,

H(X) = - Σ_{i=1}^{n} p(xi) lg p(xi) = Σ_{i=1}^{n} p(xi) lg ( 1 / p(xi) ).

Applying Jensen's Inequality (lg is strictly concave), we obtain the following:

H(X) = Σ_{i=1}^{n} p(xi) lg ( 1 / p(xi) ) <= lg ( Σ_{i=1}^{n} p(xi) * ( 1 / p(xi) ) ) = lg n.

Further, equality occurs if and only if p(xi) = 1/n for 1 <= i <= n, since equality in Jensen's Inequality requires 1/p(xi) to take the same value c for every i, and then

1 = Σ_{i=1}^{n} p(xi) = n / c,

so c = n, i.e. p(xi) = 1/n for 1 <= i <= n.
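A quick numerical check of the bound (the distributions below are made up for illustration): any distribution on n = 4 values stays at or below lg 4 = 2 bits, and only the uniform one attains it.

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    n = 4
    uniform = [1 / n] * n
    skewed = [0.7, 0.1, 0.1, 0.1]

    print(math.log2(n))      # 2.0, the upper bound lg n
    print(entropy(uniform))  # 2.0, the bound is attained
    print(entropy(skewed))   # about 1.357, strictly below lg n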






Theorem 2:

H(X,Y) <= H(X) + H(Y), with equality if and only if X and Y are independent random variables.

Proof:

Suppose X takes on values xi, 1 <= i <= m, and Y takes on values yj, 1 <= j <= n.

First observe that

p(xi) = Σ_{j=1}^{n} p(xi, yj), where 1 <= i <= m, and

p(yj) = Σ_{i=1}^{m} p(xi, yj), where 1 <= j <= n.

Now, by definition and using these identities,

H(X) + H(Y) = - Σ_{i=1}^{m} p(xi) lg p(xi) - Σ_{j=1}^{n} p(yj) lg p(yj)
            = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(xi) - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(yj).

By property of the lg function,

H(X) + H(Y) = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg ( p(xi) p(yj) ).

On the other hand,

H(X,Y) = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(xi, yj).

Combining the last two equations we get

H(X,Y) - H(X) - H(Y) = Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) [ lg ( p(xi) p(yj) ) - lg p(xi, yj) ].

By property of the lg function and of summation,

H(X,Y) - H(X) - H(Y) = Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg ( p(xi) p(yj) / p(xi, yj) ).

Applying Jensen's Inequality, we obtain the following:

H(X,Y) - H(X) - H(Y) <= lg ( Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) * p(xi) p(yj) / p(xi, yj) ) = lg ( Σ_{i=1}^{m} p(xi) Σ_{j=1}^{n} p(yj) ),

which is equal to lg(1) = 0.  Hence H(X,Y) <= H(X) + H(Y).

Now it is not difficult to see that equality occurs exactly when p(xi) p(yj) / p(xi, yj) = c for all i, j (see Jensen's Inequality).  Furthermore, for equality to occur we must have

1 = Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi) p(yj) = c * Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) = c.

It follows that c must be equal to 1.  Hence, equality occurs if and only if p(xi) p(yj) = p(xi, yj), i.e. if and only if X and Y are independent.
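The theorem can be checked numerically.  Below is a minimal Python sketch with two joint tables we made up for illustration: a dependent one, which gives a strict inequality, and the product of its marginals, which gives equality.

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def joint_entropy(table):
        return entropy([p for row in table for p in row])

    # Joint distribution p(xi, yj): rows index X, columns index Y.
    dependent = [[0.4, 0.1],
                 [0.1, 0.4]]
    # An independent table with the same marginals (0.5, 0.5) for both X and Y.
    independent = [[0.25, 0.25],
                   [0.25, 0.25]]

    for table in (dependent, independent):
        px = [sum(row) for row in table]        # marginal distribution of X
        py = [sum(col) for col in zip(*table)]  # marginal distribution of Y
        print(joint_entropy(table), "<=", entropy(px) + entropy(py))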
 
 
 
 
 

Definition of Conditional Entropy:

Suppose X and Y are two random variables.  Then for any fixed value y of Y, we get a conditional probability distribution p(X|y).  By definition,

H(X|y) = - Σ_{x} p(x|y) lg p(x|y).

Let the conditional entropy H(X|Y) be defined to be the weighted average (with respect to the probabilities p(y)) of the entropies H(X|y) over all possible values y.  Therefore it is

H(X|Y) = Σ_{y} p(y) H(X|y) = - Σ_{y} Σ_{x} p(x, y) lg p(x|y).
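A minimal Python sketch of this weighted average, using a small joint table chosen only for illustration (the helper names are ours):

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    # Joint distribution p(x, y): rows index X, columns index Y.
    joint = [[0.4, 0.1],
             [0.1, 0.4]]

    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    h_x_given_Y = 0.0
    for j, pyj in enumerate(py):
        p_x_given_yj = [joint[i][j] / pyj for i in range(len(joint))]  # p(X|y_j)
        h_x_given_Y += pyj * entropy(p_x_given_yj)  # weighted average of the H(X|y_j)

    print(h_x_given_Y)  # about 0.722 bits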




Theorem 3:
 

H(X,Y) = H(Y) + H(X|Y).

Proof:

Suppose X takes on values xi, 1 <= i <= m, and Y takes on values yj, 1 <= j <= n.  Then

H(X,Y) = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(xi, yj)
       = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg ( p(xi|yj) p(yj) )
       = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(yj) - Σ_{i=1}^{m} Σ_{j=1}^{n} p(xi, yj) lg p(xi|yj)
       = - Σ_{j=1}^{n} p(yj) lg p(yj) - Σ_{j=1}^{n} Σ_{i=1}^{m} p(xi|yj) p(yj) lg p(xi|yj)
       = H(Y) + H(X|Y).

This development is obviously symmetric, since p(xi, yj) = p(xi|yj) p(yj) = p(yj|xi) p(xi).
Therefore H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X).
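A quick sanity check of the chain rule in Python, on the same kind of illustrative joint table as above (the table and helper names are ours):

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    joint = [[0.4, 0.1],
             [0.1, 0.4]]  # p(x, y): rows index X, columns index Y

    h_xy = entropy([p for row in joint for p in row])  # H(X,Y)
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]

    # H(X|Y) as a weighted average over the columns, H(Y|X) over the rows.
    h_x_given_y = sum(pyj * entropy([joint[i][j] / pyj for i in range(2)])
                      for j, pyj in enumerate(py))
    h_y_given_x = sum(pxi * entropy([joint[i][j] / pxi for j in range(2)])
                      for i, pxi in enumerate(px))

    print(h_xy, entropy(py) + h_x_given_y, entropy(px) + h_y_given_x)  # all about 1.722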
 
 

Corollary:

H(X|Y) <= H(X) with equality if and only if X and Y are independent.
 

Proof:

From Theorem 2, we know that H(X,Y) <= H(X) + H(Y)

with equality if and only if X and Y are independent.  Using Theorem 3, we get

H(X,Y) = H(X | Y) + H(Y) <= H(X) + H(Y)

with equality if and only if X and Y are independent.  Subtracting H(Y) from both sides, the corollary follows.
 
 
 
 

Definition of Mutual Information:
 

The mutual information I(X | Y) is defined to be the reduction of the entropy of X produced by the knowledge of Y:

I(X | Y) = H(X) - H(X | Y).

Suppose that, in a pattern recognition system, there are n classes Ci and that the feature vector X has m dimensions, each of which can take q values; therefore X can take q^m values xj.  Then I(C | X) = H(C) - H(C | X) is equal to

I(C | X) = Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg ( p(Ci | xj) / p(Ci) ).

Proof:
 

First remember that

p(Ci) = Σ_{j=1}^{q^m} p(Ci, xj), for 1 <= i <= n.

Now

I(C | X) = H(C) - H(C | X).

By definition of entropy we have

H(C) = - Σ_{i=1}^{n} p(Ci) lg p(Ci)   and   H(C | X) = - Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg p(Ci | xj).

By the laws of probabilities,

H(C) = - Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg p(Ci),

and therefore

I(C | X) = Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) [ lg p(Ci | xj) - lg p(Ci) ] = Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg ( p(Ci | xj) / p(Ci) ),

which is the stated formula.
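As an illustration, the following Python sketch computes I(C | X) both as H(C) - H(C | X) and through the double sum above, on a made-up joint table p(Ci, xj) with two classes and three feature-vector values; both computations agree.

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    # Rows index the classes Ci, columns index the feature-vector values xj.
    joint = [[0.30, 0.10, 0.10],
             [0.05, 0.15, 0.30]]

    pc = [sum(row) for row in joint]        # p(Ci)
    px = [sum(col) for col in zip(*joint)]  # p(xj)

    # I(C | X) = H(C) - H(C | X), with H(C | X) a weighted average over the xj.
    h_c_given_x = sum(pxj * entropy([joint[i][j] / pxj for i in range(len(pc))])
                      for j, pxj in enumerate(px))
    i_def = entropy(pc) - h_c_given_x

    # The double-sum form derived in the proof.
    i_sum = sum(joint[i][j] * math.log2((joint[i][j] / px[j]) / pc[i])
                for i in range(len(pc)) for j in range(len(px)))

    print(i_def, i_sum)  # the two values coincide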







Theorem 4:
 

I(C | X) is symmetric in its two arguments, i.e. I(C | X) = H(C) - H(C | X) = H(X) - H(X | C).
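Before the proof, a short numerical check in Python (the joint table and the helper cond_entropy are ours, chosen only for illustration): H(C) - H(C | X) and H(X) - H(X | C) come out equal.

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cond_entropy(joint, given_cols):
        """H(rows | cols) if given_cols is True, H(cols | rows) otherwise."""
        table = [list(col) for col in zip(*joint)] if given_cols else joint
        # Each row of `table` is now one value of the conditioning variable.
        return sum(sum(row) * entropy([p / sum(row) for p in row]) for row in table)

    joint = [[0.30, 0.10, 0.10],
             [0.05, 0.15, 0.30]]  # rows: classes Ci, columns: feature-vector values xj
    pc = [sum(row) for row in joint]
    px = [sum(col) for col in zip(*joint)]

    print(entropy(pc) - cond_entropy(joint, given_cols=True))   # H(C) - H(C|X)
    print(entropy(px) - cond_entropy(joint, given_cols=False))  # H(X) - H(X|C)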

Proof:

From the previous proof,

I(C | X) = H(C) - H(C | X) = Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg ( p(Ci | xj) / p(Ci) ).

Since p(Ci | xj) = p(Ci, xj) / p(xj), the argument of the lg can be written as

p(Ci | xj) / p(Ci) = p(Ci, xj) / ( p(Ci) p(xj) ) = p(xj | Ci) / p(xj),

which is unchanged when the roles of C and X are exchanged.  Running the previous development backwards with C and X exchanged then gives

I(C | X) = Σ_{j=1}^{q^m} Σ_{i=1}^{n} p(Ci, xj) lg ( p(xj | Ci) / p(xj) ) = H(X) - H(X | C).

Naturally, the mutual information of the class C and a single feature xi (the i-th component of X, which takes q values v) is defined as

I(C | xi) = H(C) - H(C | xi) = Σ_{v=1}^{q} Σ_{k=1}^{n} p(Ck, xi = v) lg ( p(Ck | xi = v) / p(Ck) ).

The sum of the amounts of information with respect to the individual features is

Σ_{i=1}^{m} I(C | xi) = Σ_{i=1}^{m} ( H(C) - H(C | xi) ).


Theorem 5 (Non-additivity of the Information):

This means that, in general,

I(C | X) ≠ Σ_{i=1}^{m} I(C | xi).

Proof:

1)  We already know from Theorem 2 (applied repeatedly to the m features) that

H(x1, x2, ..., xm) <= Σ_{i=1}^{m} H(xi),

with equality if and only if the xi are independent; the same argument applied to the conditional distributions gives H(x1, ..., xm | C) <= Σ_{i=1}^{m} H(xi | C), with equality if and only if they are class conditionally independent.  Therefore, if we assume the class unconditional independence, then the information becomes

I(C | X) = H(X) - H(X | C) = Σ_{i=1}^{m} H(xi) - H(x1, ..., xm | C).

This implies that

I(C | X) >= Σ_{i=1}^{m} ( H(xi) - H(xi | C) ) = Σ_{i=1}^{m} I(C | xi),

with equality if and only if the class conditional independence is fulfilled; but this being impossible (see this theorem), it can only be greater.  This means that, if you assume class unconditional independence, you gain more information by computing the information of the whole feature vector than by summing the information of the individual features.

2)

Let X be the feature vector with features xi, 1 <= i <= m, where each feature can take q values.  Denote by Xj the values that the vector can take, where 1 <= j <= q^m.  By Theorem 4,

I(C | X) = H(X) - H(X | C), where H(X | C) = - Σ_{k=1}^{n} Σ_{j=1}^{q^m} p(Xj, Ck) lg p(Xj | Ck).

Assuming the class conditional independence, you get

p(Xj | Ck) = p(x1 = v1 | Ck) p(x2 = v2 | Ck) ... p(xm = vm | Ck),

where v1, ..., vm are the feature values making up Xj, so that

H(X | C) = - Σ_{k=1}^{n} Σ_{j=1}^{q^m} p(Xj, Ck) [ lg p(x1 = v1 | Ck) + ... + lg p(xm = vm | Ck) ].

But since

Σ_{j : Xj has xi = v} p(Xj, Ck) = p(xi = v, Ck)

for every feature i, every value v and every class Ck, then all this big equation collapses to

H(X | C) = - Σ_{i=1}^{m} Σ_{k=1}^{n} Σ_{v=1}^{q} p(xi = v, Ck) lg p(xi = v | Ck) = Σ_{i=1}^{m} H(xi | C),

and hence I(C | X) = H(X) - Σ_{i=1}^{m} H(xi | C).  Since, by Theorem 2, H(X) <= Σ_{i=1}^{m} H(xi), this implies that

I(C | X) <= Σ_{i=1}^{m} ( H(xi) - H(xi | C) ) = Σ_{i=1}^{m} I(C | xi),

with equality if and only if the class unconditional independence is fulfilled; but this being impossible (see this theorem), it can only be smaller.  Hence the theorem.
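As a numerical illustration of part 2 (a sketch with a small made-up problem: two equiprobable classes and two binary features that are class conditionally independent but informative about the class), I(C | X) comes out strictly smaller than I(C | x1) + I(C | x2):

    import math

    def mutual_info(joint):
        """Mutual information between the row variable and the column variable of a joint table."""
        rows = [sum(r) for r in joint]
        cols = [sum(c) for c in zip(*joint)]
        return sum(p * math.log2(p / (rows[i] * cols[j]))
                   for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

    p_c = [0.5, 0.5]            # two equiprobable classes
    p_x1_given_c = [0.8, 0.2]   # p(x1 = 1 | C = 0), p(x1 = 1 | C = 1)
    p_x2_given_c = [0.8, 0.2]   # p(x2 = 1 | C = 0), p(x2 = 1 | C = 1)

    # Joint tables p(C, xi) for the individual features (columns: xi = 0, xi = 1).
    joint_x1 = [[p_c[c] * (1 - p_x1_given_c[c]), p_c[c] * p_x1_given_c[c]] for c in range(2)]
    joint_x2 = [[p_c[c] * (1 - p_x2_given_c[c]), p_c[c] * p_x2_given_c[c]] for c in range(2)]

    # Joint table p(C, (x1, x2)) for the full vector, using class conditional independence
    # (columns: (x1, x2) = 00, 01, 10, 11).
    joint_X = [[p_c[c] * (p_x1_given_c[c] if a else 1 - p_x1_given_c[c])
                       * (p_x2_given_c[c] if b else 1 - p_x2_given_c[c])
                for a in (0, 1) for b in (0, 1)]
               for c in range(2)]

    print(mutual_info(joint_X))                           # about 0.46 bits
    print(mutual_info(joint_x1) + mutual_info(joint_x2))  # about 0.56 bits, strictly larger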