Suppose X is random variable which takes on a finite set of values according
to a probability distribution p(X). Then the entropy of this probability
distribution is defined to be the quantity
![]()
Where X takes the value xi , 1 <= i <= n.
If X is continuous, then it is defined like this
![]()
Note that H(X) is defined only for x which have non zero probability.
Suppose X is a random variable with probability distribution p(x1), p(x2), ... p(xn), where p(xi)>0, and 1<= i <= n. Then H(X) <= lg n, with equality if and only if p(xi) = 1/n for 1<= i <= n.
Proof:

Applying Jensen's Inequality, we obtain the
following:

Further, equality occurs if and only if p(xi) = 1/n, for 1<= i <= n since:

H(X,Y) <= H(X) + H(Y), with equality if and only if X and Y are independent events.
Proof:
Suppose X takes on values xi , 1<= i <= m, and Y takes on values yj , 1<= j <= n.
First observe that
![]()
where 1<= i <= m, and
![]()
where 1<= j <= n. Now, by definition
![]()
By property of the lg function,
![]()
On the other hand
![]()
Combining the last two equation we get
![]()
By property of the lg function and of summation
![]()
Applying Jensen's Inequality, we obtain the following:
![]()
Which is equal to lg(1)=0.
Now it is not difficult to see that equality occurs when p(xi)p(yj) / p(xi,yj) = c for all i,j (see Jensen's Inequality). Further more we must have, for equality to occur,
![]()
It follows that c must be equal to 1. Hence, equality occurs if
and only if p(xi)p(yj) = p(xi,yj),
i.e. if and only if X and Y are independent.
Definition of Conditional Entropy:
Suppose X and Y are two random variables. Then for any fixed value y of Y, we get a conditional probability distribution p(X|y). By definition
![]()
Let the conditional entropy H(X|Y) be defined to be the weighted average (with respect to the probabilities p(y)) of the entropies H(X|y) over all possible values y. Therefore it is
![]()
H(X,Y) = H(Y) + H(X|Y).
Proof:
Suppose X takes on values xi , 1<= i <= m, and Y takes on values yj , 1<= j <= n. Then

This development is obviously symmetric since p(xi,yj)
= p(xi
| yj)p(yj) = p(yj |
xi)p(xi).
Therefore H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|Y).
H(X|Y) <= H(X) with equality if and only if X and Y are independent.
Proof:
From Theorem 2, we know that H(X,Y) <= H(X) + H(Y)
with equality if and only if X and Y are independent. Using Theorem 3, we get
H(X,Y) = H(X | Y) + H(Y) <= H(X) + H(Y)
with equality if and only if X and Y are independent. And
the corollary follows.
Definition of Mutual Information:
The mutual information I(X | Y) is define to be the difference of entropy on X generated by the knowledge of Y:
I(X | Y) = H(X) - H(X | Y)
Suppose that, in a pattern recognition system, there is n Classes C and the feature vector X has m dimensions, which can take q value each, therefore X can take qm value xj. Then I(C | X) = H(C) - H(C | X) is equal to
![]()
Proof:
First remember that
Now
![]()
By definition of entropy we have
![]()
By the laws of probabilities
![]()
I(C | X) is symmetric in its two argument, i.e. I(C | X) = H(C) - H(C | X) = H(X) - H(X | C)
Proof:
![]()
Naturally, the mutual information of the class C and 1 feature xi is defined has
![]()
The sum of the amount of information with respect to the the individual feature is
![]()
Theorem 5 Non additivity of the Information:
This means that
![]()
Proof:
1) We already know from Theorem 2 That
![]()
with equality if and only if the xj are independent. Therefore if we assume the class unconditional independence, then the information becomes
![]()
This imply that
![]()
with equality if and only if the class conditional independence is fulfilled, but this being impossible (see this theorem), it can only be greater. This means that, if you assume class conditional independence, you gain greater information by computing the information of the feature vector than by computing the information of the individual features.
2)
Let X be the feature vector with feature xj , 1 <= j <= m, and each feature can take q value. Denote Xj the value that the vector can take, where 1 <= j <= qm .
![]()
Assuming the class conditional
independence, you get

But since
![]()
for j, then all this big equation collapse to

This imply that
![]()
with equality if and only if the class
unconditional independence is fulfilled, but this being impossible
(see this theorem),
it can only be smaller. Hence the theorem.