Suppose X is a random variable which takes on a finite set of values according
to a probability distribution p(X). Then the entropy of this probability
distribution is defined to be the quantity

H(X) = -\sum_{i=1}^{n} p(x_i) \lg p(x_i),

where X takes the values x_{i}, 1 <= i <= n, and lg denotes the base-2 logarithm.

If X is continuous, then the entropy is defined analogously by the integral

H(X) = -\int p(x) \lg p(x) \, dx.

Note that the sum defining H(X) is taken only over those x which have non-zero probability.
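As a quick illustration, here is a small Python sketch of the discrete definition; the two coin distributions are made-up examples, not taken from the text.

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities.
    Zero-probability outcomes are skipped, matching the note above."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: about 0.469 bits
```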

__Theorem 1:__

Suppose X is a random variable with probability distribution p(x_{1}),
p(x_{2}), ..., p(x_{n}), where p(x_{i}) > 0 and 1 <= i <= n. Then H(X) <= lg n,
with equality if and only if p(x_{i}) = 1/n for 1 <= i <= n.

__Proof:__

Since lg is concave, applying Jensen's Inequality we obtain the
following:

H(X) = \sum_{i=1}^{n} p(x_i) \lg \frac{1}{p(x_i)} <= \lg \sum_{i=1}^{n} p(x_i) \cdot \frac{1}{p(x_i)} = \lg n.

Further, equality occurs if and only if p(x_{i}) = 1/n for 1 <= i <= n, since
equality in Jensen's Inequality requires that the terms 1/p(x_{i}) all be
equal to some constant c; then p(x_{i}) = 1/c for every i, and summing over i
gives c = n.
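As a sanity check of the theorem (with illustrative numbers of our own choosing), the entropy of a few 4-outcome distributions never exceeds lg 4 = 2 bits, and reaches it only for the uniform one:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

for p in ([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]):
    print(p, "->", entropy(p), "<= lg n =", math.log2(len(p)))
```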

__Theorem 2:__

H(X,Y) <= H(X) + H(Y), with equality if and only if X and Y are independent random variables.

__Proof:__

Suppose X takes on values x_{i}, 1 <= i <= m, and Y takes on values y_{j},
1 <= j <= n.

First observe that

p(x_i) = \sum_{j=1}^{n} p(x_i, y_j),

where 1 <= i <= m, and

p(y_j) = \sum_{i=1}^{m} p(x_i, y_j),

where 1 <= j <= n. Now, by definition

H(X) + H(Y) = -\sum_{i=1}^{m} p(x_i) \lg p(x_i) - \sum_{j=1}^{n} p(y_j) \lg p(y_j)
            = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(x_i)
              - \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(y_j).

By property of the lg function,

H(X) + H(Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg \big( p(x_i) \, p(y_j) \big).

On the other hand

H(X,Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(x_i, y_j).

Combining the last two equations we get

H(X,Y) - H(X) - H(Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg \big( p(x_i) \, p(y_j) \big)
                       - \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(x_i, y_j).

By property of the lg function and of summation,

H(X,Y) - H(X) - H(Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg \frac{p(x_i) \, p(y_j)}{p(x_i, y_j)}.

Applying Jensen's Inequality, we obtain the following:

H(X,Y) - H(X) - H(Y) <= \lg \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \cdot \frac{p(x_i) \, p(y_j)}{p(x_i, y_j)}
                      = \lg \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i) \, p(y_j),

which is equal to lg(1) = 0.

Now it is not difficult to see that equality occurs when p(x_{i})p(y_{j})
/ p(x_{i},y_{j}) = c for all i, j (see Jensen's
Inequality). Furthermore, for equality to occur we must have

1 = \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i) \, p(y_j)
  = \sum_{i=1}^{m} \sum_{j=1}^{n} c \, p(x_i, y_j) = c.

It follows that c must be equal to 1. Hence, equality occurs if
and only if p(x_{i})p(y_{j}) = p(x_{i},y_{j}),
i.e. if and only if X and Y are independent.
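Here is a small numeric check of Theorem 2 on two made-up 2x2 joint tables: a correlated one, where the inequality is strict, and a product one, where equality holds:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def check(joint):                            # joint[i][j] = p(x_i, y_j)
    px = [sum(row) for row in joint]         # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]   # marginal distribution of Y
    flat = [p for row in joint for p in row]
    print("H(X,Y) =", entropy(flat), "  H(X)+H(Y) =", entropy(px) + entropy(py))

check([[0.4, 0.1], [0.1, 0.4]])      # correlated: strict inequality
check([[0.25, 0.25], [0.25, 0.25]])  # product of uniform marginals: equality
```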

__Definition of Conditional Entropy:__

Suppose X and Y are two random variables. Then for any fixed value y of Y, we get a conditional probability distribution p(X|y). By definition

H(X|y) = -\sum_{i} p(x_i|y) \lg p(x_i|y).

Let the conditional entropy H(X|Y) be defined as the weighted average (with respect to the probabilities p(y)) of the entropies H(X|y) over all possible values y. Therefore it is

H(X|Y) = \sum_{y} p(y) \, H(X|y) = -\sum_{y} \sum_{i} p(y) \, p(x_i|y) \lg p(x_i|y).
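A minimal sketch of this weighted average, computed on a hypothetical 2x2 joint table (the numbers are for illustration only):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]          # illustrative p(x_i, y_j)
py = [sum(col) for col in zip(*joint)]    # p(y_j)
h_cond = 0.0
for j, pyj in enumerate(py):
    cond = [joint[i][j] / pyj for i in range(len(joint))]  # p(X|y_j)
    h_cond += pyj * entropy(cond)         # weight H(X|y_j) by p(y_j)
print("H(X|Y) =", h_cond)                 # about 0.846 bits for this table
```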

__Theorem 3:__

H(X,Y) = H(Y) + H(X|Y).

__Proof:__

Suppose X takes on values x_{i}, 1 <= i <= m, and Y takes on values y_{j},
1 <= j <= n. Then

H(X,Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(x_i, y_j)
       = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg \big( p(x_i|y_j) \, p(y_j) \big)
       = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(y_j)
         - \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \lg p(x_i|y_j)
       = H(Y) + H(X|Y).

This development is obviously symmetric, since p(x_{i},y_{j}) =
p(x_{i}|y_{j})p(y_{j}) = p(y_{j}|x_{i})p(x_{i}).

Therefore H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X).
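The identity is easy to check numerically; the sketch below verifies both decompositions on the same made-up table used above:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cond_entropy(joint, given_y):
    # H(X|Y) when given_y is True (condition on columns), H(Y|X) otherwise.
    rows = list(zip(*joint)) if given_y else joint
    return sum(sum(r) * entropy([p / sum(r) for p in r]) for r in rows)

joint = [[0.3, 0.1], [0.2, 0.4]]                  # illustrative p(x_i, y_j)
hxy = entropy([p for row in joint for p in row])  # H(X,Y)
hx = entropy([sum(row) for row in joint])         # H(X)
hy = entropy([sum(col) for col in zip(*joint)])   # H(Y)
print(hxy, "=", hy + cond_entropy(joint, True), "=", hx + cond_entropy(joint, False))
```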

__Corollary:__

H(X|Y) <= H(X), with equality if and only if X and Y are independent.

__Proof:__

From Theorem 2, we know that H(X,Y) <= H(X) + H(Y)

with equality if and only if X and Y are independent. Using Theorem 3, we get

H(X,Y) = H(X | Y) + H(Y) <= H(X) + H(Y)

with equality if and only if X and Y are independent. And
the corollary follows.

__Definition of Mutual Information:__

The mutual information I(X | Y) is defined to be the reduction of the entropy of X produced by the knowledge of Y:

I(X | Y) = H(X) - H(X | Y)
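On the same made-up table as above, the definition gives roughly 0.125 bits: knowing Y removes about an eighth of a bit of uncertainty about X.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]                     # illustrative p(x_i, y_j)
hx = entropy([sum(row) for row in joint])            # H(X)
py = [sum(col) for col in zip(*joint)]               # p(y_j)
hx_given_y = sum(pyj * entropy([joint[i][j] / pyj for i in range(2)])
                 for j, pyj in enumerate(py))        # H(X|Y)
print("I(X|Y) =", hx - hx_given_y)                   # about 0.125 bits
```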

Suppose that, in a pattern recognition system, there are n classes c_{i} and
the feature vector X has m dimensions, each of which can take q values;
therefore X can take q^{m} values x_{j}. Then I(C | X) = H(C) - H(C | X) is
equal to

I(C|X) = \sum_{i=1}^{n} \sum_{j=1}^{q^m} p(c_i, x_j) \lg \frac{p(c_i, x_j)}{p(c_i) \, p(x_j)}.

__Proof:__

First remember that

I(C|X) = H(C) - H(C|X).

Now

H(C) = -\sum_{i=1}^{n} p(c_i) \lg p(c_i).

By definition of entropy we have

H(C|X) = \sum_{j=1}^{q^m} p(x_j) \, H(C|x_j)
       = -\sum_{j=1}^{q^m} \sum_{i=1}^{n} p(x_j) \, p(c_i|x_j) \lg p(c_i|x_j).

By the laws of probabilities, p(c_i) = \sum_{j} p(c_i, x_j) and
p(c_i|x_j) \, p(x_j) = p(c_i, x_j), so

I(C|X) = -\sum_{i=1}^{n} \sum_{j=1}^{q^m} p(c_i, x_j) \lg p(c_i)
         + \sum_{i=1}^{n} \sum_{j=1}^{q^m} p(c_i, x_j) \lg p(c_i|x_j)
       = \sum_{i=1}^{n} \sum_{j=1}^{q^m} p(c_i, x_j) \lg \frac{p(c_i, x_j)}{p(c_i) \, p(x_j)}.
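A quick numeric check of the equality just proved, on a hypothetical joint table p(c_i, x_j) with two classes and two feature values: the difference H(C) - H(C|X) and the double sum agree.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

joint = [[0.25, 0.05], [0.10, 0.60]]    # illustrative p(c_i, x_j)
pc = [sum(row) for row in joint]        # p(c_i)
px = [sum(col) for col in zip(*joint)]  # p(x_j)

h_c = entropy(pc)
h_c_given_x = sum(px[j] * entropy([joint[i][j] / px[j] for i in range(2)])
                  for j in range(2))
double_sum = sum(joint[i][j] * math.log2(joint[i][j] / (pc[i] * px[j]))
                 for i in range(2) for j in range(2))
print(h_c - h_c_given_x, "=", double_sum)   # both about 0.325 bits
```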

I(C | X) is symmetric in its two arguments, i.e. I(C | X) = H(C) - H(C | X) = H(X) - H(X | C).

__Proof:__

The expression just derived,

I(C|X) = \sum_{i=1}^{n} \sum_{j=1}^{q^m} p(c_i, x_j) \lg \frac{p(c_i, x_j)}{p(c_i) \, p(x_j)},

is symmetric in its two arguments, so carrying out the same computation with
the roles of C and X exchanged yields H(X) - H(X | C).

Naturally, the mutual information of the class C and one feature x_{j} (where
x_{j} now denotes the j-th component of the feature vector, 1 <= j <= m) is
defined as

I(C|x_j) = H(C) - H(C|x_j).

The sum of the amounts of information with respect to the individual features
is

\sum_{j=1}^{m} I(C|x_j) = \sum_{j=1}^{m} \big( H(C) - H(C|x_j) \big).
__Theorem 5 Non-additivity of the Information:__

In general, the information carried by the whole feature vector differs from
the sum of the informations carried by the individual features. This means
that

I(C|X) \neq \sum_{j=1}^{m} I(C|x_j).

__Proof:__

1) We already know from Theorem 2 that

H(X) = H(x_1, x_2, ..., x_m) <= \sum_{j=1}^{m} H(x_j)

with equality if and only if the x_{j} are independent. Therefore, if we
assume the class unconditional independence, the information becomes

I(C|X) = H(X) - H(X|C) = \sum_{j=1}^{m} H(x_j) - H(X|C).

Since, by the same argument applied to each conditional distribution p(X|c_i),
H(X|C) <= \sum_{j=1}^{m} H(x_j|C), this implies that

I(C|X) >= \sum_{j=1}^{m} \big( H(x_j) - H(x_j|C) \big) = \sum_{j=1}^{m} I(C|x_j)

with equality if and only if the class conditional independence is fulfilled;
but this being impossible (see this theorem), it can only be greater. This
means that, if you assume class unconditional independence, you gain more
information by computing the information of the whole feature vector than by
summing the informations of the individual features.

2)

Let X be the feature vector with features x_{j}, 1 <= j <= m, where each
feature can take q values. Denote by X_{k} the values that the vector can
take, where 1 <= k <= q^{m}.

Assuming the class conditional independence, you get

H(X|C) = -\sum_{i=1}^{n} p(c_i) \sum_{k=1}^{q^m} p(X_k|c_i) \lg p(X_k|c_i)
       = -\sum_{i=1}^{n} p(c_i) \sum_{k=1}^{q^m} p(X_k|c_i) \sum_{j=1}^{m} \lg p(x_j^{(k)}|c_i),

where x_j^{(k)} denotes the value of the j-th feature in X_k. But since

\sum_{k : \, x_j^{(k)} = v} p(X_k|c_i) = p(x_j = v \mid c_i)

for each j (summing the joint over the values of the other components yields
the marginal), all this big equation collapses to

H(X|C) = \sum_{j=1}^{m} H(x_j|C).

This implies that

I(C|X) = H(X) - H(X|C) = H(X) - \sum_{j=1}^{m} H(x_j|C)
      <= \sum_{j=1}^{m} H(x_j) - \sum_{j=1}^{m} H(x_j|C) = \sum_{j=1}^{m} I(C|x_j)

with equality if and only if the class unconditional independence is
fulfilled; but this being impossible (see this theorem), it can only be
smaller. Hence the theorem.
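To make the non-additivity concrete, here is a small made-up example (all numbers invented for illustration): two binary features whose XOR is correlated with the class. Each feature alone carries no information about C, yet the full vector does, so I(C|X) strictly exceeds the sum; with other tables the sum can exceed I(C|X) instead.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_info(joint):   # joint: dict mapping (c, x) -> probability
    pc, px = {}, {}
    for (c, x), p in joint.items():
        pc[c] = pc.get(c, 0.0) + p
        px[x] = px.get(x, 0.0) + p
    return sum(p * math.log2(p / (pc[c] * px[x]))
               for (c, x), p in joint.items() if p > 0)

# Hypothetical joint p(c, (x1, x2)): the class follows the XOR of the features
# with probability 0.8.
pjoint = {(c, (a, b)): (0.2 if (a ^ b) == c else 0.05)
          for c in (0, 1) for a in (0, 1) for b in (0, 1)}

i_vector = mutual_info(pjoint)   # I(C|X) on the full feature vector
i_1 = mutual_info({(c, a): sum(p for (cc, (aa, bb)), p in pjoint.items()
                               if cc == c and aa == a)
                   for c in (0, 1) for a in (0, 1)})   # I(C|x_1)
i_2 = mutual_info({(c, b): sum(p for (cc, (aa, bb)), p in pjoint.items()
                               if cc == c and bb == b)
                   for c in (0, 1) for b in (0, 1)})   # I(C|x_2)
print("I(C|X) =", i_vector, "  I(C|x_1) + I(C|x_2) =", i_1 + i_2)
```

Here both individual informations come out to 0 bits while I(C|X) is about 0.278 bits, which matches part 1 of the proof: the two features are unconditionally independent, so the vector carries strictly more information than the sum.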