Let a pattern recognition system be composed of m classes ck, 1 <= k <= m, and of a feature vector X of n components. Two different independence assumptions can be made about such a system to simplify certain computations.
1) The Class unconditional independence:
p(X) = p(x1) p(x2) ... p(xn)
or
2) The Class conditional independence:
p(X | ck) = p(x1 | ck) p(x2 | ck) ... p(xn | ck)
It is little known, however, that these two assumptions are in general mutually contradictory.
The next theorem states this more formally.
Theorem (contradiction of independence):
If p(xi | ck) ≠ p(xi | cn), then the two previous independence assumptions
cannot be fulfilled simultaneously.
Proof:
By definition we have
![]()
and
![]()
1) Now assume the Class unconditional independence
p(xi, xj) = p(xi) p(xj)
Then
![]()
2) On the other hand, if the class conditional independence were true, then we would have
p( xi, xj | ck) = p(xi | ck) p(xj | ck)
which implies
![]()
which, using the last equations of the first and second parts of the proof,
would imply
![]()
or
![]()
But for this to be true, p(xi | ck) / p(xi) needs to be constant for all k.
We have assumed otherwise (p(xi | ck) ≠ p(xi | cn)); therefore the assumption
p(xi, xj | ck) = p(xi | ck) p(xj | ck) cannot be true, hence the theorem.
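The conflict can be checked numerically on a small discrete case. The sketch below is only an illustration under assumed values: two hypothetical classes with equal priors and two binary features whose conditional probabilities differ between the classes. It builds the joint distribution from class conditional independence and measures how far it lies from the product of the marginals.

```python
from itertools import product

# Hypothetical two-class system with two binary features.
# All numbers are illustrative assumptions, not taken from the text.
priors = {"c1": 0.5, "c2": 0.5}
# p(x_i = 1 | c_k) for features x1 and x2; both differ between the classes.
p_one = {"c1": (0.9, 0.9), "c2": (0.1, 0.1)}


def cond(i, x, c):
    """Return p(x_i = x | c) for a binary feature."""
    p = p_one[c][i]
    return p if x == 1 else 1.0 - p


def joint(x1, x2):
    """Return p(x1, x2), built under class conditional independence."""
    return sum(priors[c] * cond(0, x1, c) * cond(1, x2, c) for c in priors)


def marginal(i, x):
    """Return p(x_i = x), obtained by summing over the classes."""
    return sum(priors[c] * cond(i, x, c) for c in priors)


# Largest gap between p(x1, x2) and p(x1) p(x2) over all feature values.
max_gap = max(
    abs(joint(a, b) - marginal(0, a) * marginal(1, b))
    for a, b in product((0, 1), repeat=2)
)
print(f"max |p(x1, x2) - p(x1) p(x2)| = {max_gap:.4f}")
```

With these assumed values the gap is clearly non-zero, so the class unconditional independence fails as soon as the class conditional independence is imposed.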
A totally different situation arises if one
does not assume p(xi | ck) ≠ p(xi | cn). A probability system can then be
built in which both the class conditional
independence and the class unconditional independence hold
without any problem. But such a system of classes is usually awkward
and unrealistic. Here is one example:

If p(xi) is uniform over the four regions,
then inside each class p(xi | ck) is also uniform.
Furthermore, if X1 and X2 are independent, then they are automatically
independent inside each class. This is a very specific example,
which is possible only because, for one variable, the conditional probability
is the same from one class to another. Note also that this example
would still work if the p(xi | ck) were Gaussian: one would then
obtain a bizarre global density function with four extrema.
If the four Gaussians are independent and suitably centered in each region,
then again one could make the two independence assumptions. Finally,
in these two examples, one can immediately see that the information does decompose
(i.e. I(X) = I(x1) + I(x2)).
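The compatible case can be sketched in the same discrete way. The code below is a minimal illustration, not the four-region example itself: it assumes two hypothetical classes in which x1 has the same conditional distribution in both classes and only x2 depends on the class. It also reads I as Shannon entropy, which is an assumption about the notation.

```python
from itertools import product
from math import log2

# Hypothetical two-class system; all numbers are illustrative assumptions.
priors = {"c1": 0.5, "c2": 0.5}
p_x1_one = {"c1": 0.3, "c2": 0.3}  # p(x1 = 1 | c_k): identical in both classes
p_x2_one = {"c1": 0.9, "c2": 0.1}  # p(x2 = 1 | c_k): depends on the class


def bernoulli(p, x):
    """Return the probability of outcome x for a binary variable with p(1) = p."""
    return p if x == 1 else 1.0 - p


def joint(x1, x2):
    """Return p(x1, x2), built under class conditional independence."""
    return sum(
        priors[c] * bernoulli(p_x1_one[c], x1) * bernoulli(p_x2_one[c], x2)
        for c in priors
    )


def marginal_x1(x):
    """Return p(x1 = x), summed over the classes."""
    return sum(priors[c] * bernoulli(p_x1_one[c], x) for c in priors)


def marginal_x2(x):
    """Return p(x2 = x), summed over the classes."""
    return sum(priors[c] * bernoulli(p_x2_one[c], x) for c in priors)


def entropy(probs):
    """Return the Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


values = (0, 1)
pairs = list(product(values, repeat=2))

# Gap between p(x1, x2) and p(x1) p(x2): zero up to rounding in this case.
max_gap = max(abs(joint(a, b) - marginal_x1(a) * marginal_x2(b)) for a, b in pairs)

# Entropy of the joint distribution versus the sum of the marginal entropies.
h_joint = entropy([joint(a, b) for a, b in pairs])
h_sum = entropy([marginal_x1(a) for a in values]) + entropy(
    [marginal_x2(b) for b in values]
)
print(f"max gap = {max_gap:.2e}, H(X) = {h_joint:.4f}, H(x1) + H(x2) = {h_sum:.4f}")
```

Here both independence assumptions hold at once and the two entropy values agree, which is the decomposition described above under the entropy reading of I.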