On the contradiction between class-unconditional and class-conditional independence







Let a pattern recognition system be composed of $m$ classes $c_k$, $1 \le k \le m$, and of a feature vector $X = (x_1, \dots, x_n)$ of $n$ components. In such a system, two different independence assumptions can be made to simplify certain computations.

1) Class-unconditional independence:

$$p(X) = \prod_{i=1}^{n} p(x_i)$$

or

2) Class-conditional independence:

$$p(X \mid c_k) = \prod_{i=1}^{n} p(x_i \mid c_k), \quad 1 \le k \le m$$




It is little known, however, that these two assumptions are in general mutually contradictory. The next theorem states this more formally.
 
 
 
 

Theorem (contradiction of independence):

If $p(x_i \mid c_k) \ne p(x_i \mid c_n)$ for $k \ne n$, then the two independence assumptions above cannot be fulfilled simultaneously.

Proof:

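By definition we have

$$p(x_i, x_j) = \sum_{k=1}^{m} p(x_i, x_j \mid c_k)\, p(c_k)$$

and

$$p(x_i) = \sum_{k=1}^{m} p(x_i \mid c_k)\, p(c_k).$$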

1) Now assume class-unconditional independence.

Then

$$p(x_i, x_j) = \sum_{k=1}^{m} p(x_i, x_j \mid c_k)\, p(c_k) = p(x_i)\, p(x_j).$$


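2) On the other hand, if class-conditional independence were true, then we would have

$$p(x_i, x_j \mid c_k) = p(x_i \mid c_k)\, p(x_j \mid c_k)$$

which implies

$$p(x_i, x_j) = \sum_{k=1}^{m} p(x_i \mid c_k)\, p(x_j \mid c_k)\, p(c_k)$$

which, combined with the class-unconditional independence assumed in the first part of the proof, would imply

$$\sum_{k=1}^{m} p(x_i \mid c_k)\, p(x_j \mid c_k)\, p(c_k) = p(x_i)\, p(x_j)$$

or

$$\sum_{k=1}^{m} \frac{p(x_i \mid c_k)}{p(x_i)}\, p(x_j \mid c_k)\, p(c_k) = p(x_j) = \sum_{k=1}^{m} p(x_j \mid c_k)\, p(c_k).$$
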

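But for this to be true in general, $p(x_i \mid c_k) / p(x_i)$ needs to be constant (and therefore equal to 1) for all $k$, that is, $p(x_i \mid c_k) = p(x_i)$ for every class. We have assumed otherwise ($p(x_i \mid c_k) \ne p(x_i \mid c_n)$), so the assumption $p(x_i, x_j \mid c_k) = p(x_i \mid c_k)\, p(x_j \mid c_k)$ cannot be true, hence the theorem.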
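
The contradiction can be checked numerically on a small discrete system. The sketch below is illustrative only: the function name `independence_gap` and all probability values are assumptions, not taken from the text. It builds a joint distribution that is class-conditionally independent by construction, marginalises over the classes, and measures how far the result is from the product of its marginals.

```python
import numpy as np

def independence_gap(p_c, p_x1_given_c, p_x2_given_c):
    """Largest |p(x1, x2) - p(x1) p(x2)| when class-conditional
    independence holds by construction (hypothetical helper)."""
    # p(x1, x2) = sum_k p(c_k) p(x1 | c_k) p(x2 | c_k)
    joint = np.einsum('k,ki,kj->ij', p_c, p_x1_given_c, p_x2_given_c)
    p_x1 = joint.sum(axis=1)   # marginal of x1
    p_x2 = joint.sum(axis=0)   # marginal of x2
    return float(np.abs(joint - np.outer(p_x1, p_x2)).max())

# Illustrative numbers: two classes, two binary features.
p_c = np.array([0.5, 0.5])

# Case A: both features are distributed differently in each class.
p_x1_a = np.array([[0.9, 0.1],    # p(x1 | c_1)
                   [0.2, 0.8]])   # p(x1 | c_2)
p_x2_a = np.array([[0.7, 0.3],    # p(x2 | c_1)
                   [0.4, 0.6]])   # p(x2 | c_2)
gap_different = independence_gap(p_c, p_x1_a, p_x2_a)

# Case B: x1 has the same distribution in every class.
p_x1_b = np.array([[0.55, 0.45],
                   [0.55, 0.45]])
gap_identical = independence_gap(p_c, p_x1_b, p_x2_a)

print(gap_different, gap_identical)
```

With these numbers, case A leaves a gap of about 0.05 between the joint and the product of its marginals, so class-unconditional independence fails. In case B the gap vanishes up to rounding, because $p(x_1 \mid c_k) = p(x_1)$ for every class.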
 
 

    A totally different situation arises if one does not assume $p(x_i \mid c_k) \ne p(x_i \mid c_n)$. Then a probability system can be built in which both class-conditional and class-unconditional independence hold without any problem. Such a system of classes, however, is usually awkward and unrealistic. Here is one example:

    If $p(x_i)$ is uniform over the four regions, then inside each class $p(x_i \mid c_k)$ is also uniform. Furthermore, if $x_1$ and $x_2$ are independent, then they are automatically independent inside each class. This is a very specific example, which is possible only because the conditional probability of one variable is the same from one class to another. Note also that this example would work if the $p(x_i \mid c_k)$ were Gaussian: one would then obtain a bizarre global density function with four extrema. If the four Gaussians are independent and suitably centered in each region, then again one could make both independence assumptions. Finally, in these two examples one can immediately see that the information does decompose, i.e. $I(X) = I(x_1) + I(x_2)$.
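
The uniform example can be sketched on a discrete grid. The layout is an assumption, since the original figure is not available: the four regions are taken to be the four quadrants of a square, one class per quadrant, and $I(\cdot)$ is read as Shannon entropy in bits. The helper names `is_product` and `entropy_bits` are hypothetical.

```python
import numpy as np

n = 4                                    # grid cells per axis (illustrative)
joint = np.full((n, n), 1.0 / n**2)      # uniform p(x1, x2) over the square

# Assumed layout: class k occupies one quadrant of the grid.
x1_idx, x2_idx = np.indices((n, n))
labels = 2 * (x1_idx // (n // 2)) + (x2_idx // (n // 2))

def is_product(p):
    """True if the 2-D table p equals the outer product of its marginals."""
    return bool(np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0))))

def entropy_bits(p):
    """Shannon entropy in bits of a probability table."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Class-unconditional independence: p(x1, x2) = p(x1) p(x2)
unconditional_ok = is_product(joint)

# Class-conditional independence: p(x1, x2 | c_k) = p(x1 | c_k) p(x2 | c_k)
conditional_ok = all(
    is_product(np.where(labels == k, joint, 0.0) / joint[labels == k].sum())
    for k in range(4)
)

# Information decomposition: I(X) versus I(x1) + I(x2)
h_joint = entropy_bits(joint)
h_sum = entropy_bits(joint.sum(axis=1)) + entropy_bits(joint.sum(axis=0))

print(unconditional_ok, conditional_ok, h_joint, h_sum)
```

Under these assumptions both independence checks pass and the joint entropy equals the sum of the marginal entropies. Classes lying in the same column of quadrants share the same $p(x_1 \mid c_k)$, which is the coincidence the example relies on.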