Most pattern recognition systems perform two distinct functions:
In a general pattern recognition system, the features extracted
from each pattern are properties which have been selected on the basis
of one or both of the following (and usually conflicting criteria):
In order to simplify the classification problem, features are chosen to (1) cluster the patterns in their respective classes, and (2) separate the classes. These two aims may also conflict with each other.
Performance Measures
It is not feasible to choose a set of features to directly minimize the probability of error in identifying classes. The error rate is a function of the classification technique as well as the feature extraction method. A weaker criterion must be used to evaluate a set of features.
Several performance measure can be use to choose a good set of features. Features can be chosen to maximize the "distance" in a probability space between pairs of classes. That is, features are extracted from a pair of classes of patterns, and then a distance measure is used to measure the separation of the distributions of the features of the two classes. Two distance measures which have been used for pattern recognition are the divergence function and the Fisher linear discriminant function. Here we will discuss another criterion: the Shannon's information measure. It measures the values of a group of features for identifying a set of classes of patterns. However, it usually requires a fair amount of computation, and that is why some authors have tried to simplified it. We will show that some assumption leads wrongful method for selecting set of features based on the performance of individual features.
The mutual information is defined this way
where C is the random variable associated with the class of a given pattern and X is the feature vector used by the recognition system. H(C) and H(C | X) are respectively the entropy of the class and the conditional entropy of the class knowing the feature vector. Therefore the mutual information compute the difference of the incertitude to which a class pattern belong before the measurement of the feature vector X and after the measurement of X. The mutual information can therefore be used to estimate the quality of a feature set (i.e. its potential to minimizes the error of classification of a pattern). And since it requires a fair amount of computation, certain author have tried to simplified its computation by assuming both class conditional independence and class unconditional independence. Under both these two assumptions, the mutual information of C and of the feature vector X decompose into the sum of the mutual information of C and of the individual feature (xi) of the feature vector X. But unfortunately, both these condition of independence are, in general, contradictory (see this theorem). And then, if they can not both be assumed, the mutual information does not decompose into the sum of the mutual information of the individual features (see this theorem). This implies that to select a group of suitable feature, one must compute I(C | X), and one can not try to evaluate a group of feature by evaluating the feature individually; the statistical dependencies among the features may increase or decrease the mutual information of the mutual information.
Final note, the proof in the following two pages
are all done for discrete probability density function, but they can all
be extended quite obviously and easily to the continuous case.
References
[1] G. T. Toussaint, Comments on " A Modified Figure of Merit for Feature Selection in Pattern Recognition", IEEE Transactions on information theory, september, 1971
[2] J. E. Paul, Jr, A. J. Goetze, and J. B. O'Neal, Jr., A modified figure of merit for feature selection in pattern recognition, IEEE Trans. Inform. Theory (Corresp.), vol. IT-16, July 1970, pp. 504-507.
[3] T. L. Barabash, On properties of symbol recognition, Eng. Cybern. (USSR), Sept./Oct. 1965, pp 71-77.
[4] H. F. Ryan, The information content measure as a performance criterion for feature selection, IEEE Proc. 7th symp. Adaptive Processes, Los Angeles, Calif., Dec. 16-18, 1968.
[5] Douglas R. Stinson, Cryptography: Theory and Practice, CRC Press LLc, 1995,pp 44-64.