Independent Binary Features

As mentioned previously, a pattern recognition algorithm often outputs a feature vector describing the observed item. For instance, both the MIT reading machine for the blind and the cheque recognition procedure output a feature vector of dimension d. If each feature in the vector is binary and is assumed (correctly or incorrectly) to be independent of the others, a simplification of Bayes' rule can be employed, as developed below.

The 2 Category Case

Here, we consider a two-category problem in which the components of the feature vector x = (x_1, \ldots, x_d)^t are binary-valued and conditionally independent, which yields a simplified decision rule:

Decide \omega_1 if P(\omega_1 \mid x) > P(\omega_2 \mid x); otherwise decide \omega_2.
We also assign the following probabilities, p_i and q_i, to each component x_i of x:

p_i = P(x_i = 1 \mid \omega_1)

and

q_i = P(x_i = 1 \mid \omega_2)
If p_i > q_i, we expect x_i to be 1 more frequently when the state of nature is \omega_1 than when it is \omega_2. Under the conditional-independence assumption, we can write P(x \mid \omega_i) as the product of the probabilities for the components of x. The class-conditional probabilities are then:

P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i}

and

P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i}
Let’s explain the first equation. For any x_i, if it equals 1, then the factor (1 - p_i)^{1 - x_i} equals 1, so only the factor p_i^{x_i} = p_i contributes; this makes sense, since p_i is the probability that x_i = 1. If x_i = 0, only the second factor contributes, and 1 - p_i is the probability that x_i = 0. So, for every x_i, the appropriate probability enters the final product.
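As a minimal sketch of this product (in Python, with made-up values for the p_i and q_i; none of these numbers come from the text above):

```python
import numpy as np

def class_conditional(x, p):
    """P(x | class) for binary features, where p[i] = P(x_i = 1 | class),
    under the conditional-independence assumption."""
    x = np.asarray(x)
    p = np.asarray(p, dtype=float)
    # Each factor is p_i when x_i = 1 and (1 - p_i) when x_i = 0.
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))

# Made-up d = 3 example: p_i for class w1, q_i for class w2.
p = [0.8, 0.6, 0.7]
q = [0.3, 0.4, 0.2]
x = [1, 0, 1]
print(class_conditional(x, p))  # P(x | w1) = 0.8 * 0.4 * 0.7 = 0.224
print(class_conditional(x, q))  # P(x | w2) = 0.3 * 0.6 * 0.2 = 0.036
```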



Since this is a two-class problem, the discriminant function is g(x) = g_1(x) - g_2(x), where:

g_1(x) = \ln P(x \mid \omega_1) + \ln P(\omega_1)

and

g_2(x) = \ln P(x \mid \omega_2) + \ln P(\omega_2)
The likelihood ratio is therefore given by:

\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} = \prod_{i=1}^{d} \left( \frac{p_i}{q_i} \right)^{x_i} \left( \frac{1 - p_i}{1 - q_i} \right)^{1 - x_i}
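As a minimal sketch (reusing the made-up p_i, q_i, and x from the previous example):

```python
import numpy as np

def likelihood_ratio(x, p, q):
    """P(x | w1) / P(x | w2) as a product of per-feature factors
    (p_i / q_i)^{x_i} * ((1 - p_i) / (1 - q_i))^{1 - x_i}."""
    x, p, q = (np.asarray(a, dtype=float) for a in (x, p, q))
    factors = (p / q) ** x * ((1 - p) / (1 - q)) ** (1 - x)
    return float(np.prod(factors))

print(likelihood_ratio([1, 0, 1], [0.8, 0.6, 0.7], [0.3, 0.4, 0.2]))
# 0.224 / 0.036, roughly 6.22
```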
Taking logarithms then yields the discriminant function g(x) = g_1(x) - g_2(x) as follows:

g(x) = \sum_{i=1}^{d} \left[ x_i \ln \frac{p_i}{q_i} + (1 - x_i) \ln \frac{1 - p_i}{1 - q_i} \right] + \ln \frac{P(\omega_1)}{P(\omega_2)}
Noticing that this function is linear in the x_i, we can rewrite it as a linear function of the x_i:

g(x) = \sum_{i=1}^{d} w_i x_i + w_0

where

w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad i = 1, \ldots, d

and

w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}



The discriminant function g(x) therefore indicates whether the current feature vector belongs to class 1 or class 2: decide \omega_1 if g(x) > 0, and \omega_2 otherwise. It is important to note that w_0 and the w_i are the weights calculated for the linear discriminant. A decision boundary lies wherever g(x) = 0. This decision boundary can be a line, a plane, or a hyperplane, depending on the dimension of the feature space.
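A minimal sketch of the resulting linear classifier (in Python; the probabilities and the equal priors are made-up values, not from the text):

```python
import numpy as np

def weights(p, q, prior1, prior2):
    """Weights w_i and w_0 of the linear discriminant, as defined above."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    w = np.log(p * (1 - q) / (q * (1 - p)))                          # w_i
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)  # w_0
    return w, w0

def g(x, w, w0):
    """Linear discriminant; decide w1 if g(x) > 0, else w2."""
    return float(np.dot(w, x) + w0)

# Made-up class-conditional probabilities and equal priors.
w, w0 = weights(p=[0.8, 0.6, 0.7], q=[0.3, 0.4, 0.2], prior1=0.5, prior2=0.5)
x = [1, 0, 1]
score = g(x, w, w0)
print("g(x) =", score, "-> decide", "w1" if score > 0 else "w2")
```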

The decision boundary g(x) = 0 is a line in the Cartesian plane for a two-dimensional (d = 2) feature space. In a three-dimensional (d = 3) feature space, the decision boundary g(x) = 0 is a plane.
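For example, in the d = 2 case, setting g(x) = w_1 x_1 + w_2 x_2 + w_0 = 0 and solving for x_2 (assuming w_2 \neq 0) gives the boundary line explicitly:

x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}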




The Multicategory Case

Once there are more than two potential classes to classify the data into, the problem becomes more difficult. The two-category procedure above no longer yields the correct answer, since its likelihood ratio compares only two possible states of nature. Instead, the discriminant functions shown previously must be utilized so that every g_i(x) is considered, and the class with the largest g_i(x) is selected (a minimal sketch follows).
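A minimal sketch of this multicategory rule (in Python; the three-class probabilities and priors are made-up for illustration):

```python
import numpy as np

def g_i(x, p_i, prior_i):
    """Per-class discriminant g_i(x) = ln P(x | w_i) + ln P(w_i)
    for conditionally independent binary features."""
    x = np.asarray(x)
    p_i = np.asarray(p_i, dtype=float)
    log_lik = np.sum(np.where(x == 1, np.log(p_i), np.log(1 - p_i)))
    return float(log_lik + np.log(prior_i))

# Made-up three-class problem: one probability vector per class.
P = [[0.8, 0.6, 0.7],
     [0.3, 0.4, 0.2],
     [0.5, 0.9, 0.1]]
priors = [0.4, 0.3, 0.3]
x = [1, 0, 1]
scores = [g_i(x, p, prior) for p, prior in zip(P, priors)]
print("decide class", int(np.argmax(scores)) + 1)
```

However, this approach can be used with the following trick: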