We have all had the experience of waiting for a friend in a crowded place. The friend is late, so we scan the crowd hoping to recognize them from far away. Let's say the friend has short, dark hair, wears glasses, and often wears a favorite green sweater. Then, consciously or unconsciously, we use what we know about our friend, like the short hair and glasses, to try to pick them out of the crowd. We have all also had a case of mistaken identity: perhaps a stranger is also wearing a green sweater, or has glasses and short, dark hair, and it is only upon glancing at another of the stranger's features that we notice the difference.
This is exactly the problem of classification. Given a set of features, we would like to classify an object with a high probability of success; that is, we do not want to misclassify it. When a computer program makes the classification, we would like to limit the number of features we consider. This is not only to cut down on computation time and storage space, but also because the amount of data these techniques need in order to make good predictions grows exponentially with the number of features, and data can be expensive to gather.
This leads us to the question: if we only want to base our decision on k features, which k features are the best to base the decision on? One idea is simply to take the k individually best features. Unfortunately, we will see that this does not give the best result, even when the features do not depend on each other. But before we dive further into this discussion, we need to define what we mean by words like 'best'.
In the example above we presented a scenario where we wanted to take the information we had about an object and use it to classify that object as a certain thing. This section gives us a language for describing the problem, so that we can know, given a certain amount of information, the probability that we make an incorrect classification. The contents of this section can be found in any elementary probability text; a reader familiar with probability theory may skip to the next section. <LINK>
Let Y be the property that we wish to guess. For simplicity, we let Y take either the value 0 or 1.
For example :
Y=1 if the person we meet is our friend.
Y=0 if not.
Let X be a vector of d measurements, X = (X_{1}, X_{2}, ..., X_{d}).
For example :
with d=3, X=(Dark hair, Glasses, Sweater)
We will consider the binary case; that is, each measurement X_{i} takes either the value 0 or 1:
For example :
X_{1} : Dark hair
X_{2} : Glasses
X_{3} : Sweater
P(Y | X), “probability of Y given X”
This is the probability that the event Y occurs given some evidence X. Roughly speaking, it measures how likely it is that the person we see is our friend, given the evidence we have collected about them.
For example :
P( Y=1 | X=(1.25, 4/3, 2) ) = 0.9
is the probability that the person we meet is our friend, given that their hair color has 1.25% luminosity, they wear 4 by 3 glasses, and they wear a sweater of type 2 (green with pink spots!).
Now, suppose we want to take the final decision: "Is this really our friend?" If all of our features are close to what we think they should be, then there is a high probability that it is indeed our friend. But if some of the measurements are not what we think they should be, then the probability of it being our friend decreases. This intuitive idea of choosing the 'most likely' decision is exactly what is captured by the Bayes decision rule.
The Bayes decision rule tells us to take the choice with the maximum a posteriori probability.
We can define the probability of making the correct choice given the evidence x we saw:

P_{c}^{B}(x) = max( P(Y=0 | X=x), P(Y=1 | X=x) )

And the overall probability of making the right choice:

P_{c}^{B} = Σ_{x} P_{c}^{B}(x) P(X=x)

so,

P_{c}^{B} = Σ_{x} max( P(Y=0 | X=x), P(Y=1 | X=x) ) P(X=x)
For example :
P( Y=1 | X=(1.25, 4/3, 2) ) = 0.9 and P( Y=0 | X=(1.25, 4/3, 2) ) = 0.1
So we will choose Y=1, that is, "this is our friend!" And the probability of being correct is P_{c}^{B} = 0.9.
The Bayes probability of error is just the probability that we make the wrong choice when using the Bayes decision rule. So this is just one minus the probability of making the correct guess, defined above.
We can define the conditional Bayes probability of error given x:

P_{e}^{B}(x) = 1 - P_{c}^{B}(x) = min( P(Y=0 | X=x), P(Y=1 | X=x) )

And the Bayes probability of error:

P_{e}^{B} = 1 - P_{c}^{B}

so,

P_{e}^{B} = Σ_{x} min( P(Y=0 | X=x), P(Y=1 | X=x) ) P(X=x)
For example :
P( Y=1 | X=(1.25, 4/3, 2) ) = 0.9 and P( Y=0 | X=(1.25, 4/3, 2) ) = 0.1
So the conditional probability of error is P_{e}^{B}(x) = min(0.9, 0.1) = 0.1
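These definitions are easy to state in code. The sketch below (our own Python, using the made-up posterior values from the example above) applies the Bayes decision rule, picking the class with the maximum a posteriori probability, and reports the conditional probability of error:

```python
# Bayes decision rule for a binary class Y given one piece of evidence x.
# The posteriors are the made-up numbers from the example:
# P(Y=1 | x) = 0.9 (friend), P(Y=0 | x) = 0.1 (not the friend).

def bayes_decide(posterior_y0, posterior_y1):
    """Return (chosen class, conditional probability of error)."""
    choice = 1 if posterior_y1 >= posterior_y0 else 0
    # With two classes, 1 - max(...) is the same as min(...).
    p_error = min(posterior_y0, posterior_y1)
    return choice, p_error

choice, p_error = bayes_decide(0.1, 0.9)
print(choice)   # 1 -> "this is our friend!"
print(p_error)  # 0.1
```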
Now that the notation has been explained, it should be clear that when we say the 'best' feature we mean the single feature that minimizes the Bayes probability of error. Furthermore, when we say the k best features we are referring to the feature vector, X, where |X| = k, which minimizes the Bayes error.
Consider again the idea of taking the best k features and constructing our feature vector X from these best k. If two best features depend heavily on one another, that is, we can observe some large correlation between the two then conceptually one contains some of the information of the other. What this means is that when considering the two together as a predictor we get very little decrease in error compared to either single one of the two. This is because we already 'know' much of what the second feature is going to predict based on the first.
As an example lets go back to waiting for our friend. Say the friend has straight hair and it is also dark. If there are not so many dark haired people around then it would make it easier to spot our friend, as would be the case if there were fewer straight haired people. But lets say that ninety five percent of the people with dark hair also have straight hair, then using both as criteria would not give us any better results than simply using one, but using one in conjunction with another criteria, such as wearing a green sweater, would help us recognize our friend more easily. Thus, it would appear that the problem in this type of classification is simply the fact that the features are dependent on each other. However, this is not the case.
In 1967, Elashoff published a paper presenting an example of three independent features for which e(X_{1}) ≤ e(X_{2}) ≤ e(X_{3}) but e(X_{1}, X_{3}) < e(X_{1}, X_{2}). That is, the best two are not the two best.
In 1971, Toussaint extended this further. He showed that for three independent features there exists a case where e(X_{1}) ≤ e(X_{2}) ≤ e(X_{3}) but e(X_{2}, X_{3}) < e(X_{1}, X_{2}) and e(X_{2}, X_{3}) < e(X_{1}, X_{3}). That is, not only are the two best not the best two, but the best single feature need not be in the best k at all. It is this example that we show.
Let p_{i} = P(X_{i}=1 | Y=0) and q_{i} = P(X_{i}=1 | Y=1), with equal priors P(Y=0) = P(Y=1) = 1/2. Then the Bayes error of the single feature X_{i} is given by

e(X_{i}) = (1/2) [ min(p_{i}, q_{i}) + min(1-p_{i}, 1-q_{i}) ]
Elashoff had also shown previously that for two independent features X_{i} and X_{j}, the probability of error of the two together is given by

e(X_{i}, X_{j}) = (1/2) Σ_{a,b ∈ {0,1}} min( P(X_{i}=a | Y=0) P(X_{j}=b | Y=0), P(X_{i}=a | Y=1) P(X_{j}=b | Y=1) )
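These two formulas are straightforward to implement. The sketch below is our own Python, using illustrative probabilities rather than Toussaint's published values; it computes the single-feature and pairwise Bayes errors for independent binary features with equal priors:

```python
from itertools import product

def single_error(p, q):
    """Bayes error of one binary feature with P(X=1|Y=0)=p, P(X=1|Y=1)=q,
    assuming equal priors P(Y=0)=P(Y=1)=1/2."""
    return 0.5 * (min(p, q) + min(1 - p, 1 - q))

def pair_error(p_i, q_i, p_j, q_j):
    """Bayes error of two independent binary features, equal priors."""
    total = 0.0
    for a, b in product([0, 1], repeat=2):
        # Class-conditional probability of observing (X_i=a, X_j=b).
        like0 = (p_i if a else 1 - p_i) * (p_j if b else 1 - p_j)
        like1 = (q_i if a else 1 - q_i) * (q_j if b else 1 - q_j)
        total += min(like0, like1)
    return 0.5 * total

# Illustrative values (not Toussaint's): note that adding a second
# feature never raises the Bayes error.
print(single_error(0.8, 0.3))          # ≈ 0.25
print(pair_error(0.8, 0.3, 0.2, 0.6))  # ≈ 0.24
```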
Toussaint exhibited specific values of p_{i} and q_{i} for the three features such that algebraic calculation with our equations for the Bayes error yields e(X_{1}) ≤ e(X_{2}) ≤ e(X_{3}) while e(X_{2}, X_{3}) is smaller than both e(X_{1}, X_{2}) and e(X_{1}, X_{3}), which gives the desired result.
In 1976, Thomas Cover gave another example of a case where the best k measurements are not the k best. His example differs from Toussaint's in that it is based on independent observations of the same two experiments, rather than three different independent measurements.
First of all, let's suppose that we have some prior knowledge about the population of people we will meet today. For example, imagine that we are at a party where we know that half of the people are our friends.
So P(Y=0) = P(Y=1) = 1/2.
Let's consider what happens in each case (Y=0 and Y=1). These are the class-conditional probabilities: the probability of seeing the evidence given that we already know whether the person is a friend or not.
Y=0 (this is not a friend):
We observe X_{1}=1 with probability p_{0} and X_{2}=1 with probability r_{0}.
Y=1 (this is a friend):
We observe X_{1}=1 with probability p_{1} and X_{2}=1 with probability r_{1}.
X_{1} and X_{2} are any observations we could make on the person, like X_{1} = « dark hair » and X_{2} = « glasses ».
The Bayes probability of error if we take our decision according to X_{1} alone is (see the notation section):

P_{e}(X_{1}) = Σ_{x} min( P(Y=0 | X_{1}=x), P(Y=1 | X_{1}=x) ) P(X_{1}=x)

= Σ_{x} min( P(X_{1}=x | Y=0) P(Y=0), P(X_{1}=x | Y=1) P(Y=1) )   (Bayes rule)

= (1/2) [ min(p_{0}, p_{1}) + min(1-p_{0}, 1-p_{1}) ]

= (1/2) (1 - |p_{0} - p_{1}|)
In the same way, we can compute the probability of error made using only X_{2}:

P_{e}(X_{2}) = (1/2) [ min(r_{0}, r_{1}) + min(1-r_{0}, 1-r_{1}) ] = (1/2) (1 - |r_{0} - r_{1}|)
Now we can ask ourselves whether collecting the evidence X_{1} twice, in an independent manner, could give us more information. So let X'_{1} (respectively X'_{2}) be a second, independent observation of X_{1} (respectively X_{2}).
P_{e}(X_{1}, X'_{1}) = (1/2) Σ_{x,x'} min( P(X_{1}=x | Y=0) P(X'_{1}=x' | Y=0), P(X_{1}=x | Y=1) P(X'_{1}=x' | Y=1) )

= (1/2) [ min(p_{0}^{2}, p_{1}^{2}) + 2 min(p_{0}(1-p_{0}), p_{1}(1-p_{1})) + min((1-p_{0})^{2}, (1-p_{1})^{2}) ]
In the same way,

P_{e}(X_{2}, X'_{2}) = (1/2) [ min(r_{0}^{2}, r_{1}^{2}) + 2 min(r_{0}(1-r_{0}), r_{1}(1-r_{1})) + min((1-r_{0})^{2}, (1-r_{1})^{2}) ]
And finally, we can consider gathering both X_{1} and X_{2} to take a decision:

P_{e}(X_{1}, X_{2}) = (1/2) [ min(p_{0} r_{0}, p_{1} r_{1}) + min(p_{0}(1-r_{0}), p_{1}(1-r_{1})) + min((1-p_{0}) r_{0}, (1-p_{1}) r_{1}) + min((1-p_{0})(1-r_{0}), (1-p_{1})(1-r_{1})) ]
Now that we have the probabilities of error, let's look at a particular example:
p_{0}=0.96, p_{1}=0.04, r_{0}=0.9 and r_{1}=0
That gives us, just by plugging the numbers into the formulae:

P_{e}(X_{1}) = 0.04 ≤ P_{e}(X_{2}) = 0.05

and

P_{e}(X_{2}, X'_{2}) = 0.005 < P_{e}(X_{1}, X_{2}) = 0.022 < P_{e}(X_{1}, X'_{1}) = 0.040
So two observations of the best experiment (checking X_{1}) do not change anything: the error we make stays the same. On the other hand, two observations of the worst experiment (checking X_{2}) decrease the probability of error by a factor of 10! So if we are allowed two observations, the best thing to do is actually to base our decision on two independent observations of the single "worst" feature, X_{2}.
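The computations above can be checked with a short script. The following sketch is our own Python, not code from Cover's paper; it implements the three error formulas and plugs in p_{0}=0.96, p_{1}=0.04, r_{0}=0.9, r_{1}=0:

```python
# Bayes errors for Cover's example, with equal priors P(Y=0)=P(Y=1)=1/2.
# a0, a1 denote P(feature=1 | Y=0) and P(feature=1 | Y=1).

def err_single(a0, a1):
    """Error when deciding from one observation of a feature."""
    return 0.5 * (min(a0, a1) + min(1 - a0, 1 - a1))

def err_repeated(a0, a1):
    """Error when deciding from two independent observations of one feature."""
    return 0.5 * (min(a0 ** 2, a1 ** 2)
                  + 2 * min(a0 * (1 - a0), a1 * (1 - a1))
                  + min((1 - a0) ** 2, (1 - a1) ** 2))

def err_pair(p0, p1, r0, r1):
    """Error when deciding from one observation of each feature."""
    return 0.5 * (min(p0 * r0, p1 * r1)
                  + min(p0 * (1 - r0), p1 * (1 - r1))
                  + min((1 - p0) * r0, (1 - p1) * r1)
                  + min((1 - p0) * (1 - r0), (1 - p1) * (1 - r1)))

p0, p1, r0, r1 = 0.96, 0.04, 0.9, 0.0
print(err_single(p0, p1))        # ≈ 0.04
print(err_single(r0, r1))        # ≈ 0.05
print(err_repeated(p0, p1))      # ≈ 0.040  (no improvement over one look)
print(err_repeated(r0, r1))      # ≈ 0.005  (10 times better)
print(err_pair(p0, p1, r0, r1))  # ≈ 0.022
```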
To make a concrete example: you're at the party, and people are dancing around so that you can't see very well the person you are trying to recognize. At first you catch a quick glimpse of dark hair (X_{1}), but you're not sure. With this piece of information alone you have a 0.04 chance of making a mistake. In the same way, if you only saw quickly that the person has glasses (X_{2}), you have a 0.05 chance of making an error. With both pieces of information you have a better chance of succeeding (that makes sense), with a probability of error of 0.022.
Now, let's imagine that you only saw that the person has glasses (X_{2}). You might think, "well, I'm not sure, let me check again." So you wait for the same person to pass in front of you and see a second time that they have glasses. Now the probability that you make a mistake is only 0.005, so you can be 10 times more confident that this person really is one of your friends.
Cover asserts in his paper that the space of parameters for which the best k measurements are not the k best is large. In particular, he states that "It is not necessary to choose r_{1}=0". To test his assertion, we have written a calculator that allows you to play with the values of his parameters and outputs the Bayes error for every two-observation combination of the experiments.
This is a JavaScript calculator that enables you to play with the probability distribution. Just change the values of the class-conditional probabilities and directly observe the effect on the Bayes error probabilities and the ordering of the sets, from the "best" to the "worst" set. Have fun!
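In the same spirit as the calculator, the orderings can be explored programmatically. The sketch below (our own Python, independent of the JavaScript on this page) perturbs r_{1} away from 0 and checks that the ordering in Cover's example survives, as he asserts:

```python
# Cover's example with r1 perturbed away from 0: two looks at the "worst"
# experiment still beat every other two-observation combination.

def err_repeated(a0, a1):
    """Error from two independent observations of one feature, equal priors."""
    return 0.5 * (min(a0 ** 2, a1 ** 2)
                  + 2 * min(a0 * (1 - a0), a1 * (1 - a1))
                  + min((1 - a0) ** 2, (1 - a1) ** 2))

def err_pair(p0, p1, r0, r1):
    """Error from one observation of each of the two features, equal priors."""
    return 0.5 * (min(p0 * r0, p1 * r1)
                  + min(p0 * (1 - r0), p1 * (1 - r1))
                  + min((1 - p0) * r0, (1 - p1) * r1)
                  + min((1 - p0) * (1 - r0), (1 - p1) * (1 - r1)))

p0, p1, r0 = 0.96, 0.04, 0.9
for r1 in (0.0, 0.01, 0.02):
    e22 = err_repeated(r0, r1)      # two looks at X2
    e12 = err_pair(p0, p1, r0, r1)  # one look at each
    e11 = err_repeated(p0, p1)      # two looks at X1
    print(r1, e22 < e12 < e11)      # the ordering survives each time
```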
Although the JavaScript calculator above lets you see the different possible orderings under different probability distributions, it does not provide a good graphical visualization. Our applet examines Cover's example more carefully, providing a graphical way to look at the possible orderings.
Link to our applet : click here...