Minimizing the Perceptron Criterion Function



What is a perceptron? What does it have to do with pattern classification?
Click here to for a great tutorial which will answer all your questions...

How does Linear Programming fit in? (Duda and Hart)

In most pattern classification applications we cannot assume that the samples are linearly separable. When the patterns are not separable by a hyperplane, we would still like to obtain a wieght vector that classifies as many samples correctly as possible. It turns out that the number of errors (samples incorrectly classified) is not a linear function of the components of the weight vector. Thus trying to minimize the number of errors is not a linear programming problem.

The problem of minimizing the perceptron criterion function can be formulated as a linear program. Why is this useful? Well, the minimization of the perceptron criterion function yields a separating vector on the separable case and a "reasonably good" solution in the inseparable case.

The basic perceptron criterion function is given by:
Jp(a) yiY (-aTyi)

Where Y(a) is the set of samples misclassified by a
If a = 0 we obtain a useless solution. To avoid this, we introduce a positive vector b:
J'p(a) yiY' (bi - aTyi)

where yiY' if aTyi bi. J'p(a) is a piecewise-linear function of a. How do we use linear programming to minimize J'p(a) if it is not a linear function? We use the following trick: introduce n (# of samples) artificial variables and their constraints. Consider the following Linear Program:
Minimize

Subject to


i=1...n si
si bi - aTyi
si 0

If we fix a value of a, then J'p(a) = minimum{i=1...n si} (the minimum value of our objective function) since the best we can do is take si = max(0,bi - aTyi). Thus, if we minimize i=1...n si over s and a we will obtain the minimum value of J'p(a).
Conclusion: we can use linear programing to minimize the perceptron criterion function

...Click here to return to the main page