Classifiers 

Our Algorithm

Here we present a family of classification algorithms which we developed. Our intention was to demonstrate implementation of various methods and allow for interactive comparison of their relative performance.

As some of the algorithms are computationally demanding we have kept the dimensionality of the feature space low (d was set to 2) in order to speed up the calculations without a significant loss in generality. For similar reasons we have limited the classifier to consider only linear discriminant functions. Linear discriminant function limits us to two distinct classes. Every pattern has its class information coded as 1 or -1. Given a set of input patterns we apply three different techniques to find a line which best partitions the training set into two distinct classes. A linear discriminant function in a d dimensional space can be parameterized as:
linear discriminant function (1)
where xi are the components of the feature vector and the weights need to be adjusted to optimize the performance of the classifier. To further reduce the number of parameters in the 2-dimensional case in question, brute force algorithm and the genetic algorithm parameterize the line by the angle (with respect to x axis) and the distance of closest approach to the origin - this is called the normal parameterization.

Brute Force Approach

Assuming that the feature vectors from the training set are rescaled to the (-1,1) interval the parameter space one needs to sweep in order to find the optimal solution is bound. Angles are limited to (0o,360o) and distances to (0,1.41). Given a specific granularity (which determines the increment in the parameter value as we seek the solution) for each parameter the number of discriminant functions that need to be tested is finite. The algorithm (in d dimensions) runs in O(n1n2...nd) time, where nk is the number of `subdivisions' along the k'th parameter axes. Clearly, this trial-and-error approach becomes impractical as the dimensionality of the feature space increases.

Neural Network

a single neuron neural network A neural network searches for the optimal solution by performing an error minimizing procedure. As we are looking for a linear discriminant function, a neural network with no hidden layers (just a single perceptron) suffices. Given a feature vector the network output is calculated by:
neural network output (2)
where ai are called synaptic strengths, b is called the threshold and f(x) is called the activation function. If identity function is used for activation the neural network is just a linear function. To simplify the calculation of the error function we use f(x)=c tanh(x). Functional value has the same sign as x (which is the linear combination from equation 2) and classification is done according to that sign (which explains why -1 and 1 were chosen for class variables). All patterns yielding a positive value of this activation function are classified as class 1, and the rest are classified as class -1. A constant c determines the steepness of the function close to the origin (how closely it resembles the step function). The error function is a sum of squared difference between the pattern's class and the network output summed over the whole training set:
error function (3)
Error back-propagation based learning methods adjust the parameters in order to minimize the error function. As the gradient of the function points in the direction of the steepest increase in the functional value the road to the minimum should lead the other way. Thus each parameter p is incremented according to:
error back-propagation (4)
where e is a small number called the learning rate. Learning rate is varied to influence the speed of the network convergence. Weights can be adjusted either after every pattern is presented to the network (online learning) or the increment can be accumulated until the whole training set has passed through the network (batch learning). In the network training stage the training set is repeatedly 'passed' through a network (each passing is called the epoch) until the parameters converge and the error function doesn't decrease any more. Note that the performance should always be calculated using a feature set that not used in training.

Error minimization methods, such as neural networks, use local information about the shape of the parameter space and therefore run a risk of being trapped in a local minimum thus resulting in a sub-optimal solution to the problem.

Genetic Algorithm

Genetic algorithms are a model of computation generally used for optimization, and a classification problem reduces to finding the parameters of the optimum discriminant function defining the boundary between classes. We would like to have a method which initially performs a coarse search of the whole space and thus doesn't run a risk of missing the global minimum but which, in the final stage, concentrates on the neighborhoods of the local minima and finds the optimum. Genetic algorithms combine global search and the use of local information.

An elementary unit of a genetic algorithm, called a chromosome, carries the information about the set of parameters representing a particular instance of a discriminant function. The appropriate form of the discriminant function and its parameterization are given in advance.

Each chromosome has a number of genes equal to the number of parameters used in the discriminant function. A gene is a binary representation of the value of the parameter. It is a sequence of bits (0 and 1) and its length (which is pre-determined) reflects the precision of parameter values. In our study, data was rescaled so that all relevant parameter values fall in the (-1,1) range and a 10 bit gene used yields precision of about 0.1%.

A fitness function of a chromosome is a measure its performance and it is a quantity the algorithm is trying to maximize. In our case it is the fraction of patterns properly classified by applying the discriminant function parameterized by the chromosome to a given testing set. This choice allows us to test the genetic algorithm against the selection methods performed earlier.

single point crossover mutation Two chromosomes are combined to form a new one using a single point crossover. A random point in the chromosome is picked. All the information from parent A is copied from the start up to the crossover point, then all the information from parent B is copied from the crossover point to the end of the chromosome. The new chromosome thus gets the head of one parent's chromosome combined with the tail of the other. The ambiguity between parents A and B (which parent contributes the head and which contributes the tail to the new chromosome) is resolved by a random draw. Variations exist which use more than one crossover point, or combine information from parents in other ways. Mutations are a crucial part of the algorithm as they allow creation of radically new solutions. In our implementation each bit in the new chromosome's genome has a fixed probability of mutation - a mutation switches a bit from 0 to 1 or vice versa. Mutation probability is an essential parameter and should be varied as time progresses. Initially, one would like to set a large mutation probability and quickly probe different regions of the parameter space. As we are approaching the optimum, mutation probability should be decreased so that new chromosomes are created sufficiently close to their parents.

probabilities of breeding and dying A colony is a collection of chromosomes which evolves as the algorithm progresses. It is created by providing two random chromosomes and is then propagated through epochs. Colony size is kept bound to speed up the algorithm convergence. In every epoch each organism has a chance of `breeding' and thus passing its genes to new chromosomes. A probability of breeding is determined by 1-colony size / maximum colony size which simulates having finite `resources' to support the colony. A partner is chosen as the best performer among a 10 random chromosomes To keep the size of the colony bound chromosomes also 'die' based on two distinct criteria. During every epoch, all chromosomes other than the best performer can die because of 'overpopulation' with a probability of colony size / maximum colony size. In addition, a portion of chromosomes with low fitness function have a fixed probability of dying from natural selection. It is important that malperforming chromosomes are not destroyed unconditionally because they still have a chance of creating good offspring.

Link to other Neural Network sites:
Neural Network Information
Link to other Genetic Algorithm sites:
Hitch-Hikers Guide to Evolutionary Computation

Applet