A classification problem deals with associating a given input pattern with one of the distinct classes. Patterns are specified by a number of features (representing some measurements made on the objects that are being classified) so it is natural to think of them as d-dimensional vectors, where d is the number of different features. This representation gives rise to a concept of feature space. Patterns are points in this d-dimensional space and classes are sub-spaces. Classification problem reduces to determining which of the regions a given pattern falls into. If classes do not overlap they are said to be separable and, in principle, one can design a decision rule which will successfully classify any input pattern. A decision rule determines a decision boundary which partitions the feature space into regions associated with each class. It represents our best solution to the classification problem.
The picture illustrates a 2-dimensional feature space with three classes occupying regions of the space.
The goal is to design a decision rule which is easy to compute and yields the smallest possible probability of misclassification of input patterns from the feature space. Formally, the optimal decision rule can be found using Bayesian decision theory, which incorporates methods from probability and statistics.
Our information about the classes (which we use to design a decision rule) is usually derived from some finite sample of patterns with known class affiliations. This sample is called a training set. If we make a decision boundary complex enough every pattern in the training set will be properly classified using the underlying decision rule, even if the distributions of patterns overlap. Classifiers are designed with a purpose of classifying novel patters (not belonging to the training set) and it is unlikely that an overly complex decision boundary would provide good generalization (classify novel patterns with the same success rate that is achieved on the training set) as it was tuned to perform extremely well on the training set. This is known as over-fitting the training set.
The picture shows a decision boundary over-fitting a training set distributed according to the classes of the previous figure. Perfect classification is achieved even though the classes overlap.
Therefore, we might seek to simplify the shape of the decision boundary which will, by sacrificing performance on the training samples, improve the performance on new patterns. Different classifiers can be implemented by constructing an appropriate discriminant function gi(x), where i is the class index. A pattern x is associated with the class j such that gj(x)>gi(x) for every i not equal to j. A simplest discriminant function is linear in pattern features. If a particular type of discriminant function is a priori chosen for a problem at hand, one is left with a task of finding the parameters yielding the optimal performance (minimal probability of error) of the classifier.
Links to other classification sites:
Classification Society of North America
Pattern Recognition Course on the Web (by Richard O. Duda)