|
|
![]() |
View Applet
Cluster analysis is a process used to solve classification problems. Its object is to group data points into clusters so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Thus each cluster describes the class to which its members belong.
As a result, cluster analysis can reveal similarities in data which may have been otherwise impossible to find. Examples of such applications would be a classification scheme for related fauna and flora, as well as models with which to describe populations, or even methods to recognize blood cells or handwriting.
Given a set S consisting of n points in an m-dimensional Enclidean space Rm, how can S be partitionned into k clusters. Letting Pk signify the set of all partitions in S into k clusters, a clustering criterion W(P,k) is assigned to every P which measures the correctness of each partition into k clusters. The problem thus, is to find a partition P' which maximizes (or minimizes) the criterion over all partitions into k clusters.
In order to determine the number of clusters and their partitions, numerous clustering algorithms exist which fall in one of two categories: hierarchical and non-hierarchical clustering.
Non-hierarchical clustering algorithms produce disjoint clusters and thus work well when a given set is composed of a number of distinct classes or when the data description is "flat".
Suppose we have n feature vectors X1, X2, ... , Xn all belonging to the same class C and we know that they belong to k clusters such that k < n. If clusters are well separated we can use a minimum distance classifier to separate them. We first initialize the means m1 ... mk of k clusters. One of the ways to do this is just to assign random numbers to them. We then determine the membership of each X by taking the || X - mi ||. The minimum distance determines X's membership in a respective cluster. This is done for all n feature vectors Here is the actual algorithm:
Figure 1 explains how the algorithm works in the case when k=2. We randomly select the values for m1 and m2, and the algorithm converges when the means no longer change. (Let mi = mi)
Figure 1
The are certain problems with this algorith, like when the means are chose such that some Mi happens to be 0, so that it cannot be updated. It is, hence, advisable to try a number of different starting points. The results depend a lot on the value of k, i.e. the actual number of clusters. Often, we have no way of knowing how many clusters exist. Figure 2 shows what happens when we use k=3. (Let mi = mi)
Figure 2
Sometimes the clustering division turns out to be better at higher k.We can go all the way up to k=n. However, this procedure will give us something known as the Nearest Neighbor Classifier which will be discussed later on. It performs great if the number of feature vectors is large; however, computationally it is much more expensive.
In every iteration of the classical k-means procedure, we assumed that each feature vecotor belongs to exactly one cluster. We can relax this condition and assume that each sample Xi has some graded or "fuzzy" membership in a cluster mj. The probabilities of cluster membership for each point are normalized as:
Sci=1 P(wi | xj) = 1, where j = 1, ..., n
Each mi is then re-calculated as:
mj = ( Snj=1 P(wi | xj)b xj ) / ( Snj=1 P(wi | xj)b
And each P(wi | xj) is then re-calculated as:
P(wi | xj) = ( 1 / dij )1/(b-1) / ( Scr=1 ( 1 / drj )1/(b-1) ), where dij = || xj - mi ||2
The algorithm looks as follows:
It may seem that the graded membership, which fuzzy K-means offers, improves the convergence of the algorithm. There are still drawbacks to this method: incorrect specification of the number of clusters.
It may sometimes happen that we do not have all the data points at once, but rather obtain them gradually over a period of time. The Sequential K-means procedure offers us the flexibility of updating the means as new data points arrive.
Our objective is to find such partitions of a set of n samples which extremize the criterion function (distance, etc.) Due to an overwhelming number of partitions of n elements into c subsets ( cn /c! ), we cannot consider all possibilities. One approace is Iterative Optimization descibed by the Basic iterative minimum-squared-error clustering which could be modified to a Sequential K-Means procedure. The idea is to come up with a reasonable inital partition and to "move" samples from one group to another if such move improves the value of the criterion function.
Basic iterative minimum-squared-error clustering:
This procedure has certain drawbacks as well, one of them being the order in which the candidates are selected.
There are many times when clusters have sub-classes within them, and which in turn have sub-classes of their own. Examples of such occurences can be found in biological taxonomy when we have a species in the 'animal' kingdom, with a 'chordata' phylum, a sub-phylum of 'vertebrata, and so on. Such classifications are hierarchical and proper partitions can be found by the following classification methods.
More formally, a sequence is said to be a hierarchical clustering if there exists 2 samples, c1 and c2, which belong in the same cluster at some level k and remain clustered together at all higher levels > k.
One example of a hiearachical clustering is a correspondence tree, or a dendrogram (shown in Figure 1), which shows how samples are grouped together. The first level shows all samples xi as singleton clusters. As we increase levels, more and more samples are clustered together in a hierarchical manner.
Figure 1Another example is one based on sets where each cluster level may contain sets that are subclusters as shown in the Venn diagram in Figure 2.
Figure 2The hierarchical clustering procedures can be divided into two different approaches: agglomerative and divisive. The agglomerative approach to cluster analysis, used by the nearest and farthest neighbour algorithms, is a bottom-up clumping approach where we begin with n singleton clusters and successively merge clusters to produce the other ones. The minimal spanning tree, on the other hand, uses the divisive approach which is a top-down approach where we start with one cluster and successively split clusters to produce others.
In general, all agglomerative algorithms usually yield the same results if the clusters are compact and well separated. However, if the clusters are close to one another or if their shapes are not hyperspherical, different results can be expected. The space and time complexities are O(n2) and O(cn2d2) respectively, where c is the number of clusters and d is the distance between them.
The following is the nearest-neighbour clustering algorithm:
To find the nearest clusters in step 5, the following clustering criterion function is used:
dmin(Di, Dj) = min || x - x' ||, where x e Di and x' e Dj
The merging of the two clusters in step 6 simply corresponds to adding an edge betwwen the nearest pair of nodes in Di and Dj.
Also, if instead of terminating after a predetermined number of clusters have been obtained, it is possible to set the termination criteria to stop when the distance between nearest clusters exceeds a predetermined threshold, thus becoming the single-linkage algorithm.
Some interesting properties of this algorithm is that it will always produce a tree since edges linking clusters always go between distinct clusters, thus no cycles are produced as can be seen in Figure 3. Also, if the algorithm is allowed to run until all clusters are connected then a spanning tree is generated, or more precisely, a minimal spanning tree since the edges we are inserting between clusters are always the shortest ones in distance.
Figure 3As a result, the nearest-neighbour algorithm can effectively detect elongated clusters but is not as effective when clusters are compact and of equal size. Therefore we look to the farthest-neighbour algorithm to detect such clusters.
The following is the farthest-neighbour clustering algorithm:
To find the nearest clusters in step 5, the following clustering criterion function is used:
dmax(Di, Dj) = max || x - x' ||, where x e Di and x' e Dj
Also, if instead of terminating after a predetermined number of clusters have been obtained, it is possible to set the termination criteria to stop when the distance between nearest clusters exceeds a predetermined threshold, thus becoming the complete-linkage algorithm.
The farthest-neighbour algorithm prevents the grouping of elongated clusters; instead, clusters are composed of complete subgraphs (Figure 4). As a result, the distance of two clusters is determined not by the distance between the two closest nodes (as in the nearest-neighbour algorithm), but by the distance between the two most distant nodes. Thus, when merging the two clusters, edges are added between every pair of nodes in the affected clusters.
Figure 4This algorithm thus detects clusters which are compact and of roughly equal size, but has difficulties when the clusters are elongated and can generate groupings which are meaningless.
Possible compromises between the two agglomerative approaches suggested above are to use the following criterion to find the nearest clusters:
davg(Di, Dj) = ( 1 / (ninj) ) S S || x - x' ||, where x e Di and x' e Dj
dmean(Di, Dj) = || mi - mj ||
The minimal spanning tree algorithm is a divisive approach because we are given the minimal spanning tree of all the nodes thus providing us with only one cluster to work with initially.
There exists numerous ways to divide the clusters successively. One way is to simply remove the longest edge and create 2 clusters and then recursively removing the longest edge until the desired number of clusters is obtained.
Another method in selecting which edge to remove is to compare the length of the edges to the lengths of the edges incident upon its nodes. As a result, we can consider an edge to be inconsistent if its length l is significantly larger than l', where l' is the average length of all other edges incident on its nodes (Figure 5). This method has the advantage of determining clusters which have different distances separating one another.
Figure 5Another use of the minimal spanning tree is to find dense clusters embedded in a sparse set of points (Figure 6). All that has to be done is to remove all edges longer than some predetermined length in order to extract clusters which are closer than the specified length to each other. If the length is chosen accordingly, the dense clusters are extracted from a sparse set of points easily.
Figure 6This method assumes that the data points X1, X2, ... , Xn come from a Poisson distribution (density function f(x)=e-m mx/x!) and that the clusters are convex. This procedure uses a method of maximum likelihood estimation. For C1, C2, ... , Ck clusters and n feature vectors, the likelyhood function is as follows:
fC(x) = [1/(m(C))n] Pni=1IC (xi), where C is the union of k disjoint domains or clusters
m(D) is the sum of Lebesgue measures of the k subsets of C. The maximum likelyhood estimation of k subsets C1, C2, ... , Ck is constituted by k subgroups of points such that the sum of the Lebesgue measures of their disjoint convex hulls is minimum. In a two-dimensional space (m=2), the optimal solution is given by k subgroups of points such that the areas of their disjoint convex hulls is minimum.
There are several implementations to this method. One is the global algorithm which is a dynamic programming procedure giving an optimal partition. (but not necesssarily unique) It takes polynomial time. Another approach is to apply a hierarchic divisive procedure which yeilds a partition of the set of objects into k clusters and then improve the partition according to the hypervolume criterion. The three methods below involve the hypervolume criterion.
This method consists of plotting the value of a clustering criterion against the number of clusters k and analyzing the graph looking for discontinuities in the slope. A sharp step in the curve signifies the number of clusters whereas no sharp inclinations suggests only one cluster. This procedure however can be very unreliable by showing a large number of clusters when analyzing unstructured data.
This method looks at C1, C2, ... , Ck subsets making an assumption that the points come from a Poisson distribution. The solution is based on the dilation of the convex hull H(x) of x = {x1, x2, ... , xk} about its centroid. A decision rule for estimating the number of clusters looks at whether the intersection of the convex sets is NULL or not starting at t=2 clusters and going up until a specified stopping condition.
This method assumes n data points from a Poisson process and k disjoint convex sets C1, C2, ... , Ck in an m-dimensional Euclidian space. For a given integer k >= 2, we test whether a subdivision into k clusters is significantly better than the subdivision into k-1 clusters. In practice, when the distribution is uknown, the hypothesis is rejected when the ratio is large. The ration has the form:
W(P,k)/W(P,k-1)
(For more information see (Hardy, 1992))
This method is based on the assumption of multivariate normality and is a likelihood ratio criterion to test the hypothesis of k clusters against k-1 clusters
This is a test for the number of clusters in a hiearchical clustering sequence. The stopping rule selects the partition associated to the first stage j in the cluster sequence j = 2, ..., n-2 which satisfies:
where m is the mean of the distribution of n-1 values, sz is the standard deviation of those same values, and k is the standard deviate.
As a result, zj+1 lies in the upper tail of the distribution of criterion values m. The best results are when the number of clusters are found to be n-j for the first zj+1 satisfying the condition. Therefore, if none satisfy the condition, there is no relevant classification.
This test determines the number of clusters within a hierarchical sequence also. This time, the stopping rule chooses the partition associated with the first stage j in the partial cluster sequence j = r, r+1, ..., n-2 which satisfy the condition:
where r is the number of items in the moving average, m is the moving mean in stage j, Lj is the correction trend lag in stage j, bj is the moving least squares slope in stage j, sz is the moving estimate of the standard deviation in stage j.
As before, if no zj+1 satisfies the condition, then we conclude there is no relevant classification for the data.
This method assesses the number of clusters k by selecting the k for which k2|W| is a minimum, where W is the matrix of the group dispersion.
The above methods M1 through M7 have been tested (Hardy, 96) on a variety of data. It was shown that if the clusters are compact and well separated then almost every method succeded in finding the obvios clusters. However, what was more interesting was the behavior of these methods on specific arrangements of data points in 2D Euclidian space.
It is pretty clear that the methods based on the hypervolume criterion performs well where applicable, even when there is no group structure present. One thing to mention is about the last method M7 which failed in every sigle test by being off by a factor of 10.
As can be seen from the above algorithms, there is a problem in implementing an algorithm which will work well in finding clusters in any set of data. Any approach chosen carries with it a set of implicit assumptions about the type of structure the data is in which prevents us from selecting any one algorithm and using it for all data. The methods may give the correct number of clusters contained within the data but whether the partitioning is accurate is another issue.
Since using different clustering criterion and methods for determining the number of clusters and their partitions can produce varying results when applied to identical data, we should be wary about accepting the results of a single clustering method. As a result, various cluster analysis methods should be used and an average obtained, or the data should be analyzed for size, shape and separation and only after that analysis has been done do we select an appropriate clustering algorithm to determine the number of clusters and their partitionings.
Duda, Richard. "Feature Selection and Clustering for HCI". http://www-engr.sjsu.edu/~knapp/HCIRDFSC/FSC_home.htm, 1996-97.
Duda, R. & Hart, P. & Stork, D. Pattern Classification: 2nd edition. John Wiley & Sons, Inc: California, USA, 1998.
Hardy, Andre. "On the Number of Clusters". Computational Statistics & Data Analysis: volume 23, p. 83-96, 1996.