Study notes for Clustering and K-means
1. Clustering Analysis
- Clustering is the process of grouping a set of (unlabeled) data objects into multiple groups or clusters such that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.
- Cluster analysis is defined as "a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics" (Jain, 2009).
- Clustering can be considered the most important unsupervised learning problem.
- Clusters can be different in terms of their shape, size and density.
Problems of Clustering Algorithms
- Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
- The effectiveness of the method depends on the definition of the "distance" or "similarity" metric.
- The outcomes of an algorithm can be interpreted in different ways.
- The presence of noisy data makes the detection of clusters even more difficult.
Taxonomy of Clustering Algorithms
- Partitioning Algorithms: (1) K-medoids Algorithms; (2) K-means Algorithms; (3) Probabilistic Algorithms;
- Hierarchical Algorithms: (1) Agglomerative Algorithms; (2) Divisive Algorithms.
- Density-based Algorithms: (1) Density-Based Connectivity Clustering; (2) Density Functions Clustering.
- Grid-based Algorithms.
- Algorithms for high-dimensional clustering: (1) Subspace Clustering; (2) Projection Algorithms; (3) Co-Clustering (i.e. Biclustering).
Cluster Validity
- Compactness: objects within the same cluster should be close to one another.
- Separateness: different clusters should be well separated from one another.
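These two criteria are often summarized by a single validity index. A minimal sketch, assuming scikit-learn and toy data; the silhouette coefficient is one common index that combines both criteria, though it is not named in these notes:

```python
# Silhouette compares, for each point, the mean intra-cluster distance
# (compactness) with the mean distance to the nearest other cluster
# (separateness); values near 1 indicate compact, well-separated clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # higher is better, range [-1, 1]
```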
Some Notes
- It is reported that clustering algorithms following the same clustering strategy result in similar clusterings in spite of minor variations in the parameters or objective functions involved (Jain, 2009).
- In other words, there is no best clustering algorithm. Each clustering algorithm imposes a structure on the data either explicitly or implicitly.
- Cluster analysis is an exploratory tool. With the emergence of new applications, it has become increasingly clear that the task of seeking the best clustering principle might indeed be futile.
- A clustering method is a general strategy employed to solve a clustering problem.
- A clustering algorithm, on the other hand, is simply an instance of a method. For example, minimizing the square error is a clustering method, and there are many different clustering algorithms, including K-means (Jain, 2009).
- Side information can be regarded as any external information.
Research Trends
- Clustering ensembles. By taking multiple looks at the same data, one can generate multiple partitions of it, called a cluster ensemble. By combining the resulting partitions, it is possible to obtain a good data partitioning even when the clusters are not compact and well separated. It can be implemented by:
- applying different clustering algorithms.
- applying the same clustering algorithm with different values of parameters or initialization.
- combining different data representations (feature spaces) and clustering algorithms.
Example. K-means is run multiple times, say N, with varying values of the number of clusters K. The new similarity between a pair of points is defined as the number of times the two points co-occur in the same cluster across the N runs of K-means. The final clustering is obtained by clustering the data based on this new pair-wise similarity (Jain, 2009); a code sketch of this co-association idea follows.
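A minimal sketch of the co-association idea, assuming scikit-learn and toy data; the number of runs, the values of K, and the final average-link step are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
n = len(X)

# Co-association matrix: entry (i, j) is the fraction of runs in which
# points i and j land in the same cluster.
ks = [2, 3, 4, 5, 6]  # N = 5 runs with varying K (illustrative)
co_assoc = np.zeros((n, n))
for run, k in enumerate(ks):
    labels = KMeans(n_clusters=k, n_init=10, random_state=run).fit_predict(X)
    co_assoc += (labels[:, None] == labels[None, :]).astype(float)
co_assoc /= len(ks)

# Final clustering on the new pair-wise similarity: turn it into a distance
# and apply average-link agglomerative clustering (older scikit-learn
# versions name the metric= parameter affinity= instead).
final = AgglomerativeClustering(n_clusters=4, metric="precomputed",
                                linkage="average").fit_predict(1.0 - co_assoc)
print(final[:10])
```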
- Semi-supervised clustering makes use of side information in addition to the similarity matrix. This side information consists of pair-wise constraints, which are usually provided by domain experts (a sketch of a constraint-respecting assignment step follows after this list):
- a must-link constraint specifies that the point pair connected by the constraint belongs to the same cluster.
- a cannot-link constraint specifies that the point pair connected by the constraint does not belong to the same cluster.
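A hedged sketch of how such constraints can enter the K-means assignment step, in the spirit of COP-KMeans (Wagstaff et al., 2001; not cited in these notes). It does a single assignment pass and has no recovery if the constraints cannot all be satisfied:

```python
import numpy as np

def partners(i, pairs):
    # Other endpoints of the constraints that involve point i.
    return [b if a == i else a for (a, b) in pairs if i in (a, b)]

def constrained_assign(X, centers, must_link, cannot_link):
    n = len(X)
    labels = np.full(n, -1)
    # Squared distances from the n points to the k centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    for i in range(n):
        # Must-link: reuse the cluster of an already-assigned partner.
        forced = [labels[j] for j in partners(i, must_link) if labels[j] != -1]
        if forced:
            labels[i] = forced[0]
            continue
        # Cannot-link: nearest center not taken by a constrained partner.
        banned = {labels[j] for j in partners(i, cannot_link) if labels[j] != -1}
        labels[i] = next(int(c) for c in np.argsort(dists[i]) if c not in banned)
    return labels
```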
- Large-scale clustering addresses the challenge of clustering millions of data points represented by thousands of features, as in document clustering, gene clustering, etc. Typical examples are tree-based methods (e.g., kd-tree), BIRCH, coreset K-means, etc.; a BIRCH usage sketch follows.
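BIRCH, mentioned above, ships with scikit-learn; a minimal usage sketch with illustrative parameters. It summarizes the data in a single pass through a compact CF-tree, which is what makes it suitable for large data sets:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=10, random_state=0)  # toy data
# threshold controls the granularity of the CF-tree summary.
labels = Birch(threshold=0.5, n_clusters=10).fit_predict(X)
print(labels[:10])
```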
- Multi-way clustering. Co-clustering aims to cluster both features and instances of the data (or both rows and columns of the n×d pattern matrix) simultaneously, to identify the subsets of features for which the resulting clusters are meaningful according to some evaluation criterion; a usage sketch follows.
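A minimal co-clustering sketch using scikit-learn's SpectralCoclustering, one concrete algorithm for this task (not one named in these notes); it expects a non-negative data matrix:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
X = rng.random((100, 40))  # toy n x d pattern matrix, non-negative

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)
print(model.row_labels_[:10])     # cluster of each instance (row)
print(model.column_labels_[:10])  # cluster of each feature (column)
```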
- Heterogeneous data refers to data whose objects may not be naturally represented by a fixed-length feature vector. Typical examples include rank data, dynamic data, graph data, relational data, etc.
2. K-means
- Even though K-means was first proposed over 50 years ago, it is still one of the most widely used algorithms for clustering. Ease of implementation, simplicity, efficiency, and empirical success are the main reasons for its popularity (Jain, 2009).
- It is not guaranteed to find the global optimum and only terminates at a local optimum, although a recent study has shown that, with large probability, K-means converges to the global minimum when the clusters are well separated (Meila, 2006).
- The results may depend on the initial random selection of cluster centers. To obtain good results in practice, it is common to run the K-means algorithm multiple times with different initial cluster centers.
- K-means is typically used with the Euclidean metric for computing the distance between points and cluster centers. As a result, K-means finds spherical or ball-shaped clusters in data (Jain, 2009).
- It is very sensitive to noise and outliers because a small number of such data can substantially influence the mean value.
- The time complexity of K-means is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Therefore, the method is relatively scalable and efficient in processing large data sets; a minimal implementation sketch follows.
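To make the O(nkt) cost concrete, a minimal NumPy sketch of the standard (Lloyd's) iteration: each of the t iterations computes n × k distances and k means. Initialization is deliberately naive; in practice one runs it with several seeds and keeps the run with the lowest total squared error, as noted above:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):                                 # t iterations
        # Assignment step: nearest center for each of the n points, O(nk).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center in this sketch).
        new_centers = np.array([X[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):               # converged
            break
        centers = new_centers
    return labels, centers
```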
3. K-means Variants
Fuzzy c-means
- In K-means, each data point is assigned to a single cluster; this is called hard assignment.
- Fuzzy c-means is an extension of K-means in which each data point can be a member of multiple clusters with a membership value; this is called soft assignment.
- Fuzzy c-means was proposed by Dunn (1973). A good overview of fuzzy set based clustering is available in Backer (1978).
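A minimal sketch of the soft-assignment step (the standard fuzzy c-means membership update), with the centers assumed given; the fuzzifier m > 1 controls how soft the memberships are, and m → 1 approaches the hard assignment of K-means:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    # Distances from the n points to the c cluster centers.
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    d = np.maximum(d, 1e-12)  # guard against a point sitting on a center
    # Standard update: u[i, c] = 1 / sum_k (d[i, c] / d[i, k])^(2 / (m - 1));
    # each row of the result sums to 1 across the clusters.
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)
```

The centers are then updated as membership-weighted means, analogous to the K-means update step.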
Bisecting K-means
- Steinbach et al. (2000) proposed a hierarchical divisive version of K-means, called bisecting K-means, that recursively partitions the data into two clusters at each step.
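A minimal sketch that splits the largest cluster with 2-means at each step; Steinbach et al. also consider other split criteria (e.g., the cluster with the highest SSE), and scikit-learn ≥ 1.1 ships a ready-made sklearn.cluster.BisectingKMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)  # start from one big cluster
    while labels.max() + 1 < k:
        target = np.bincount(labels).argmax()  # split the largest cluster
        mask = labels == target
        sub = KMeans(n_clusters=2, n_init=10,
                     random_state=seed).fit_predict(X[mask])
        # One half keeps the old id, the other half gets a fresh id.
        labels[np.where(mask)[0][sub == 1]] = labels.max() + 1
    return labels
```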
X-means
- X-means (Pelleg and Moore, 2000) automatically finds K by optimizing a criterion such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
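X-means itself interleaves cluster splitting with BIC tests; as a simpler illustration of the same model-selection idea, this sketch picks K by fitting Gaussian mixtures over a range of K and keeping the lowest BIC (scikit-learn's GaussianMixture, not the X-means algorithm):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data
# Lower BIC balances goodness of fit against model complexity.
best_k = min(range(2, 11),
             key=lambda k: GaussianMixture(n_components=k,
                                           random_state=0).fit(X).bic(X))
print(best_k)  # typically recovers 4 on this toy data
```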
Kernel K-means
- Scholkopf et al. (1998) proposed a kernel version of K-means to detect arbitrarily shaped clusters with an appropriate choice of the kernel similarity function.
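A minimal NumPy sketch of kernel K-means: all distances are computed through the kernel matrix, so the cluster centers in feature space are never formed explicitly. The RBF kernel and its width are assumptions here; any positive semi-definite kernel works:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), an assumed kernel choice.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, n_iter=50, seed=0):
    n = len(K)
    labels = np.random.default_rng(seed).integers(k, size=n)  # random init
    for _ in range(n_iter):
        dist = np.empty((n, k))
        for c in range(k):
            idx = labels == c
            nc = max(idx.sum(), 1)
            # ||phi(x_i) - m_c||^2 = K_ii - (2/|C|) sum_{j in C} K_ij
            #                        + (1/|C|^2) sum_{j,l in C} K_jl
            dist[:, c] = (np.diag(K) - 2.0 * K[:, idx].sum(axis=1) / nc
                          + K[np.ix_(idx, idx)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return labels

# Usage: labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k=2)
```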
References
- Backer, 1978. Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. Delft University Press.
- Dunn, 1973. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernet. 3, 32–57.
- Jain, 2009. Data clustering: 50 years beyond K-means. Pattern Recognition Letters.
- Meila, 2006. The uniqueness of a good optimum for K-means. ICML.
- Pelleg and Moore, 2000. X-means: Extending k-means with efficient estimation of the number of clusters. ICML.
- Scholkopf et al., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 (5), 1299–1319.
- Steinbach et al., 2000. A comparison of document clustering techniques. KDD Workshop on Text Mining.