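Linkage Criteria in Hierarchical Clustering
Motivation
I wrote this post after reading a blog post that introduced clustering. When it got to hierarchical clustering, it mentioned the linkage criterion, which that post rendered in Chinese as 连接标准. I had rarely used hierarchical clustering, so I was not familiar with the concept. I searched around and summarized the main points here; most of them come from Wikipedia, Quora, and the scikit-learn documentation.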
Linkage criteria
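Wikipedia defines it as: The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
In other words, the linkage criterion defines the distance between two clusters as a function of the pairwise distances between their points. How the distance between two clusters is measured and computed is determined by the linkage criterion.
Wikipedia lists 10 ways of measuring this distance: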
- Maximum or complete-linkage clustering
- Minimum or single-linkage clustering
- Mean or average linkage clustering, or UPGMA
- Centroid linkage clustering, or UPGMC
- Minimum energy clustering
- The sum of all intra-cluster variance.
- The decrease in variance for the cluster being merged (Ward's criterion).
- The probability that candidate clusters spawn from the same distribution function (V-linkage).
- The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
- The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.
That is too many criteria to discuss one by one: several of them involve fairly complicated math, and we rarely use them in practice.
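To make the first three criteria in the list concrete, here is a minimal sketch in plain Python. The two clusters are made-up illustrative data, the function names are my own, and Euclidean distance is assumed as the pairwise distance; none of this comes from Wikipedia or a library.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_distances(a, b):
    """Distances between every point in cluster a and every point in cluster b."""
    return [dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    """Minimum pairwise distance between the two clusters."""
    return min(pairwise_distances(a, b))


def complete_linkage(a, b):
    """Maximum pairwise distance between the two clusters."""
    return max(pairwise_distances(a, b))


def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    d = pairwise_distances(a, b)
    return sum(d) / len(d)


# Hypothetical toy clusters lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```

The same two clusters are 2.0, 5.0, or 3.5 apart depending on the criterion, which is why the choice of linkage can change which clusters get merged first.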
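Which linkage criterion to use
Someone asked on Quora: What is the best linkage criterion for hierarchical cluster analysis?
At the time of writing, there is one answer from an MIT PhD. It says that many people have run experiments on this question and that there are many related papers. The conclusion, according to that answer, is that average linkage is the most effective and should be the first choice for hierarchical clustering, while single linkage performs worst.
Linkage criteria in sklearn
The focus here is on three of the criteria available in sklearn: ward, complete, and average (see the documentation for sklearn.cluster.AgglomerativeClustering for details). sklearn defines them as: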
- ward minimizes the variance of the clusters being merged.
- average uses the average of the distances of each observation of the two sets.
- complete or maximum linkage uses the maximum distances between all observations of the two sets.
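The second and third are fairly easy to understand; they correspond to the third and first items in the Wikipedia list. The definition of ward mentions variance, which makes it harder to grasp.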
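As a rough sketch of how these three options are selected in practice, the snippet below passes each one to the linkage parameter of sklearn.cluster.AgglomerativeClustering. The data set X is a made-up toy example with two well-separated groups, and the results dictionary is just a convenience for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three points near the origin, three near (10, 10).
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

results = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    # fit_predict returns one cluster label per row of X.
    results[linkage] = model.fit_predict(X)
    print(linkage, results[linkage])
```

On data this cleanly separated, all three criteria should recover the same two groups, although which group gets label 0 is arbitrary. Differences between the criteria tend to show up on noisier or elongated clusters.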
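The Wikipedia article on Ward's method says: Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.
My understanding is this: initially every point is its own cluster, so every cluster's variance is 0 and the total variance is also 0. Each merge increases the total variance, so at each step we choose the pair of clusters whose merge increases the total variance the least.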
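That reading can be checked with a small sketch in plain Python. It assumes that "variance" here means the within-cluster sum of squared distances to the centroid, which is the usual quantity in Ward's method; the helper names and the three starting points are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def merge_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)


# Every point starts as its own cluster, so each within_ss is 0.
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 0.0)]]

# Cost of every candidate merge, keyed by the pair of cluster indices.
costs = {
    (i, j): merge_cost(clusters[i], clusters[j])
    for i, j in combinations(range(len(clusters)), 2)
}
best = min(costs, key=costs.get)

print(costs)  # {(0, 1): 0.5, (0, 2): 12.5, (1, 2): 8.0}
print(best)   # (0, 1)
```

The first merge joins the two closest points because that raises the total by only 0.5. In later steps the cost also depends on cluster sizes, so Ward's choice is not always the pair with the nearest centroids.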