Linkage Criteria in Hierarchical Clustering

Motivation

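I'm writing this post because I read a blog post introducing clustering which, when it got to hierarchical clustering, mentioned the linkage criterion (that post rendered it in Chinese as 连接标准). I had rarely used hierarchical clustering, so the concept was unfamiliar to me. I searched around and am summarizing what I found here; most of it comes from Wikipedia, Quora, and the scikit-learn documentation.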

Linkage criteria

Wikipedia's definition is: The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

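In other words, the linkage criterion determines the distance function between two clusters: how the distance between two clusters is measured and computed is decided by the linkage criterion.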
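To make "a function of the pairwise distances" concrete, here is a minimal sketch of three of the simplest criteria (complete, single, average). It assumes Euclidean distance between points, and the function names and toy data are my own, not from any library.

```python
from itertools import product
from math import dist  # Euclidean distance, available since Python 3.8


def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def complete_linkage(cluster_a, cluster_b):
    # Cluster distance = the largest pairwise distance.
    return max(pairwise_distances(cluster_a, cluster_b))


def single_linkage(cluster_a, cluster_b):
    # Cluster distance = the smallest pairwise distance.
    return min(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Cluster distance = the mean of all pairwise distances.
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)


# Two toy clusters on the x-axis (made-up data).
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

print(complete_linkage(a, b), single_linkage(a, b), average_linkage(a, b))
```

The same two clusters get three different distances depending on the criterion, which is why the choice changes which clusters are merged first.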

Wikipedia lists 10 ways of measuring this distance:

  1. Maximum or complete-linkage clustering
  2. Minimum or single-linkage clustering
  3. Mean or average linkage clustering, or UPGMA
  4. Centroid linkage clustering, or UPGMC
  5. Minimum energy clustering
  6. The sum of all intra-cluster variance.
  7. The decrease in variance for the cluster being merged (Ward's criterion).
  8. The probability that candidate clusters spawn from the same distribution function (V-linkage).
  9. The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
  10. The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.

That is too many to go through one by one, so I won't: a few of them involve fairly complicated math, and we rarely use them anyway.

Which linkage criterion to use

Someone asked on Quora: What is the best linkage criterion for hierarchical cluster analysis?

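At the moment there is an answer from an MIT PhD saying that many people have run experiments on this question and that there are a lot of papers about it. According to that answer, the conclusion is that average linkage is the most effective and should be the first choice for hierarchical clustering, while single linkage performs the worst.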

Linkage criteria in sklearn

Here I'll focus on the three criteria that sklearn provides: ward, complete, and average. (For details, see the documentation for sklearn.cluster.AgglomerativeClustering.) sklearn defines the three as follows:

  • ward minimizes the variance of the clusters being merged.
  • average uses the average of the distances of each observation of the two sets.
  • complete or maximum linkage uses the maximum distances between all observations of the two sets.
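As a rough usage sketch, the criterion is chosen through the `linkage` parameter of `AgglomerativeClustering`. The data below is made up, and the two blobs are so far apart that all three criteria should agree; on real data they often will not. (Newer scikit-learn releases also accept "single", which is not covered here.)

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy blobs (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels_by_linkage = {}
for linkage in ("ward", "complete", "average"):
    # n_clusters=2 tells the model to stop merging once two clusters remain.
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels_by_linkage[linkage] = model.fit_predict(X)
    print(linkage, labels_by_linkage[linkage])
```

The label numbers themselves are arbitrary; what matters is which points share a label.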

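The second and third are fairly easy to understand; they correspond to items 3 and 1 in the Wikipedia list. Ward's definition mentions variance, which makes it harder to grasp.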

The Wikipedia article on Ward's method has this sentence: Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.

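My understanding is this: at the start every point is its own cluster, so every cluster has variance 0 and the total variance is 0 too. Every merge makes the total variance go up, so at each step we pick the two clusters whose merge increases it the least.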
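Here is a small sketch of that reading of Ward's criterion. As an assumption on my part, it measures "variance" as the within-cluster sum of squared distances to the centroid (not divided by the cluster size); the helper names and data are made up for illustration and this is not how sklearn implements it.

```python
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def ward_increase(cluster_a, cluster_b):
    """Increase in total within-cluster SSE caused by merging the two clusters."""
    return sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)


def best_merge(clusters):
    """Indices (i, j) of the pair whose merge increases the total SSE the least."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
    )


# Every point starts as its own cluster, so the total SSE is 0 (made-up data).
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 0.0)], [(9.0, 0.0)]]

i, j = best_merge(clusters)
print(i, j, ward_increase(clusters[i], clusters[j]))
```

Repeating this step (merge the chosen pair, then search again) until the desired number of clusters remains gives the full agglomerative procedure.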

posted @ 2017-04-04 21:23  james+zhao