Clustering Metrics

1 Introduction

I spent some time working with clustering algorithms; this post records the evaluation metrics commonly used for clustering, along with the corresponding code. Suppose we cluster 10 objects whose ground-truth labels are [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]. Then the predicted labelings [1, 2, 0, 2, 2, 2, 3, 3, 3, 0] and [2, 3, 1, 3, 3, 3, 0, 0, 0, 1] should receive identical scores: for the same clustering result, the metrics must not change when the clusters are assigned different IDs.
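A quick sanity check of this invariance using sklearn's built-in metrics (both are covered below):

from sklearn.metrics.cluster import fowlkes_mallows_score, normalized_mutual_info_score

gt = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
pred_a = [1, 2, 0, 2, 2, 2, 3, 3, 3, 0]
pred_b = [2, 3, 1, 3, 3, 3, 0, 0, 0, 1]  # pred_a with cluster IDs relabeled

print(fowlkes_mallows_score(gt, pred_a), fowlkes_mallows_score(gt, pred_b))  # identical
print(normalized_mutual_info_score(gt, pred_a), normalized_mutual_info_score(gt, pred_b))  # identical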

2 Common Clustering Metrics

Suppose we cluster N points; these N points form \(\frac{N \times (N-1)}{2}\) pairwise edges. As illustrated in the figure below, we can classify these edges into four categories:

Category 1 (red): the pair is in the same cluster under both the ground-truth labels and the clustering result: TP.
Category 2 (green): the pair is in the same cluster under the ground-truth labels but not in the clustering result: FP.
Category 3 (orange): the pair is not in the same cluster under the ground-truth labels but is in the clustering result: FN.
Category 4 (blue): the pair is in the same cluster under neither the ground-truth labels nor the clustering result: TN.
Note: following sklearn.metrics.cluster.fowlkes_mallows_score, the definitions of FP and FN here appear to be swapped relative to many other references.
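A brute-force sketch that counts the four categories directly (O(N^2), for illustration only; pair_confusion is a hypothetical helper, not a library function, and it uses the sklearn convention noted above):

from itertools import combinations

def pair_confusion(labels_true, labels_pred):
    # Enumerate all N * (N - 1) / 2 pairs and classify each one.
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fp += 1  # together in the ground truth, split in the prediction
        elif same_pred:
            fn += 1  # together in the prediction only
        else:
            tn += 1
    return tp, fp, fn, tn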

2.1 FMI

The Fowlkes-Mallows index (FMI) is defined as the geometric mean of precision and recall:
\(FMI = \sqrt{\frac{TP}{TP + FN} \times \frac{TP}{TP + FP}}\)
sklearn's implementation (excerpted from sklearn/metrics/cluster/_supervised.py; the helpers it uses, such as check_clusterings and _astype_copy_false, are defined or imported in that module):

def fowlkes_mallows_score(labels_true, labels_pred, *, sparse=False):
    """Measure the similarity of two clusterings of a set of points.
    .. versionadded:: 0.18
    The Fowlkes-Mallows index (FMI) is defined as the geometric mean of
    the precision and recall::
        FMI = TP / sqrt((TP + FP) * (TP + FN))
    Where ``TP`` is the number of **True Positives** (i.e. the number of pairs
    of points that belong to the same clusters in both ``labels_true`` and
    ``labels_pred``), ``FP`` is the number of **False Positives** (i.e. the
    number of pairs of points that belong to the same clusters in
    ``labels_true`` and not in ``labels_pred``) and ``FN`` is the number of
    **False Negatives** (i.e. the number of pairs of points that belong to the
    same clusters in ``labels_pred`` and not in ``labels_true``).
    The score ranges from 0 to 1. A high value indicates a good similarity
    between the two clusterings.
    Read more in the :ref:`User Guide <fowlkes_mallows_scores>`.
    Parameters
    ----------
    labels_true : int array, shape = (``n_samples``,)
        A clustering of the data into disjoint subsets.
    labels_pred : array, shape = (``n_samples``, )
        A clustering of the data into disjoint subsets.
    sparse : bool, default=False
        Compute contingency matrix internally with sparse matrix.
    Returns
    -------
    score : float
       The resulting Fowlkes-Mallows score.
    Examples
    --------
    Perfect labelings are both homogeneous and complete, hence have
    score 1.0::
      >>> from sklearn.metrics.cluster import fowlkes_mallows_score
      >>> fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1])
      1.0
      >>> fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0])
      1.0
    If class members are completely split across different clusters,
    the assignment is totally random, hence the FMI is null::
      >>> fowlkes_mallows_score([0, 0, 0, 0], [0, 1, 2, 3])
      0.0
    References
    ----------
    .. [1] `E. B. Fowlkes and C. L. Mallows, 1983. "A method for comparing two
       hierarchical clusterings". Journal of the American Statistical
       Association
       <http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf>`_
    .. [2] `Wikipedia entry for the Fowlkes-Mallows Index
           <https://en.wikipedia.org/wiki/Fowlkes-Mallows_index>`_
    """
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    n_samples, = labels_true.shape

    c = contingency_matrix(labels_true, labels_pred,
                           sparse=True)
    c = c.astype(np.int64, **_astype_copy_false(c))
    tk = np.dot(c.data, c.data) - n_samples
    pk = np.sum(np.asarray(c.sum(axis=0)).ravel() ** 2) - n_samples
    qk = np.sum(np.asarray(c.sum(axis=1)).ravel() ** 2) - n_samples
    return np.sqrt(tk / pk) * np.sqrt(tk / qk) if tk != 0. else 0.

2.2 Pair-wise F-Score

Pair-wise precision and recall are computed exactly as in FMI. The only difference is that where FMI takes the geometric mean of precision and recall, the pair-wise F-score takes their harmonic mean: \(F = \frac{2 \cdot P \cdot R}{P + R}\). A comparison of the two is sketched below.
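Using the pair_confusion sketch from section 2 and gt / pred_a from the introduction, both means can be computed side by side:

tp, fp, fn, tn = pair_confusion(gt, pred_a)

precision = tp / (tp + fn)  # note the swapped FP/FN convention from section 2
recall = tp / (tp + fp)
fmi = (precision * recall) ** 0.5                            # geometric mean
pairwise_f = 2 * precision * recall / (precision + recall)   # harmonic mean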

2.3 BCubed F-Score

Following the Linkage paper, the BCubed metrics are computed per item. For an item \(e\), let \(C(e)\) be the predicted cluster containing it and \(L(e)\) its ground-truth class; then
\(\text{Precision} = \frac{1}{N} \sum_{e} \frac{|C(e) \cap L(e)|}{|C(e)|}, \qquad \text{Recall} = \frac{1}{N} \sum_{e} \frac{|C(e) \cap L(e)|}{|L(e)|}\)
and the BCubed F-score is the harmonic mean of the two. This per-item view is the key to understanding the bcubed function later in this post, as well as the sketch below.
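A direct per-item transcription of these formulas (a minimal sketch; quadratic in N, but it makes the averaging explicit):

import numpy as np

def bcubed_per_item(gt_labels, pred_labels):
    gt = np.asarray(gt_labels)
    pred = np.asarray(pred_labels)
    n = len(gt)
    pre = np.empty(n)
    rec = np.empty(n)
    for e in range(n):
        in_pred_cluster = pred == pred[e]  # C(e): e's predicted cluster
        in_gt_class = gt == gt[e]          # L(e): e's ground-truth class
        overlap = np.sum(in_pred_cluster & in_gt_class)
        pre[e] = overlap / np.sum(in_pred_cluster)
        rec[e] = overlap / np.sum(in_gt_class)
    p, r = pre.mean(), rec.mean()
    return p, r, 2 * p * r / (p + r)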

2.4 NMI

NMI: Normalized Mutual Information
Cluster the data \(D=\{x_1, x_2, x_3, \ldots, x_n\}\). Suppose clustering produces p clusters \(C = \{C_1, C_2, \ldots, C_p\}\), and the ground-truth labels give q clusters \(C^* = \{C^*_1, C^*_2, \ldots, C^*_q\}\). Then:
\(\begin{aligned} MI\left(C, C^{*}\right) &=\sum_{i=1}^{p} \sum_{j=1}^{q} P\left(C_{i} \cap C_{j}^{*}\right) \log \frac{P\left(C_{i} \cap C_{j}^{*}\right)}{P\left(C_{i}\right) P\left(C_{j}^{*}\right)} \\ &=\sum_{i=1}^{p} \sum_{j=1}^{q} \frac{\left|C_{i} \cap C_{j}^{*}\right|}{n} \log \frac{n \cdot\left|C_{i} \cap C_{j}^{*}\right|}{\left|C_{i}\right|\left|C_{j}^{*}\right|} \end{aligned}\)
This formula can also be understood through the same per-item overlap counts \(|C_i \cap C_j^*|\) that appear in BCubed.

The \(H(C)\) below is an entropy, which measures the uncertainty of the clustering:
\(\begin{aligned} H(C) &=-\sum_{i=1}^{p} P\left(C_{i}\right) \log P\left(C_{i}\right) \\ &=-\sum_{i=1}^{p} \frac{\left|C_{i}\right|}{n} \log \left(\frac{\left|C_{i}\right|}{n}\right) \end{aligned}\)
From the relationship between mutual information and entropy, we have:
\(M I\left(C, C^{*}\right)=H(C)-H\left(C \mid C^{*}\right)\)
The larger the mutual information, the more closely the two clusterings are related (if \(H(C) = H\left(C \mid C^{*}\right)\), then knowing \(C^{*}\) provides no information that reduces the uncertainty of \(C\)). For a clustering result we would like MI to be large. However, \(MI(C, C^{*})\) is maximized when every sample is split into its own cluster (severe over-splitting: each cluster is perfectly pure, but recall is very low). To counteract this tendency we use normalized mutual information, because splitting into many clusters also inflates \(H(C)\) (entropy grows with uncertainty):
\(N M I(C,C^{*})=\frac{MI(C, C^{*})}{\sqrt{H(C) H(C^{*})}}\)
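A from-scratch sketch of the geometric-mean NMI above. Note that sklearn's normalized_mutual_info_score defaults to the arithmetic mean of the entropies; pass average_method='geometric' to match this formula:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def nmi_from_scratch(labels_true, labels_pred):
    c = contingency_matrix(labels_true, labels_pred).astype(float)
    n = c.sum()
    pij = c / n                            # P(C_i ∩ C*_j)
    pi = pij.sum(axis=1, keepdims=True)    # P(C_i), shape (p, 1)
    pj = pij.sum(axis=0, keepdims=True)    # P(C*_j), shape (1, q)
    nz = pij > 0                           # skip empty cells: 0 * log(0) := 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz]))
    h_true = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_pred = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    return mi / np.sqrt(h_true * h_pred)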

3 Metric Computation Code

The code below is adapted from GitHub: learn-to-cluster.

from __future__ import division
import numpy as np
from sklearn.metrics.cluster import contingency_matrix, normalized_mutual_info_score

def _check(gt_labels, pred_labels):
    if gt_labels.ndim != 1:
        raise ValueError("gt_labels must be 1D: shape is {}".format(gt_labels.shape))
    if pred_labels.ndim != 1:
        raise ValueError("pred_labels must be 1D: shape is {}".format(pred_labels.shape))
    if pred_labels.shape != gt_labels.shape:
        raise ValueError("gt_labels and pred_labels must have same size, got {} and {}".format(gt_labels.shape, pred_labels.shape))
    return gt_labels, pred_labels

def _get_lb2idxs(labels):
    lb2idxs = {}
    for idx, lb in enumerate(labels):
        if lb not in lb2idxs:
            lb2idxs[lb] = []
        lb2idxs[lb].append(idx)
    return lb2idxs
        
def fowlkes_mallows_score(gt_labels, pred_labels, sparse=True):
    # Despite the name, this variant returns the pair-wise F-score, i.e. the
    # harmonic (not geometric) mean of pair-wise precision and recall.
    n_samples, = gt_labels.shape

    c = contingency_matrix(gt_labels, pred_labels, sparse=sparse)
    tk = np.dot(c.data, c.data) - n_samples   # 2 * TP
    pk = np.sum(np.asarray(c.sum(axis=0)).ravel()**2) - n_samples  # 2 * (TP + FN)
    qk = np.sum(np.asarray(c.sum(axis=1)).ravel()**2) - n_samples  # 2 * (TP + FP)
    avg_pre = tk / pk
    avg_rec = tk / qk
    fscore = 2. * avg_pre * avg_rec / (avg_pre + avg_rec)

    return fscore

def pairwise(gt_labels, pred_labels, sparse=True):
    _check(gt_labels, pred_labels)
    return fowlkes_mallows_score(gt_labels, pred_labels, sparse)

def bcubed(gt_labels, pred_labels):
    _check(gt_labels, pred_labels)
    gt_lb2idxs = _get_lb2idxs(gt_labels)
    pred_lb2idxs = _get_lb2idxs(pred_labels)

    num_lbs = len(gt_lb2idxs)
    pre = np.zeros(num_lbs)
    rec = np.zeros(num_lbs)
    gt_num = np.zeros(num_lbs)

    for idx, gt_idxs in enumerate(gt_lb2idxs.values()):
        all_pred_lbs = np.unique(pred_labels[gt_idxs])
        gt_num[idx] = len(gt_idxs)
        for pred_lb in all_pred_lbs:
            pred_idxs = pred_lb2idxs[pred_lb]
            n = 1. * np.intersect1d(gt_idxs, pred_idxs).size
            # Why n**2: each of the n overlapping items has a per-item
            # precision of n / |pred cluster|, so this cluster pair adds
            # n * (n / p). Per ground-truth class the sum looks like
            # [n1 * (n1 / p1) + n2 * (n2 / p2) + ... + nk * (nk / pk)],
            # where the class size is n1 + n2 + ... + nk.
            pre[idx] += n**2 / len(pred_idxs)
            rec[idx] += n**2 / gt_num[idx]

    gt_sum = gt_num.sum()
    avg_pre = pre.sum() / gt_sum
    avg_rec = rec.sum() / gt_sum
    fscore = 2. * avg_pre * avg_rec / (avg_pre + avg_rec)
    return avg_pre, avg_rec, fscore

def nmi(gt_labels, pred_labels):
    # NMI is symmetric, so the argument order does not matter.
    return normalized_mutual_info_score(gt_labels, pred_labels)
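A quick usage check on the running example from the introduction (assuming the fixed functions above are in scope):

gt = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
pred = np.array([1, 2, 0, 2, 2, 2, 3, 3, 3, 0])

print(pairwise(gt, pred))  # pair-wise F-score
print(bcubed(gt, pred))    # (precision, recall, F-score)
print(nmi(gt, pred))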

Another way to implement BCubed:

import numpy as np
from scipy.sparse import coo_matrix

# Only the functions actually defined in this excerpt are exported.
__all__ = ['bcubed', 'contingency_matrix']


def contingency_matrix(ref_labels, sys_labels):
    """Return contingency matrix between ``ref_labels`` and ``sys_labels``."""
    ref_classes, ref_class_inds = np.unique(ref_labels, return_inverse=True)
    sys_classes, sys_class_inds = np.unique(sys_labels, return_inverse=True)
    n_frames = ref_labels.size
    # Following works because coo_matrix sums duplicate entries. Is roughly
    # twice as fast as np.histogram2d.
    cmatrix = coo_matrix(
        (np.ones(n_frames), (ref_class_inds, sys_class_inds)),
        shape=(ref_classes.size, sys_classes.size),
        dtype=np.int64)  # np.int is removed in recent NumPy; use a concrete dtype
    cmatrix = cmatrix.toarray()
    return cmatrix, ref_classes, sys_classes


def bcubed(ref_labels, sys_labels, cm=None):
    """Return B-cubed precision, recall, and F1.

    The B-cubed precision of an item is the proportion of items with its
    system label that share its reference label (Bagga and Baldwin, 1998).
    Similarly, the B-cubed recall of an item is the proportion of items
    with its reference label that share its system label. The overall B-cubed
    precision and recall, then, are the means of the precision and recall for
    each item.

    Parameters
    ----------
    ref_labels : ndarray, (n_frames,)
        Reference labels.

    sys_labels : ndarray, (n_frames,)
        System labels.

    cm : ndarray, (n_ref_classes, n_sys_classes)
        Contingency matrix between reference and system labelings. If None,
        will be computed automatically from ``ref_labels`` and ``sys_labels``.
        Otherwise, the given value will be used and ``ref_labels`` and
        ``sys_labels`` ignored.
        (Default: None)

    Returns
    -------
    precision : float
        B-cubed precision.

    recall : float
        B-cubed recall.

    f1 : float
        B-cubed F1.

    References
    ----------
    Bagga, A. and Baldwin, B. (1998). "Algorithms for scoring coreference
    chains." Proceedings of LREC 1998.
    """
    if cm is None:
        cm, _, _ = contingency_matrix(ref_labels, sys_labels)
    cm = cm.astype('float64')
    cm_norm = cm / cm.sum()
    precision = np.sum(cm_norm * (cm / cm.sum(axis=0)))
    recall = np.sum(cm_norm * (cm / np.expand_dims(cm.sum(axis=1), 1)))
    f1 = 2*(precision*recall)/(precision + recall)
    return precision, recall, f1
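A quick check of this version on the running example (it should agree with the bcubed from section 3, up to floating point):

ref = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
sys = np.array([1, 2, 0, 2, 2, 2, 3, 3, 3, 0])
p, r, f1 = bcubed(ref, sys)
print(p, r, f1)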
