Fuzzy k-means Clustering
Continued from the previous post: An Overview of Clustering Algorithms
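Fuzzy k-means is the fuzzy form of k-means clustering. Unlike k-means, which performs exclusive clustering, fuzzy k-means tries to produce overlapping clusters from a data set. In the research literature it is also known as the fuzzy c-means algorithm, and it can be viewed as an extension of k-means.
k-means looks for hard clusters, in which each data point belongs to exactly one cluster. In a soft clustering algorithm, a point can belong to more than one cluster, with a certain degree of affinity to each of them. That affinity depends on the distance from the point to the cluster center: the closer the center, the stronger the affinity.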
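As with k-means, the Mahout implementation consists of FuzzyKMeansClusterer and FuzzyKMeansDriver: the former runs in memory and the latter runs as a MapReduce job.
Fuzzy k-means has a parameter m, called the fuzziness factor. Unlike k-means, which assigns each vector to its nearest center, fuzzy k-means uses m to compute the degree of association between each point and every cluster.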
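Suppose a vector V has distances d1, d2, ..., dk to the k cluster centers. The degree of association u1 between V and the first cluster is computed as follows:

$$u_1 = \frac{1}{\left(\frac{d_1}{d_1}\right)^{\frac{2}{m-1}} + \left(\frac{d_1}{d_2}\right)^{\frac{2}{m-1}} + \cdots + \left(\frac{d_1}{d_k}\right)^{\frac{2}{m-1}}}$$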
In other words, the closer a cluster center is to the vector, the higher the weight that cluster receives.
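To make the weighting concrete, here is a minimal, self-contained sketch that evaluates the same expression for every cluster at once. The class name FuzzyMembership and the method name memberships are hypothetical and are not part of Mahout; the zero-distance guard simply mirrors the MINIMAL_VALUE idea used in the Mahout class.

```java
import java.util.Arrays;

final class FuzzyMembership {
    // Stand-in for a zero distance, so the ratio d_i / d_j never divides by zero.
    private static final double MINIMAL_VALUE = 1e-10;

    private static double guard(double distance) {
        return distance == 0.0 ? MINIMAL_VALUE : distance;
    }

    /** Returns u[i] = 1 / sum_j (d[i] / d[j])^(2 / (m - 1)) for every cluster i. */
    static double[] memberships(double[] distances, double m) {
        if (m <= 1.0) {
            throw new IllegalArgumentException("m must be greater than 1");
        }
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double di = guard(distances[i]);
            double denom = 0.0;
            for (double d : distances) {
                denom += Math.pow(di / guard(d), exponent);
            }
            u[i] = 1.0 / denom;
        }
        return u;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 2.0, 4.0};
        System.out.println(Arrays.toString(memberships(distances, 2.0)));
        System.out.println(Arrays.toString(memberships(distances, 3.0)));
    }
}
```

With distances 1, 2 and 4 and m = 2, the memberships work out to 16/21, 4/21 and 1/21 (roughly 0.76, 0.19 and 0.05), which sum to 1. Raising m to 3 flattens them to 4/7, 2/7 and 1/7, so a larger fuzziness factor means more overlap between clusters.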
In Mahout, this formula is implemented in FuzzyKMeansClusterer:
package org.apache.mahout.clustering.fuzzykmeans;

import java.util.Collection;
import java.util.List;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class FuzzyKMeansClusterer {

  private static final double MINIMAL_VALUE = 0.0000000001;

  private double m = 2.0; // default value

  public Vector computePi(Collection<SoftCluster> clusters, List<Double> clusterDistanceList) {
    Vector pi = new DenseVector(clusters.size());
    for (int i = 0; i < clusters.size(); i++) {
      double probWeight = computeProbWeight(clusterDistanceList.get(i), clusterDistanceList);
      pi.set(i, probWeight);
    }
    return pi;
  }

  /** Computes the probability of a point belonging to a cluster */
  public double computeProbWeight(double clusterDistance, Iterable<Double> clusterDistanceList) {
    if (clusterDistance == 0) {
      clusterDistance = MINIMAL_VALUE;
    }
    double denom = 0.0;
    for (double eachCDist : clusterDistanceList) {
      if (eachCDist == 0.0) {
        eachCDist = MINIMAL_VALUE;
      }
      denom += Math.pow(clusterDistance / eachCDist, 2.0 / (m - 1));
    }
    return 1.0 / denom;
  }

  public void setM(double m) {
    this.m = m;
  }
}
The distributed version is driven by FuzzyKMeansDriver:
  /**
   * Iterate over the input vectors to produce cluster directories for each iteration
   *
   * @param input
   *          the directory pathname for input points
   * @param clustersIn
   *          the file pathname for initial cluster centers
   * @param output
   *          the directory pathname for output points
   * @param convergenceDelta
   *          the convergence delta value
   * @param maxIterations
   *          the maximum number of iterations
   * @param m
   *          the fuzzification factor, see
   *          http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering
   * @param runSequential if true run in sequential execution mode
   *
   * @return the Path of the final clusters directory
   */
  public static Path buildClusters(Configuration conf,
                                   Path input,
                                   Path clustersIn,
                                   Path output,
                                   double convergenceDelta,
                                   int maxIterations,
                                   float m,
                                   boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException {

    List<Cluster> clusters = Lists.newArrayList();
    FuzzyKMeansUtil.configureWithClusterInfo(conf, clustersIn, clusters);

    if (conf == null) {
      conf = new Configuration();
    }

    if (clusters.isEmpty()) {
      throw new IllegalStateException("No input clusters found in " + clustersIn + ". Check your -c argument.");
    }

    Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
    ClusteringPolicy policy = new FuzzyKMeansClusteringPolicy(m, convergenceDelta);
    ClusterClassifier prior = new ClusterClassifier(clusters, policy);
    prior.writeToSeqFiles(priorClustersPath);

    if (runSequential) {
      ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
    } else {
      ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
    }
    return output;
  }
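To show what the convergenceDelta, maxIterations and m parameters control, the following is a rough in-memory sketch of the iteration on one-dimensional points. It is not Mahout code: the class FuzzyKMeansSketch and its methods are hypothetical, the center update uses the textbook fuzzy c-means rule (each point weighted by its membership raised to the power m), and the stopping test (largest center shift below convergenceDelta) is an assumption about how such a threshold is typically applied.

```java
final class FuzzyKMeansSketch {
    private static final double MINIMAL_VALUE = 1e-10;

    /** Membership of one point in each cluster, given its distance to each center. */
    static double[] memberships(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double di = Math.max(distances[i], MINIMAL_VALUE);
            double denom = 0.0;
            for (double d : distances) {
                denom += Math.pow(di / Math.max(d, MINIMAL_VALUE), exponent);
            }
            u[i] = 1.0 / denom;
        }
        return u;
    }

    /**
     * Updates centers in place until no center moves by convergenceDelta or more,
     * or maxIterations is reached. Returns the number of iterations performed.
     */
    static int iterate(double[] points, double[] centers, double m,
                       double convergenceDelta, int maxIterations) {
        if (points.length == 0 || m <= 1.0) {
            throw new IllegalArgumentException("need at least one point and m > 1");
        }
        int k = centers.length;
        for (int iter = 1; iter <= maxIterations; iter++) {
            double[] weightedSum = new double[k];
            double[] weightTotal = new double[k];
            for (double p : points) {
                double[] dist = new double[k];
                for (int i = 0; i < k; i++) {
                    dist[i] = Math.abs(p - centers[i]);
                }
                double[] u = memberships(dist, m);
                for (int i = 0; i < k; i++) {
                    double w = Math.pow(u[i], m); // every point contributes to every center
                    weightedSum[i] += w * p;
                    weightTotal[i] += w;
                }
            }
            double maxShift = 0.0;
            for (int i = 0; i < k; i++) {
                double updated = weightedSum[i] / weightTotal[i];
                maxShift = Math.max(maxShift, Math.abs(updated - centers[i]));
                centers[i] = updated;
            }
            if (maxShift < convergenceDelta) {
                return iter;
            }
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        double[] points = {0.8, 1.0, 1.2, 8.8, 9.0, 9.2};
        double[] centers = {0.0, 5.0};
        int iterations = iterate(points, centers, 2.0, 1e-6, 100);
        System.out.println(iterations + " iterations, centers = "
            + centers[0] + ", " + centers[1]);
    }
}
```

Starting from centers 0 and 5, the two centers should settle close to the two groups of points (near 1 and near 9). Because every point contributes to every center, they land near the group means rather than exactly on them.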