Hadoop and Spark Processing Techniques (6): Clustering Algorithms (2) K-means

The K-means algorithm tries to partition a set of samples into K distinct clusters, where K is an input parameter of the model.
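The partitioning goal above is usually written as minimizing the within-cluster sum of squared distances. The notation below is introduced here for illustration and does not come from the original post: x_i are the samples, C_j is the j-th cluster, and mu_j is its mean.

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
\]
```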

K-means

K-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||. The implementation in spark.mllib has the following parameters:

k is the number of desired clusters. Note that it is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
maxIterations is the maximum number of iterations to run.
initializationMode specifies either random initialization or initialization via k-means||.
runs has no effect since Spark 2.0.0.
initializationSteps determines the number of steps in the k-means|| algorithm.
epsilon determines the distance threshold within which we consider k-means to have converged.
initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
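To make the roles of k, maxIterations, epsilon and initialModel concrete, here is a minimal single-machine sketch of the standard k-means (Lloyd) iteration in plain Python. It is not Spark's implementation: the function name `kmeans` and its argument names are hypothetical and only mirror the parameters listed above, and the convergence test (largest center movement at most epsilon) is an assumption about how such a threshold is typically applied.

```python
import math
import random


def kmeans(points, k, max_iterations=20, epsilon=1e-4, initial_centers=None, seed=0):
    """Single-machine sketch of Lloyd's iteration (hypothetical helper, not Spark code)."""
    if initial_centers is not None:
        # Counterpart of initialModel: start from the supplied centers.
        centers = [tuple(c) for c in initial_centers]
    else:
        # Random initialization from distinct points, so fewer than k centers
        # come back when there are fewer than k distinct points.
        distinct = sorted(set(points))
        rng = random.Random(seed)
        centers = rng.sample(distinct, min(k, len(distinct)))

    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest center.
        groups = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, members in groups.items():
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # an empty cluster keeps its center

        # Stop once no center moved farther than epsilon.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= epsilon:
            break
    return centers
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2, initial_centers=[(0.0, 0.0), (10.0, 10.0)])` settles on the two group means, while calling it with k=5 on data that has only two distinct points returns two centers.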

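(1) k is the number of clusters. Fewer than k clusters may be returned, for example when there are fewer than k distinct samples.

(2) maxIterations is the maximum number of iterations.

(3) initializationMode selects the initialization method: either random initialization or k-means||.

(4) runs has had no effect since Spark 2.0.0.

(5) initializationSteps is the number of steps in the k-means|| algorithm.

(6) epsilon is the distance threshold within which k-means is considered to have converged.

(7) initialModel is an optional set of cluster centers used for initialization. If it is supplied, only one run is performed.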
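For intuition about the non-random initialization mode, the sketch below shows the sequential k-means++ seeding rule, which the text above says k-means|| parallelizes. It is deliberately not k-means|| itself and not Spark code; the function name `kmeans_pp_seed` and its arguments are hypothetical.

```python
import math
import random


def kmeans_pp_seed(points, k, seed=0):
    """Sequential k-means++ seeding (a sketch, not Spark's parallel k-means||)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: one point chosen uniformly
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(weights) == 0:
            break  # every point coincides with a chosen center
        # Pick the next center with probability proportional to that weight,
        # so points far from the existing centers are favored.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Because a point that coincides with an existing center has weight zero, the chosen centers are always distinct points, and the loop stops early when the data has fewer than k distinct points.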

posted @ 2020-05-19 16:20 疯狂摇头的青蛙
