KMeans and optimization

  • random scheme, i.e. the naive approach

input: k, set of n points  

place k centroids at random locations

  • repeat the following operations until convergence

--for each point i:

  1. find the nearest of the k centroids, centroid j (using a distance formula)
  2. put point i into cluster j

--for each cluster j:

  1. compute, over every point in cluster j, the mean of all attributes, and move centroid j to that mean

(attributes must be numeric; they cannot be categorical or ordinal)

  • stop when none of the cluster assignments change, i.e. no point changes its cluster membership
  • time: O(iterations * k * n * dimensions), i.e. O(kn) per iteration; memory: O(k + n)
  • nothing can be precached: the centroids change on every iteration
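
For a quick end-to-end check, scikit-learn's KMeans runs this same Lloyd iteration (shown here only as a reference sketch; defaults such as the seeding method and n_init vary across versions):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)          # 100 points with 2 numeric attributes
km = KMeans(n_clusters=3).fit(X)    # Lloyd iterations, k-means++ seeding by default
print(km.cluster_centers_)          # final centroids
print(km.labels_[:10])              # cluster membership of the first 10 points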

  • optimization

1. k-means++ (using an adaptive sampling scheme): slow but small error; random selection: extremely fast, but large error

https://blog.csdn.net/the_lastest/article/details/78288955

Main idea: improve the initialization of the cluster centers.
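
A minimal sketch of the seeding step, assuming Euclidean distance (function and variable names are illustrative): each new center is drawn with probability proportional to D(x)^2, the squared distance from x to the nearest center already chosen.

import numpy as np

def kmeans_pp_init(dataset, k, seed=None):
    """k-means++ seeding: sample each new center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    n = dataset.shape[0]
    centers = [dataset[rng.integers(n)]]              # first center: uniform
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((dataset - c) ** 2, axis=1) for c in centers],
                    axis=0)
        centers.append(dataset[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)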

 

2. AFK-MC²: using a Markov chain to improve on k-means++ seeding

  • AFK-MC² changes how the seeding is done

paper :https://las.inf.ethz.ch/files/bachem16fast.pdf

The data points are the states of the Markov chain.

A further data point is sampled to act as the candidate for the next state.

A randomized decision (a Metropolis-Hastings acceptance step) determines whether the chain transitions to the candidate or remains in the old state.

This repeats for a fixed chain length, and the last state is returned as the next initial cluster center.
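
A compressed sketch of that loop, following the paper's description (the chain length m, the epsilon guard, and all names here are illustrative choices, not the paper's reference code):

import numpy as np

def afkmc2_seeding(X, k, m=200, seed=None):
    """AFK-MC^2-style seeding: approximate k-means++ sampling with a
    short Metropolis-Hastings chain per center instead of a full pass."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform
    # assumption-free proposal: mix D^2 w.r.t. the first center with uniform
    d2_c0 = np.sum((X - centers[0]) ** 2, axis=1)
    q = 0.5 * d2_c0 / d2_c0.sum() + 0.5 / n
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        x = rng.choice(n, p=q)                     # initial state of the chain
        for _ in range(m):
            y = rng.choice(n, p=q)                 # candidate next state
            # randomized (Metropolis-Hastings) acceptance decision
            if d2[y] * q[x] > (d2[x] * q[y] + 1e-12) * rng.random():
                x = y                              # transition to the candidate
        centers.append(X[x])                       # last state -> next center
    return np.array(centers)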

  • code 
  1. Euclidean distance: np.linalg.norm(a - b)
  2. load the data: np.loadtxt(name)
  3. variables:

    parameter: epsilon = 0  # threshold, the minimum change used in the stop condition

    history_centroids = []

    record the configuration: num_instances, num_features = dataset.shape

    initial centroids: prototypes = dataset[np.random.randint(0, num_instances, size=k)]  # high is exclusive

    an array of the same shape as prototypes (k rows of num_features each), holding the previous centroids: prototypes_old = np.zeros(prototypes.shape)

    cluster membership per instance: belongs_to = np.zeros((num_instances, 1))

  4. the iteration loop:

iteration = 0
norm = 1   # any value > epsilon, so the loop is entered

while norm > epsilon:
    iteration += 1
    # change in the centroids between iterations, used as the stop condition
    norm = dist_method(prototypes, prototypes_old)
    prototypes_old = prototypes

    # assignment step: put each instance into the cluster of its nearest centroid
    for index_in, instance in enumerate(dataset):
        dist_vec = np.zeros((k, 1))
        for index_prototype, prototype in enumerate(prototypes):
            dist_vec[index_prototype] = dist_method(prototype, instance)
        belongs_to[index_in, 0] = np.argmin(dist_vec)

    # update step: move each centroid to the mean of its assigned instances
    # (assumes no cluster ends up empty)
    tmp_prototypes = np.zeros((k, num_features))
    for index in range(k):
        instances_close = [i for i in range(num_instances)
                           if belongs_to[i, 0] == index]
        tmp_prototypes[index, :] = np.mean(dataset[instances_close], axis=0)
    prototypes = tmp_prototypes
    history_centroids.append(tmp_prototypes)

 

  • scaling n,k

sampling and approximation approaches: these work poorly, and the clustering gets worse as k grows.

smarter initial centroid selection (seeding) and per-iteration pruning: e.g. the 'blacklist', 'Elkan's', and 'Hamerly's' algorithms

  • blacklist algorithm  

build a tree over the data; then, while iterating over all centroids, rule some of them out (blacklist them) for whole nodes of the tree.

setup cost: O(n lg n) to build the tree; worst-case computation: O(kn lg n); memory: O(k + n lg n)
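
A toy version of the pruning test, using axis-aligned bounding boxes (all names are hypothetical, and this conservative box test is simpler than the hyperplane test in the actual blacklist algorithm): a centroid is blacklisted for a node, and hence for its whole subtree, when some other centroid is provably closer to every point the node's box can contain.

import numpy as np

def min_dist2_to_box(c, lo, hi):
    # squared distance from centroid c to the closest point of the box [lo, hi]
    return np.sum((c - np.clip(c, lo, hi)) ** 2)

def max_dist2_to_box(c, lo, hi):
    # squared distance from centroid c to the farthest corner of the box
    far = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return np.sum((c - far) ** 2)

def blacklisted(centroids, lo, hi):
    """Indices of centroids that cannot own any point inside the box,
    because some other centroid beats them everywhere in it."""
    best_max = min(max_dist2_to_box(c, lo, hi) for c in centroids)
    return [i for i, c in enumerate(centroids)
            if min_dist2_to_box(c, lo, hi) > best_max]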

  • 'Elkan's'  

compute the distances between the centroids, and use them to bound the point-to-centroid distances (triangle inequality), cutting down the number of distance computations

no setup cost; worst case: O(k^2 + kn); memory: O(k^2 + kn)
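
The core rule as a sketch, assuming Euclidean distance (the full algorithm also carries per-point upper and lower bounds across iterations, which this omits): if d(x, c) <= d(c, c')/2 for the current best centroid c, the triangle inequality guarantees c' is no closer, so d(x, c') is never computed.

import numpy as np

def assign_with_elkan_rule(X, centroids):
    """One assignment pass using only the d(c, c')/2 pruning rule."""
    # pairwise centroid-to-centroid distances, computed once per pass
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    labels = np.zeros(len(X), dtype=int)
    for i, x in enumerate(X):
        best, d_best = 0, np.linalg.norm(x - centroids[0])
        for j in range(1, len(centroids)):
            if d_best <= cc[best, j] / 2:
                continue            # triangle inequality: j cannot be closer
            d = np.linalg.norm(x - centroids[j])
            if d < d_best:
                best, d_best = j, d
        labels[i] = best
    return labels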

  • Dual-Tree k-means with bounded single-iteration runtime

paper: http://www.ratml.org/pub/pdf/2016dual.pdf

  1. build two trees: a query tree T and a reference tree Q. T stores the points whose nearest neighbors we want (one query task per instance); Q stores the set the nearest neighbors are drawn from
  2. traverse both trees simultaneously; when a pair (T.node, Q.node) is visited, check whether it can be pruned, and if so prune the whole subtree (the same framework applies to nearest-neighbor search, kernel density estimation, kernel conditional density estimation, etc.)
  3. space tree: not a space-partitioning tree, since nodes are allowed to overlap; formally, an undirected acyclic rooted simple graph in which
    1. each node holds any number of points (possibly 0), is connected to one parent, and has any number of children (possibly 0)
    2. there is a single root node
    3. every point is contained in at least one tree node
    4. each node has a convex subset of the multidimensional space containing all points in the node as well as the convex subsets represented by its children, i.e. every node has a bounding shape that contains all of its descendant points
  4. traverse

each pair (a combination of a T node and a Q node) is visited no more than once, and a score is computed for the combination

if the score is greater than the bound, or infinite, the combination is pruned; otherwise the computation runs between every point of the T node and every point of the Q node, rather than between every pair of descendant points

when the traversal reaches the leaves of the trees, the base case is called

!!: dual-tree algorithm = space tree + pruning dual-tree traversal + BaseCase() + Score(). See the paper linked above for a deeper treatment.
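
A schematic of that decomposition (all names are illustrative; this shows the generic framework only, not the bounded-runtime bookkeeping of the paper):

import numpy as np

class Node:
    """Minimal space-tree node: the points it holds, its children, and a
    bounding ball (a convex shape) covering every descendant point."""
    def __init__(self, points, children=()):
        self.points = np.atleast_2d(points)
        self.children = list(children)
        self.all_points = np.vstack(
            [self.points] + [c.all_points for c in self.children])
        self.center = self.all_points.mean(axis=0)
        self.radius = np.linalg.norm(
            self.all_points - self.center, axis=1).max()

def traverse(t, q, score, base_case):
    """Pruning dual-tree traversal: each (T node, Q node) combination is
    visited at most once; an infinite Score prunes both subtrees at once."""
    if score(t, q) == float('inf'):
        return                                   # prune the whole combination
    for p in t.points:                           # base case between the nodes'
        for r in q.points:                       # own points, not descendants
            base_case(p, r)
    for tc in (t.children or [t]):               # recurse into child pairs;
        for qc in (q.children or [q]):           # a leaf is paired with the
            if tc is not t or qc is not q:       # other side's children
                traverse(tc, qc, score, base_case)

# example Score for nearest-neighbor search: prune when even the closest
# possible pair of points in the two bounding balls is farther apart than
# the best distance found so far (best_so_far would be closed over, e.g.
# score = lambda t, q: nn_score(t, q, current_best))
def nn_score(t, q, best_so_far):
    gap = np.linalg.norm(t.center - q.center) - t.radius - q.radius
    return float('inf') if gap > best_so_far else gap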

Summary:

k-means depends on its initial centers: a poor start can give large error and a local rather than global optimum, and it also affects the number of iterations. Optimizations either improve the selection of initial centers with another algorithm first, such as Canopy clustering or hierarchical clustering, or provide a method for fixing k, e.g. using cluster radius or diameter as an indicator, which changes sharply while k is below its true value. The remaining concern is scalability: reducing the number of iterations or of distance computations, as shown above, by bringing in Markov chain (MCMC) seeding or tree structures to speed up the iterative process.

posted on 2017-09-23 04:50 by satyrs