K-means聚类的Python实现
生物信息学原理作业第五弹:K-means聚类的实现。
转载请保留出处!
原理参考:K-means聚类(上)
数据是老师给的,二维,2 * 3800的数据。plot一下可以看到有7类。
怎么确定分类个数我正在学习,这个脚本就直接给了初始分类了,等我学会了再发。
下面贴上Python代码,版本为Python3.6。
1 # -*- coding: utf-8 -*- 2 """ 3 Created on Wed Dec 6 16:01:17 2017 4 5 @author: zxzhu 6 """ 7 import numpy as np 8 import matplotlib.pyplot as plt 9 from numpy import random 10 11 def Distance(x): 12 def Dis(y): 13 return np.sqrt(sum((x-y)**2)) #欧式距离 14 return Dis 15 16 def init_k_means(k): 17 k_means = {} 18 for i in range(k): 19 k_means[i] = [] 20 return k_means 21 22 def cal_seed(k_mean): #重新计算种子点 23 k_mean = np.array(k_mean) 24 new_seed = np.mean(k_mean,axis=0) #各维度均值 25 return new_seed 26 27 def K_means(data,seed_k,k_means): 28 for i in data: 29 f = Distance(i) 30 dis = list(map(f,seed_k)) #某一点距所有种子点的距离 31 index = dis.index(min(dis)) 32 k_means[index].append(i) 33 34 new_seed = [] #存储新种子 35 for i in range(len(seed_k)): 36 new_seed.append(cal_seed(k_means[i])) 37 new_seed = np.array(new_seed) 38 return k_means,new_seed 39 40 def run_K_means(data,k): 41 seed_k = data[random.randint(len(data),size=k)] #随机产生种子点 42 k_means = init_k_means(k) #初始化每一类 43 result = K_means(data,seed_k,k_means) 44 count = 0 45 while not (result[1] == seed_k).all(): #种子点改变,继续聚类 46 count+=1 47 seed_k = result[1] 48 k_means = init_k_means(k=7) 49 result = K_means(data,seed_k,k_means) 50 print('Done') 51 #print(result[1]) 52 print(count) 53 plt.figure(figsize=(8,8)) 54 Color = 'rbgyckm' 55 for i in range(k): 56 mydata = np.array(result[0][i]) 57 plt.scatter(mydata[:,0],mydata[:,1],color = Color[i]) 58 return result[0] 59 60 data = np.loadtxt('K-means_data') 61 run_K_means(data,k=7)
附上结果图:
这个算法太依赖于初始种子点的选取了,随机选点很有可能会得到局部最优的结果,所以下一步学习一下怎么设置初始种子点以及分类数目。
Talk is cheap,show me your code!