K-means
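1. Distance computation
Given samples $x_i = (x_{i1}; x_{i2}; \dots; x_{in})$ and $x_j = (x_{j1}; x_{j2}; \dots; x_{jn})$, each described by $n$ attributes.
Distance for continuous attributes
Minkowski distance:
$$\mathrm{dist}_{mk}(x_i, x_j) = \left( \sum_{u=1}^{n} |x_{iu} - x_{ju}|^p \right)^{1/p}$$
With $p = 2$ this is the Euclidean distance, and with $p = 1$ it is the Manhattan distance.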
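As a minimal sketch of the Minkowski distance (the function name `minkowski` and the two toy vectors are chosen here purely for illustration), the computation might look as follows in NumPy. The absolute value matters when $p$ is odd, since negative differences would otherwise cancel.

```python
import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance between two equal-length vectors."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    # Sum of |difference|^p over all attributes, then the p-th root.
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0
print(minkowski(a, b, p=1))  # Manhattan distance: 7.0
```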
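Distance for discrete attributes
VDM (Value Difference Metric) distance:
Let $m_{u,a}$ be the number of samples whose attribute $u$ takes the value $a$, $m_{u,a,i}$ the number of those samples that fall in the $i$-th cluster, and $k$ the number of clusters. The VDM distance between two values $a$ and $b$ of attribute $u$ is
$$\mathrm{VDM}_p(a, b) = \sum_{i=1}^{k} \left| \frac{m_{u,a,i}}{m_{u,a}} - \frac{m_{u,b,i}}{m_{u,b}} \right|^p$$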
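To make the counts in the VDM distance concrete, here is a small sketch on invented toy data. The function name `vdm`, the colour values, and the cluster labels are all assumptions made for this example, and the code assumes both values actually occur in the data (otherwise a count would be zero and the division undefined).

```python
import numpy as np

def vdm(values, clusters, a, b, p=2):
    """VDM distance between values a and b of one discrete attribute.

    values:   attribute value of every sample
    clusters: cluster index of every sample
    """
    values = np.asarray(values)
    clusters = np.asarray(clusters)
    m_a = np.sum(values == a)  # number of samples with value a
    m_b = np.sum(values == b)  # number of samples with value b
    total = 0.0
    for i in np.unique(clusters):
        m_a_i = np.sum((values == a) & (clusters == i))  # ... with value a in cluster i
        m_b_i = np.sum((values == b) & (clusters == i))  # ... with value b in cluster i
        total += abs(m_a_i / m_a - m_b_i / m_b) ** p
    return float(total)

# Toy data: a colour attribute for six samples spread over two clusters.
colour = ["green", "green", "green", "black", "black", "white"]
cluster = [0, 0, 1, 1, 1, 0]
print(vdm(colour, cluster, "green", "black"))  # |2/3 - 0|^2 + |1/3 - 1|^2 = 8/9
```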
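Mixed attributes
Suppose there are $n_c$ continuous attributes and $n - n_c$ discrete attributes, with the continuous attributes ordered before the discrete ones. The distance between two samples is then
$$\mathrm{MinkovDM}_p(x_i, x_j) = \left( \sum_{u=1}^{n_c} |x_{iu} - x_{ju}|^p + \sum_{u=n_c+1}^{n} \mathrm{VDM}_p(x_{iu}, x_{ju}) \right)^{1/p}$$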
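The mixed distance can be sketched as below. This is only an illustration under stated assumptions: `minkov_dm` and the `vdm_tables` lookup structure are hypothetical names, the VDM values are assumed to be precomputed, and the discrete attribute and its VDM value are toy numbers.

```python
def minkov_dm(x_i, x_j, n_c, vdm_tables, p=2):
    """Mixed distance: the first n_c attributes are continuous, the rest discrete.

    vdm_tables[u][(a, b)] holds a precomputed VDM_p(a, b) for discrete attribute u
    (a hypothetical lookup structure used only in this sketch).
    """
    # Continuous part: sum of |difference|^p over the first n_c attributes.
    cont = sum(abs(float(x_i[u]) - float(x_j[u])) ** p for u in range(n_c))
    # Discrete part: sum of VDM_p over the remaining attributes.
    disc = 0.0
    for u in range(n_c, len(x_i)):
        if x_i[u] != x_j[u]:  # VDM_p(a, a) is 0, so equal values add nothing
            disc += vdm_tables[u][(x_i[u], x_j[u])]
    return (cont + disc) ** (1.0 / p)

# Two samples: two continuous attributes followed by one discrete attribute.
tables = {2: {("green", "black"): 8 / 9, ("black", "green"): 8 / 9}}
s1 = (0.697, 0.460, "green")
s2 = (0.774, 0.376, "black")
print(minkov_dm(s1, s2, n_c=2, vdm_tables=tables))
```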
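K-means
Objective
Given a sample set $D = \{x_1, x_2, \dots, x_m\}$, the k-means algorithm minimizes the squared error of the resulting cluster partition $C = \{C_1, C_2, \dots, C_k\}$:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$
where $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the mean vector of cluster $C_i$.
$|C_i|$ denotes the number of elements in cluster $C_i$.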
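The squared error $E$ can be evaluated directly from a given partition. The following is a minimal sketch; the function name `squared_error` and the two toy clusters are assumptions made for illustration.

```python
import numpy as np

def squared_error(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        points = np.asarray(points, dtype=float)
        mu = points.mean(axis=0)  # mean vector of this cluster
        total += float(np.sum((points - mu) ** 2))
    return total

clusters = [
    [[0.0, 0.0], [2.0, 0.0]],  # mean (1, 0), contributes 1 + 1
    [[5.0, 5.0], [5.0, 7.0]],  # mean (5, 6), contributes 1 + 1
]
print(squared_error(clusters))  # 4.0
```

A smaller $E$ means the samples lie closer to their cluster means, which is the quantity k-means tries to reduce.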
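Algorithm
Input: sample set $D = \{x_1, x_2, \dots, x_m\}$, number of clusters $k$
The number of clusters is a hyperparameter.
- Randomly select $k$ samples from $D$ as the initial mean vectors $\{\mu_1, \mu_2, \dots, \mu_k\}$
- repeat
  - Set $C_i = \varnothing \ (1 \le i \le k)$
  - for $j = 1, 2, \dots, m$ do
    - Compute the distance between sample $x_j$ and each mean vector $\mu_i \ (1 \le i \le k)$: $d_{ji} = \lVert x_j - \mu_i \rVert_2$
    - Determine the cluster label of $x_j$ from the nearest mean vector: $\lambda_j = \arg\min_{i \in \{1, 2, \dots, k\}} d_{ji}$
    - Assign sample $x_j$ to the corresponding cluster: $C_{\lambda_j} = C_{\lambda_j} \cup \{x_j\}$
  - end for
  - for $i = 1, 2, \dots, k$ do
    - Compute the new mean vector $\mu_i' = \frac{1}{|C_i|} \sum_{x \in C_i} x$
    - if $\mu_i' \ne \mu_i$ then
      - Update the current mean vector $\mu_i$ to $\mu_i'$
    - else
      - Keep the current mean vector unchanged
    - end if
  - end for
- until none of the current mean vectors was updated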
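The steps above can be sketched in vectorized NumPy, where the per-sample loop becomes one distance matrix. This is an illustrative sketch rather than the implementation used later in this post: the function name `kmeans`, its parameters, and the four toy points are assumptions, convergence is tested with `np.allclose` instead of exact equality, and an empty cluster simply keeps its previous center.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (m, n) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Pick k distinct samples as the initial mean vectors.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # d[j, i] = Euclidean distance between sample j and mean vector i.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # index of the nearest mean vector
        new_centers = centers.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # no mean vector changed: converged
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = kmeans(X, k=2)
print(centers)
print(labels)
```

On these four points the two lower-left samples end up in one cluster and the two upper-right samples in the other; which cluster gets index 0 depends on the random initialization.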
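Python code
Data set from the Watermelon Book (Zhou Zhihua, Machine Learning): 30 samples, one per line, each with two continuous attributes (density, sugar content)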
0.697,0.460
0.774,0.376
0.634,0.264
0.608,0.318
0.556,0.215
0.403,0.237
0.481,0.149
0.437,0.211
0.666,0.091
0.243,0.267
0.245,0.057
0.343,0.099
0.639,0.161
0.657,0.198
0.360,0.370
0.593,0.042
0.719,0.103
0.359,0.188
0.339,0.241
0.282,0.257
0.748,0.232
0.714,0.346
0.483,0.312
0.478,0.437
0.525,0.369
0.751,0.489
0.532,0.472
0.473,0.376
0.725,0.445
0.446,0.459
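The file has no header row, which matters when it is loaded with pandas: by default `pd.read_csv` treats the first line as column names, so the first sample would be lost. A small sketch of the difference, with the first three rows inlined through `io.StringIO` so that it runs without the file (the variable names are chosen for this example only):

```python
import io
import pandas as pd

# First three rows of the data set, inlined so the sketch runs without a file.
text = "0.697,0.460\n0.774,0.376\n0.634,0.264\n"

with_default = pd.read_csv(io.StringIO(text))                  # first row becomes the header
without_header = pd.read_csv(io.StringIO(text), header=None)   # every row is data

print(with_default.shape)    # (2, 2)
print(without_header.shape)  # (3, 2)
```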
Implementation
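import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(98765)


def getIntRandom(max_value, n):
    '''
    Randomly sample n distinct non-negative integers.
    :param max_value: exclusive upper bound; values are drawn from 0 .. max_value - 1
    :param n: how many integers to draw
    :return: list of n distinct integers
    '''
    ret = set()
    while len(ret) < n:
        # np.random.randint excludes the upper bound, so pass max_value itself
        rdm = np.random.randint(0, max_value)
        ret.add(int(rdm))
    return list(ret)


def dist_minkowski(x, mu, p=2):
    '''
    Minkowski distance.
    :param x: sample point
    :param mu: cluster center
    :param p: p=2 gives the Euclidean distance
    :return: distance between x and mu
    '''
    # abs() keeps the result correct for odd p as well
    return np.sum(np.abs(x - mu) ** p) ** (1 / p)


def gen_train_data():
    # X_train = np.array(pd.read_excel("./datasets/kmeans数据.xls", header=0))  # samples
    # header=None: the file has no header row, so every line is a sample
    X_train = np.array(pd.read_csv("./datasets/waterlemon.txt", header=None))  # samples
    return X_train


if __name__ == "__main__":
    X_train = gen_train_data()
    print(X_train)
    max_iter = 10000  # maximum number of iterations
    M = X_train.shape[1]  # number of dimensions
    N = X_train.shape[0]  # number of samples
    K = 3  # hyperparameter: number of clusters
    # initialize the cluster centers mu with randomly chosen samples
    int_rdm = getIntRandom(N, K)
    mu = [X_train[i] for i in int_rdm]
    loop = 0
    while True:
        print("iteration = ", loop, " cluster centers = ", mu)
        # cluster C_i; C could also be stored as a matrix so that the steps below become matrix operations
        C = [[] for i in range(K)]
        for j in range(N):
            min_dist = None
            min_k = 0
            for k in range(K):
                # distance between sample x_j and each mean vector mu_i (1 <= i <= k): d_ji = ||x_j - mu_i||_2
                dist = dist_minkowski(X_train[j], mu[k])
                # cluster label of x_j from the nearest mean vector: lambda_j = argmin_i d_ji
                if min_dist is None or dist < min_dist:
                    min_dist = dist
                    min_k = k
            # add sample x_j to its cluster: C_{lambda_j} = C_{lambda_j} ∪ {x_j}
            C[min_k].append(X_train[j])
        stable_k = 0
        for k in range(K):
            if len(C[k]) == 0:
                # an empty cluster has no mean; keep its current center
                stable_k += 1
                continue
            # new mean vector: mu'_i = (1 / |C_i|) * sum of x in C_i
            mu_new = np.sum(np.array(C[k]), axis=0) / len(C[k])
            if not np.array_equal(mu_new, mu[k]):
                mu[k] = mu_new
            else:
                stable_k += 1
        loop += 1
        if loop > max_iter or stable_k == K:
            break
    print("final cluster centers = ", mu)
    colors = np.array(["red", "green", "black"])
    print(C)
    for i in range(len(C)):
        c_i = C[i]
        for c in c_i:
            plt.scatter(c[0], c[1], c=colors[i])
    plt.show()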
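Output
The clusters printed at the end of this recorded run contain 29 points rather than 30. The sample (0.697, 0.460) is absent, which is what `pd.read_csv` produces when it treats the first line of the file as a header (its default behaviour). Reading the file with `header=None` keeps all 30 samples and can lead to different centers.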
iteration = 0 cluster centers = [array([0.748, 0.232]), array([0.657, 0.198]), array([0.483, 0.312])]
iteration = 1 cluster centers = [array([0.7424, 0.3776]), array([0.63771429, 0.15342857]), array([0.41394118, 0.28347059])]
iteration = 2 cluster centers = [array([0.7144, 0.3948]), array([0.6515 , 0.16325]), array([0.4018125, 0.2813125])]
iteration = 3 cluster centers = [array([0.684 , 0.40766667]), array([0.6515 , 0.16325]), array([0.39313333, 0.2686 ])]
iteration = 4 cluster centers = [array([0.66128571, 0.40214286]), array([0.6515 , 0.16325]), array([0.38371429, 0.26142857])]
iteration = 5 cluster centers = [array([0.638375, 0.4065 ]), array([0.6515 , 0.16325]), array([0.37646154, 0.24792308])]
iteration = 6 cluster centers = [array([0.617 , 0.41233333]), array([0.6515 , 0.16325]), array([0.37066667, 0.23033333])]
iteration = 7 cluster centers = [array([0.6026, 0.4087]), array([0.6515 , 0.16325]), array([0.36136364, 0.21709091])]
iteration = 8 cluster centers = [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
final cluster centers = [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
[[array([0.774, 0.376]), array([0.608, 0.318]), array([0.714, 0.346]), array([0.483, 0.312]), array([0.478, 0.437]), array([0.525, 0.369]), array([0.751, 0.489]), array([0.532, 0.472]), array([0.473, 0.376]), array([0.725, 0.445]), array([0.446, 0.459])], [array([0.634, 0.264]), array([0.556, 0.215]), array([0.666, 0.091]), array([0.639, 0.161]), array([0.657, 0.198]), array([0.593, 0.042]), array([0.719, 0.103]), array([0.748, 0.232])], [array([0.403, 0.237]), array([0.481, 0.149]), array([0.437, 0.211]), array([0.243, 0.267]), array([0.245, 0.057]), array([0.343, 0.099]), array([0.36, 0.37]), array([0.359, 0.188]), array([0.339, 0.241]), array([0.282, 0.257])]]
Result visualization