K-means
1. Distance calculation
Given two samples
\(x_i=(x_{i1},x_{i2},...,x_{in}),x_j=(x_{j1},x_{j2},...,x_{jn})\)
Distance for continuous attributes
Minkowski distance
\(dist_{mk}(x_i,x_j)=(\sum\limits_{u=1}^{n}|x_{iu}-x_{ju}|^{p})^{\frac{1}{p}}\)
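For intuition, a minimal NumPy sketch of this formula, using the first two samples of the dataset given later in this post; p=1 gives the Manhattan distance and p=2 the Euclidean distance:
import numpy as np

def minkowski_dist(x_i, x_j, p=2):
    # dist_mk(x_i, x_j) = (sum_u |x_iu - x_ju|^p)^(1/p)
    return np.sum(np.abs(x_i - x_j) ** p) ** (1 / p)

x_i = np.array([0.697, 0.460])
x_j = np.array([0.774, 0.376])
print(minkowski_dist(x_i, x_j, p=1))  # p=1: Manhattan distance
print(minkowski_dist(x_i, x_j, p=2))  # p=2: Euclidean distance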
Distance for discrete attributes
VDM (Value Difference Metric)
Let \(m_{u,a}\) denote the number of samples taking value \(a\) on attribute \(u\),
and \(m_{u,a,i}\) the number of samples in the \(i\)-th cluster taking value \(a\) on attribute \(u\). With \(k\) sample clusters, the VDM distance between two discrete values \(a\) and \(b\) of attribute \(u\) is
\(VDM_{p}(a,b)=\sum\limits_{i=1}^{k}|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}|^p\)
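A minimal sketch of the VDM computation, assuming the per-cluster counts \(m_{u,a,i}\) are already available (the counts below are made up for illustration):
import numpy as np

def vdm(counts_a, counts_b, p=2):
    # counts_a[i] = m_{u,a,i}: samples with value a on attribute u in cluster i;
    # the totals m_{u,a}, m_{u,b} are the sums over all clusters
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    return np.sum(np.abs(counts_a / counts_a.sum() - counts_b / counts_b.sum()) ** p)

# hypothetical counts over k=3 clusters for two values a and b of one attribute
print(vdm([5, 2, 1], [1, 4, 3]))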
Mixed attributes
Suppose there are \(n_c\) continuous attributes and \(n-n_c\) discrete attributes, and arrange the continuous attributes before the discrete ones; then the distance between two samples is
\(MinkovDM_p(x_i,x_j)=(\sum\limits_{u=1}^{n_c}|x_{iu}-x_{ju}|^{p}+\sum\limits_{u=n_c+1}^{n}VDM_p(x_{iu},x_{ju}))^{\frac{1}{p}}\)
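Putting both parts together, a sketch of \(MinkovDM_p\) under the assumption that the first \(n_c\) attributes are continuous and the per-cluster value counts of each discrete attribute are given; the attribute values and counts here are purely illustrative:
import numpy as np

def vdm(counts_a, counts_b, p=2):
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    return np.sum(np.abs(counts_a / counts_a.sum() - counts_b / counts_b.sum()) ** p)

def minkov_dm(x_i, x_j, n_c, value_counts, p=2):
    # x_i, x_j: samples with the n_c continuous attributes first;
    # value_counts[u][a]: per-cluster counts m_{u,a,i} for discrete attribute u and value a
    cont = np.sum(np.abs(np.asarray(x_i[:n_c], dtype=float) -
                         np.asarray(x_j[:n_c], dtype=float)) ** p)
    disc = sum(vdm(value_counts[u][x_i[u]], value_counts[u][x_j[u]], p)
               for u in range(n_c, len(x_i)))
    return (cont + disc) ** (1 / p)

# two samples: two continuous attributes followed by one discrete attribute
x_i = [0.697, 0.460, "dark_green"]
x_j = [0.774, 0.376, "light_white"]
value_counts = {2: {"dark_green": [5, 2, 1], "light_white": [1, 4, 3]}}
print(minkov_dm(x_i, x_j, n_c=2, value_counts=value_counts))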
K-means
Objective
Given a sample set \(D=\{x_1,x_2,...,x_m\}\), the K-means algorithm minimizes the squared error of the resulting cluster partition \(\mathcal{C}=\{\mathcal{C}_1,\mathcal{C}_2,...,\mathcal{C}_k\}\)
\(E=\sum\limits_{i=1}^{k}\sum\limits_{x\in \mathcal{C}_i}||x-\mu_i||_2^2\)
where \(\mu_i=\frac{1}{|\mathcal{C}_i|}\sum_{x\in \mathcal{C}_i} x\) is the mean vector of cluster \(\mathcal{C}_i\)
and \(|\mathcal{C}_i|\) denotes the number of samples in cluster \(\mathcal{C}_i\)
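As a sanity check, a small sketch that evaluates this squared error \(E\) for a given partition; the two toy clusters below are built from points of the dataset used later in this post:
import numpy as np

def squared_error(clusters):
    # clusters: list of arrays, each array holding the samples of one cluster C_i
    E = 0.0
    for C_i in clusters:
        mu_i = C_i.mean(axis=0)  # mean vector of cluster C_i
        E += np.sum(np.linalg.norm(C_i - mu_i, axis=1) ** 2)
    return E

clusters = [np.array([[0.697, 0.460], [0.774, 0.376]]),
            np.array([[0.245, 0.057], [0.343, 0.099]])]
print(squared_error(clusters))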
Algorithm
Input: sample set \(D=\{x_1,x_2,...,x_m\}\), number of clusters \(k\)
The number of clusters \(k\) is a hyperparameter.
- Randomly select \(k\) samples from \(D\) as the initial mean vectors \(\{\mu_1,\mu_2,...,\mu_k\}\)
- repeat
  - Set \(\mathcal{C}_i=\emptyset\,(1\le i\le k)\)
  - for \(j=1,2,...,m\) do
    - Compute the distance between sample \(x_j\) and each mean vector \(\mu_i\,(1\le i\le k)\): \(d_{ji}=||x_j-\mu_i||_2\)
    - Determine the cluster label of \(x_j\) from the nearest mean vector: \(\lambda_j=\mathop{\arg\min}_{i\in \{1,2,...,k\}}d_{ji}\)
    - Assign sample \(x_j\) to the corresponding cluster: \(\mathcal{C}_{\lambda_j}=\mathcal{C}_{\lambda_j}\cup \{x_j\}\)
  - end for
  - for \(i=1,2,...,k\) do
    - Compute the new mean vector \(\mu_i'=\frac{1}{|\mathcal{C}_i|}\sum_{x\in \mathcal{C}_i}x\)
    - if \(\mu_i'\ne \mu_i\) then
      - Update the current mean vector \(\mu_i\) to \(\mu_i'\)
    - else
      - Keep the current mean vector unchanged
    - end if
  - end for
- until none of the current mean vectors has been updated
Python code
Watermelon book (西瓜书) dataset
0.697,0.460
0.774,0.376
0.634,0.264
0.608,0.318
0.556,0.215
0.403,0.237
0.481,0.149
0.437,0.211
0.666,0.091
0.243,0.267
0.245,0.057
0.343,0.099
0.639,0.161
0.657,0.198
0.360,0.370
0.593,0.042
0.719,0.103
0.359,0.188
0.339,0.241
0.282,0.257
0.748,0.232
0.714,0.346
0.483,0.312
0.478,0.437
0.525,0.369
0.751,0.489
0.532,0.472
0.473,0.376
0.725,0.445
0.446,0.459
Code implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(98765)


def getIntRandom(max, n):
    '''
    Randomly sample a set of distinct non-negative integers.
    :param max: upper bound; values are drawn from 0 to max-1
    :param n: number of integers to draw
    :return: a list of n distinct integers
    '''
    ret = set()
    while len(ret) < n:
        # randint's upper bound is exclusive, so use max (not max - 1)
        # so that every index 0..max-1 can be drawn
        rdm = np.random.randint(0, max)
        ret.add(rdm)
    return list(ret)
def dist_minkowski(x, mu, p=2):
    '''
    Minkowski distance between a sample and a center.
    :param x: sample point
    :param mu: center point
    :param p: order of the norm; p=2 gives the Euclidean distance
    :return: the distance value
    '''
    # take the absolute value of each difference so the formula is also correct for odd p
    return np.sum(np.abs(x - mu) ** p) ** (1 / p)
def gen_train_data():
    # X_train = np.array(pd.read_excel("./datasets/kmeans数据.xls", header=0))  # samples
    # the txt file has no header row, so pass header=None to keep the first sample
    X_train = np.array(pd.read_csv("./datasets/waterlemon.txt", header=None))  # samples
    return X_train
if __name__ == "__main__":
    X_train = gen_train_data()
    print(X_train)
    max_iter = 10000  # maximum number of iterations
    M = X_train.shape[1]  # number of attributes (dimensions)
    N = X_train.shape[0]  # number of samples
    K = 3  # hyperparameter: number of clusters
    # initialize the cluster centers mu by random selection
    int_rdm = getIntRandom(N, K)
    mu = [X_train[i] for i in int_rdm]
    loop = 0
    while True:
        print("Iteration = ", loop, " cluster centers = ", mu)
        C = [[] for i in range(K)]  # clusters C_i; using a matrix instead would allow vectorized updates below
        for j in range(N):
            min_dist = None
            min_k = 0
            for k in range(K):
                # compute the distance between sample x_j and each mean vector mu_i (1<=i<=k): d_ji = ||x_j - mu_i||_2
                dist = dist_minkowski(X_train[j], mu[k])
                # determine the cluster label of x_j from the nearest mean vector: lambda_j = argmin_{i in {1,...,k}} d_ji
                if min_dist is None or dist < min_dist:
                    min_dist = dist
                    min_k = k
            # assign sample x_j to the corresponding cluster: C_{lambda_j} = C_{lambda_j} ∪ {x_j}
            C[min_k].append(X_train[j])
        stable_k = 0
        for k in range(K):
            # compute the new mean vector mu_i' = (1/|C_i|) * sum_{x in C_i} x
            mu_new = np.sum(np.array(C[k]), axis=0) / len(C[k])
            if not np.array_equal(mu_new, mu[k]):
                mu[k] = mu_new
            else:
                stable_k += 1
        loop += 1
        if loop > max_iter or stable_k == K:
            break
    print("Final cluster centers = ", mu)
    colors = np.array(["red", "green", "black"])
    print(C)
    for i in range(len(C)):
        c_i = C[i]
        for c in c_i:
            plt.scatter(c[0], c[1], c=colors[i])
    plt.show()
Output
Iteration = 0 cluster centers = [array([0.748, 0.232]), array([0.657, 0.198]), array([0.483, 0.312])]
Iteration = 1 cluster centers = [array([0.7424, 0.3776]), array([0.63771429, 0.15342857]), array([0.41394118, 0.28347059])]
Iteration = 2 cluster centers = [array([0.7144, 0.3948]), array([0.6515 , 0.16325]), array([0.4018125, 0.2813125])]
Iteration = 3 cluster centers = [array([0.684 , 0.40766667]), array([0.6515 , 0.16325]), array([0.39313333, 0.2686 ])]
Iteration = 4 cluster centers = [array([0.66128571, 0.40214286]), array([0.6515 , 0.16325]), array([0.38371429, 0.26142857])]
Iteration = 5 cluster centers = [array([0.638375, 0.4065 ]), array([0.6515 , 0.16325]), array([0.37646154, 0.24792308])]
Iteration = 6 cluster centers = [array([0.617 , 0.41233333]), array([0.6515 , 0.16325]), array([0.37066667, 0.23033333])]
Iteration = 7 cluster centers = [array([0.6026, 0.4087]), array([0.6515 , 0.16325]), array([0.36136364, 0.21709091])]
Iteration = 8 cluster centers = [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
Final cluster centers = [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
[[array([0.774, 0.376]), array([0.608, 0.318]), array([0.714, 0.346]), array([0.483, 0.312]), array([0.478, 0.437]), array([0.525, 0.369]), array([0.751, 0.489]), array([0.532, 0.472]), array([0.473, 0.376]), array([0.725, 0.445]), array([0.446, 0.459])], [array([0.634, 0.264]), array([0.556, 0.215]), array([0.666, 0.091]), array([0.639, 0.161]), array([0.657, 0.198]), array([0.593, 0.042]), array([0.719, 0.103]), array([0.748, 0.232])], [array([0.403, 0.237]), array([0.481, 0.149]), array([0.437, 0.211]), array([0.243, 0.267]), array([0.245, 0.057]), array([0.343, 0.099]), array([0.36, 0.37]), array([0.359, 0.188]), array([0.339, 0.241]), array([0.282, 0.257])]]
Result visualization
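For comparison, a minimal sketch that runs scikit-learn's KMeans on the same data file; this assumes scikit-learn is installed, and the n_init and random_state values are just illustrative:
import numpy as np
from sklearn.cluster import KMeans

# the data file has no header row, one "density,sugar_content" pair per line
X = np.loadtxt("./datasets/waterlemon.txt", delimiter=",")

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned mean vectors
print(km.labels_)           # cluster label of each sample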