K-means

1. Distance computation

Given two samples
\(x_i=(x_{i1},x_{i2},...,x_{in}),x_j=(x_{j1},x_{j2},...,x_{jn})\)

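Distance for continuous attributes

Minkowski distance (\(p=2\) gives the Euclidean distance, \(p=1\) the Manhattan distance)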
\(dist_{mk}(x_i,x_j)=(\sum\limits_{u=1}^{n}|x_{iu}-x_{ju}|^{p})^{\frac{1}{p}}\)
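The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation; the function name `minkowski_distance` is an assumption, and the two sample points are the first two rows of the dataset listed later in this post.

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance between two equal-length numeric vectors."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    # The absolute value matters for odd p, where differences may be negative
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))

a = [0.697, 0.460]
b = [0.774, 0.376]
d1 = minkowski_distance(a, b, p=1)  # Manhattan distance
d2 = minkowski_distance(a, b, p=2)  # Euclidean distance
print(d1, d2)
```

Taking the absolute value before raising to the power \(p\) keeps the result valid for odd \(p\) as well.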

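Distance for discrete attributes

VDM (Value Difference Metric) distance
Let \(m_{u,a}\) denote the number of samples that take value \(a\) on attribute \(u\),
and \(m_{u,a,i}\) the number of samples in the \(i\)-th cluster that take value \(a\) on attribute \(u\). With \(k\) the number of clusters, the VDM distance between two discrete values \(a\) and \(b\) on attribute \(u\) is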
\(VDM_{p}(a,b)=\sum\limits_{i=1}^{k}|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}|^p\)
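As a rough sketch of how the VDM formula could be evaluated for a single discrete attribute, the counts \(m_{u,a}\) and \(m_{u,a,i}\) can be collected with `collections.Counter`. The function name `vdm` and the toy colour data are hypothetical, and the sketch assumes both values actually occur in the data (otherwise \(m_{u,a}\) would be zero).

```python
from collections import Counter

def vdm(values, clusters, a, b, p=2):
    """VDM_p(a, b) for one discrete attribute.

    values[s]   : value of the attribute for sample s
    clusters[s] : cluster index of sample s
    """
    m = Counter(values)                    # m_{u,a}: samples with value a
    m_c = Counter(zip(values, clusters))   # m_{u,a,i}: same, within cluster i
    total = 0.0
    for i in set(clusters):
        total += abs(m_c[(a, i)] / m[a] - m_c[(b, i)] / m[b]) ** p
    return total

# Hypothetical toy data: one colour attribute, two clusters
values = ["red", "red", "red", "green", "green", "blue"]
clusters = [0, 0, 1, 1, 1, 0]
print(vdm(values, clusters, "red", "green"))
```

Here "red" falls in cluster 0 with frequency 2/3 and "green" with frequency 0, so each of the two clusters contributes \((2/3)^2\) to the sum.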

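Mixed attributes

Suppose there are \(n_c\) continuous attributes and \(n-n_c\) discrete attributes, with the continuous attributes ordered before the discrete ones. The distance between two samples is then
\(MinkovDM_p(x_i,x_j)=(\sum\limits_{u=1}^{n_c}|x_{iu}-x_{ju}|^{p}+\sum\limits_{u=n_c+1}^{n}VDM_p(x_{iu},x_{ju}))^{\frac{1}{p}}\)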
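The mixed distance can be sketched by adding the continuous terms and the per-attribute VDM terms before taking the \(1/p\) power. The function `minkov_dm` and the toy samples below are assumptions for illustration only; the continuous attributes are assumed to come first in each sample, as stated above.

```python
from collections import Counter

def minkov_dm(X, clusters, i, j, n_c, p=2):
    """MinkovDM_p between samples X[i] and X[j].

    X        : list of samples; the first n_c attributes are continuous,
               the remaining ones are discrete
    clusters : cluster index of each sample (needed by the VDM terms)
    """
    n = len(X[0])
    cluster_ids = set(clusters)
    # Continuous part: sum over u of |x_iu - x_ju|^p
    total = sum(abs(X[i][u] - X[j][u]) ** p for u in range(n_c))
    # Discrete part: sum over u of VDM_p(x_iu, x_ju)
    for u in range(n_c, n):
        m = Counter(row[u] for row in X)
        m_c = Counter((row[u], c) for row, c in zip(X, clusters))
        a, b = X[i][u], X[j][u]
        total += sum(abs(m_c[(a, c)] / m[a] - m_c[(b, c)] / m[b]) ** p
                     for c in cluster_ids)
    return total ** (1.0 / p)

# Hypothetical toy data: one continuous and one discrete attribute
X = [(0.1, "red"), (0.2, "red"), (0.9, "red"),
     (0.8, "green"), (0.7, "green"), (0.3, "blue")]
clusters = [0, 0, 1, 1, 1, 0]
d = minkov_dm(X, clusters, 0, 3, n_c=1)
print(d)
```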

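K-means

Objective

Given a sample set \(D=\{x_1,x_2,...,x_m\}\), the K-means algorithm minimizes the squared error of the cluster partition \(\mathcal{C}=\{\mathcal{C}_1,\mathcal{C}_2,...,\mathcal{C}_k\}\) produced by clustering: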

\[E=\sum\limits_{i=1}^{k}\sum\limits_{x\in \mathcal{C}_i}||x-\mu_i||_2^2 \]

where \(\mu_i=\frac{1}{|\mathcal{C}_i|}\sum_{x\in \mathcal{C}_i} x\) is the mean vector of cluster \(\mathcal{C}_i\).


The notation \(|\mathcal{C}_i|\) denotes the number of elements in cluster \(\mathcal{C}_i\).
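To make the objective concrete, the following sketch evaluates \(E\) for a given assignment of samples to clusters. The helper `squared_error` and the four toy points are hypothetical; skipping empty clusters is an assumption, since their mean is undefined.

```python
import numpy as np

def squared_error(X, labels, k):
    """E: sum over clusters of squared Euclidean distances to the cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    E = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        mu_i = members.mean(axis=0)
        E += np.sum((members - mu_i) ** 2)
    return float(E)

# Four corners of a 4 x 2 rectangle, partitioned two different ways
X = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
E_good = squared_error(X, [0, 0, 1, 1], k=2)  # group the points along the short side
E_bad = squared_error(X, [0, 1, 0, 1], k=2)   # group the points along the long side
print(E_good, E_bad)
```

The partition that groups nearby points has the smaller \(E\), which is what K-means tries to achieve.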

Algorithm

Input: sample set \(D=\{x_1,x_2,...,x_m\}\), number of clusters \(k\)


The number of clusters is a hyperparameter.

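  1. Randomly select k samples from \(D\) as the initial mean vectors \(\{\mu_1,\mu_2,...,\mu_k\}\)
  2. repeat
    • \(\mathcal{C}_i=\varnothing\ (1\le i\le k)\)
    • for j=1,2,...,m do
      • Compute the distance between sample \(x_j\) and each mean vector \(\mu_i\ (1\le i\le k)\): \(d_{ji}=||x_j-\mu_i||_2\)
      • Determine the cluster label of \(x_j\) from the nearest mean vector: \(\lambda_j=\arg\min_{i\in \{1,2,...,k\}}d_{ji}\)
      • Assign sample \(x_j\) to the corresponding cluster: \(\mathcal{C}_{\lambda_j}=\mathcal{C}_{\lambda_j}\cup \{x_j\}\)
    • end for
    • for i=1,2,...,k do
      • Compute the new mean vector \(\mu_i'=\frac{1}{|\mathcal{C}_i|}\sum_{x\in \mathcal{C}_i}x\)
      • if \(\mu_i'\ne \mu_i\) then
        • Update the current mean vector \(\mu_i\) to \(\mu_i'\)
      • else
        • Keep the current mean vector unchanged
      • end if
    • end for
      until no mean vector was updated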
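The steps above can also be sketched in vectorized NumPy. This is only one possible reading of the pseudocode, under a few assumptions that are not in the original: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of `np.allclose` as the "no mean vector was updated" test, and the choice to leave the mean of an empty cluster unchanged.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial mean vectors
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: d[j, i] = ||x_j - mu_i||_2, lambda_j = argmin_i d[j, i]
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update: mean of each cluster; an empty cluster keeps its old mean
        new_mu = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
            for i in range(k)
        ])
        if np.allclose(new_mu, mu):
            break  # no mean vector changed, so the partition is stable
        mu = new_mu
    # Recompute the labels so that they match the returned means
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    return mu, d.argmin(axis=1)

# Two well-separated groups of three points each
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
mu, labels = kmeans(X, k=2)
print(mu)
print(labels)
```

Broadcasting `X[:, None, :] - mu[None, :, :]` computes all sample-to-center differences at once, which replaces the two nested loops of the pseudocode. Which group receives label 0 depends on the random initialization.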

Python code

Watermelon Book dataset


0.697,0.460
0.774,0.376
0.634,0.264
0.608,0.318
0.556,0.215
0.403,0.237
0.481,0.149
0.437,0.211
0.666,0.091
0.243,0.267
0.245,0.057
0.343,0.099
0.639,0.161
0.657,0.198
0.360,0.370
0.593,0.042
0.719,0.103
0.359,0.188
0.339,0.241
0.282,0.257
0.748,0.232
0.714,0.346
0.483,0.312
0.478,0.437
0.525,0.369
0.751,0.489
0.532,0.472
0.473,0.376
0.725,0.445
0.446,0.459

Implementation


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(98765)


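def getIntRandom(upper, n):
    '''
    Randomly sample n distinct non-negative integers.
    :param upper: exclusive upper bound; values are drawn from 0 .. upper-1
    :param n: number of integers to draw (must not exceed upper)
    :return: list of n distinct integers
    '''
    ret = set()
    while len(ret) < n:
        # np.random.randint excludes its upper bound, so pass upper (not upper - 1)
        rdm = np.random.randint(0, upper)
        ret.add(rdm)
    return list(ret)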


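def dist_minkowski(x, mu, p=2):
    '''
    Compute the Minkowski distance.
    :param x: sample point
    :param mu: center point
    :param p: order of the distance; p=2 gives the Euclidean distance
    :return: the distance between x and mu
    '''
    # Take the absolute value so that odd p also yields a valid distance
    return np.sum(np.abs(x - mu) ** p) ** (1 / p)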


def gen_train_data():
    # X_train = np.array(pd.read_excel("./datasets/kmeans数据.xls", header=0))  # samples
    # header=None: the file has no header row, so the first line is data too
    X_train = np.array(pd.read_csv("./datasets/waterlemon.txt", header=None))  # samples
    return X_train


if __name__ == "__main__":
    X_train = gen_train_data()
    print(X_train)

    max_iter = 10000  # maximum number of iterations

    M = X_train.shape[1]  # number of dimensions
    N = X_train.shape[0]  # number of samples
    K = 3  # hyperparameter: number of clusters

    # Initialize the cluster centers mu with K randomly chosen samples
    int_rdm = getIntRandom(N, K)
    mu = [X_train[i] for i in int_rdm]

    loop = 0
    while True:
        print("iteration = ", loop, " cluster centers = ", mu)
        C = [[] for i in range(K)]  # clusters C_i; storing them as a matrix would allow vectorized operations below
        for j in range(N):
            min_dist = None
            min_k = 0
            for k in range(K):
                # Distance between sample x_j and mean vector mu_k: d_jk = ||x_j - mu_k||_2
                dist = dist_minkowski(X_train[j], mu[k])
                # Track the nearest mean vector: lambda_j = argmin_k d_jk
                if min_dist is None or dist < min_dist:
                    min_dist = dist
                    min_k = k
            # Assign sample x_j to its cluster: C_{lambda_j} = C_{lambda_j} ∪ {x_j}
            C[min_k].append(X_train[j])

        stable_k = 0
        for k in range(K):
            if len(C[k]) == 0:
                # An empty cluster has no mean; keep its current center
                stable_k += 1
                continue
            # New mean vector: mu'_k = (1 / |C_k|) * sum of x in C_k
            mu_new = np.sum(np.array(C[k]), axis=0) / len(C[k])
            if not np.array_equal(mu_new, mu[k]):
                mu[k] = mu_new
            else:
                stable_k += 1

        loop += 1
        if loop > max_iter or stable_k == K:
            break

    print("final cluster centers = ", mu)

    colors = np.array(["red", "green", "black"])
    print(C)
    for i in range(len(C)):
        c_i = C[i]
        for c in c_i:
            plt.scatter(c[0], c[1], c=colors[i])
    plt.show()


Output (from a run in which the first data row was read as the CSV header, so 29 of the 30 samples were clustered)

iteration =  0  cluster centers =  [array([0.748, 0.232]), array([0.657, 0.198]), array([0.483, 0.312])]
iteration =  1  cluster centers =  [array([0.7424, 0.3776]), array([0.63771429, 0.15342857]), array([0.41394118, 0.28347059])]
iteration =  2  cluster centers =  [array([0.7144, 0.3948]), array([0.6515 , 0.16325]), array([0.4018125, 0.2813125])]
iteration =  3  cluster centers =  [array([0.684     , 0.40766667]), array([0.6515 , 0.16325]), array([0.39313333, 0.2686    ])]
iteration =  4  cluster centers =  [array([0.66128571, 0.40214286]), array([0.6515 , 0.16325]), array([0.38371429, 0.26142857])]
iteration =  5  cluster centers =  [array([0.638375, 0.4065  ]), array([0.6515 , 0.16325]), array([0.37646154, 0.24792308])]
iteration =  6  cluster centers =  [array([0.617     , 0.41233333]), array([0.6515 , 0.16325]), array([0.37066667, 0.23033333])]
iteration =  7  cluster centers =  [array([0.6026, 0.4087]), array([0.6515 , 0.16325]), array([0.36136364, 0.21709091])]
iteration =  8  cluster centers =  [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
final cluster centers =  [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
[[array([0.774, 0.376]), array([0.608, 0.318]), array([0.714, 0.346]), array([0.483, 0.312]), array([0.478, 0.437]), array([0.525, 0.369]), array([0.751, 0.489]), array([0.532, 0.472]), array([0.473, 0.376]), array([0.725, 0.445]), array([0.446, 0.459])], [array([0.634, 0.264]), array([0.556, 0.215]), array([0.666, 0.091]), array([0.639, 0.161]), array([0.657, 0.198]), array([0.593, 0.042]), array([0.719, 0.103]), array([0.748, 0.232])], [array([0.403, 0.237]), array([0.481, 0.149]), array([0.437, 0.211]), array([0.243, 0.267]), array([0.245, 0.057]), array([0.343, 0.099]), array([0.36, 0.37]), array([0.359, 0.188]), array([0.339, 0.241]), array([0.282, 0.257])]]


Result visualization

posted @ 2022-04-06 21:59  筷点雪糕侠