K-means

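1. Distance computation

Given two samples
$x_i=(x_{i1},x_{i2},\dots,x_{in})$ and $x_j=(x_{j1},x_{j2},\dots,x_{jn})$

Distance for continuous attributes

Minkowski distance
$$\mathrm{dist}_{mk}(x_i,x_j)=\Big(\sum_{u=1}^{n}|x_{iu}-x_{ju}|^p\Big)^{\frac{1}{p}}$$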
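
To make the formula concrete, here is a minimal NumPy sketch. The function name `minkowski` and the two sample vectors are made up for this illustration. With p = 1 the formula reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.

```python
import numpy as np


def minkowski(xi, xj, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    # sum |x_iu - x_ju|^p over all attributes u, then take the p-th root
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))


a = [0.0, 0.0]
b = [3.0, 4.0]
d1 = minkowski(a, b, p=1)  # Manhattan: 3 + 4
d2 = minkowski(a, b, p=2)  # Euclidean: sqrt(9 + 16)
print(d1, d2)
```

The absolute value matters: without it, an odd p would let negative differences cancel positive ones.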

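Distance for discrete attributes

VDM (Value Difference Metric) distance
Let $m_{u,a}$ be the number of samples that take value $a$ on attribute $u$, and $m_{u,a,i}$ the number of samples in the $i$-th cluster that take value $a$ on attribute $u$, with $k$ the number of clusters. The VDM distance between two discrete values $a$ and $b$ on attribute $u$ is
$$\mathrm{VDM}_p(a,b)=\sum_{i=1}^{k}\left|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}\right|^p$$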
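
The counting behind VDM is easy to get wrong, so here is a small sketch under stated assumptions: the function name `vdm`, the attribute values, and the cluster labels are all hypothetical toy data, and both values `a` and `b` are assumed to occur at least once (otherwise the ratio is undefined).

```python
from collections import Counter


def vdm(values, clusters, a, b, p=2):
    """VDM_p(a, b) for one discrete attribute.

    values[s]   : value of the attribute for sample s
    clusters[s] : cluster index of sample s
    """
    m = Counter(values)                    # m_{u,a}: samples with value a
    m_in = Counter(zip(values, clusters))  # m_{u,a,i}: same, restricted to cluster i
    total = 0.0
    for i in set(clusters):
        total += abs(m_in[(a, i)] / m[a] - m_in[(b, i)] / m[b]) ** p
    return total


# toy data: one attribute with three possible values, two clusters
values = ["green", "green", "green", "black", "black", "white"]
clusters = [0, 0, 1, 1, 1, 0]
d = vdm(values, clusters, "green", "black", p=2)
print(d)
```

Here "green" is split 2/3 and 1/3 across the two clusters while "black" is split 0 and 1, so each cluster contributes (2/3)^2 and the result is 8/9.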

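Mixed attributes

Suppose there are $n_c$ continuous attributes and $n-n_c$ discrete attributes, with the continuous attributes ordered before the discrete ones. The distance between two samples is then
$$\mathrm{MinkovDM}_p(x_i,x_j)=\Big(\sum_{u=1}^{n_c}|x_{iu}-x_{ju}|^p+\sum_{u=n_c+1}^{n}\mathrm{VDM}_p(x_{iu},x_{ju})\Big)^{\frac{1}{p}}$$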
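
One possible way to combine the two parts in code is sketched below. It is only an illustration: the helper names `vdm_p` and `minkov_dm`, the split into a continuous array `X_cont` and a discrete list `X_disc`, and the toy data are assumptions of this example, not part of the original post.

```python
import numpy as np
from collections import Counter


def vdm_p(col, clusters, a, b, p):
    """VDM_p(a, b) on one discrete column, given the cluster labels."""
    m = Counter(col)
    m_in = Counter(zip(col, clusters))
    return sum(abs(m_in[(a, i)] / m[a] - m_in[(b, i)] / m[b]) ** p
               for i in set(clusters))


def minkov_dm(i, j, X_cont, X_disc, clusters, p=2):
    """MinkovDM_p between samples i and j.

    X_cont : (m, n_c) float array holding the continuous attributes
    X_disc : list of m tuples holding the n - n_c discrete attributes
    """
    # continuous part: sum of |x_iu - x_ju|^p
    cont = np.sum(np.abs(X_cont[i] - X_cont[j]) ** p)
    # discrete part: sum of VDM_p over the discrete attributes
    disc = 0.0
    for u in range(len(X_disc[0])):
        col = [row[u] for row in X_disc]
        disc += vdm_p(col, clusters, X_disc[i][u], X_disc[j][u], p)
    return float((cont + disc) ** (1.0 / p))


X_cont = np.array([[0.0], [3.0], [1.0], [2.0]])
X_disc = [("a",), ("b",), ("a",), ("b",)]
clusters = [0, 1, 0, 1]
d = minkov_dm(0, 1, X_cont, X_disc, clusters, p=2)
print(d)
```

In this toy case the continuous part is 3^2 = 9 and the discrete part is 1 + 1 = 2, so the distance is sqrt(11).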

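K-means

Objective

Given a sample set $D=\{x_1,x_2,\dots,x_m\}$, the k-means algorithm seeks a cluster partition $C=\{C_1,C_2,\dots,C_k\}$ that minimizes the squared error

$$E=\sum_{i=1}^{k}\sum_{x\in C_i}\|x-\mu_i\|_2^2$$

where $\mu_i=\frac{1}{|C_i|}\sum_{x\in C_i}x$ is the mean vector of cluster $C_i$.


$|C_i|$ denotes the number of elements in cluster $C_i$.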
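
The objective can be evaluated directly for any given partition. The sketch below uses a hypothetical helper `squared_error` and two tiny hand-made clusters, just to show which quantities enter the sum.

```python
import numpy as np


def squared_error(clusters):
    """E = sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        pts = np.asarray(pts, dtype=float)
        mu = pts.mean(axis=0)              # mean vector mu_i of this cluster
        total += np.sum((pts - mu) ** 2)   # sum of ||x - mu_i||_2^2 over the cluster
    return float(total)


C1 = [[0.0, 0.0], [2.0, 0.0]]  # mean (1, 0); each point contributes 1
C2 = [[5.0, 5.0]]              # a single point equals its own mean; contributes 0
E = squared_error([C1, C2])
print(E)
```

A smaller E means the samples sit closer to their cluster means, which is exactly what the algorithm tries to achieve.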

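Algorithm

Input: sample set $D=\{x_1,x_2,\dots,x_m\}$, number of clusters $k$


The number of clusters is a hyperparameter.

  1. Randomly select $k$ samples from $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$
  2. repeat
    • Set $C_i=\varnothing\ (1\le i\le k)$
    • for j = 1, 2, ..., m do
      • Compute the distance between sample $x_j$ and each mean vector $\mu_i\ (1\le i\le k)$: $d_{ji}=\|x_j-\mu_i\|_2$
      • Determine the cluster label of $x_j$ from the nearest mean vector: $\lambda_j=\arg\min_{i\in\{1,2,\dots,k\}}d_{ji}$
      • Put sample $x_j$ into the corresponding cluster: $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$
    • end for
    • for i = 1, 2, ..., k do
      • Compute the new mean vector $\mu_i'=\frac{1}{|C_i|}\sum_{x\in C_i}x$
      • if $\mu_i'\neq\mu_i$ then
        • Update the current mean vector $\mu_i$ to $\mu_i'$
      • else
        • Keep the current mean vector unchanged
      • end if
    • end for
      until no mean vector was updated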
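
The pseudocode above can also be written compactly with NumPy broadcasting. The following is a sketch, not the implementation from this post: the names `kmeans`, `max_iter`, and `seed` are choices made for this example, the stopping test uses `np.allclose` (a small tolerance) instead of exact inequality, and an empty cluster simply keeps its previous center.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (m, n) float array X; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # step 1: k distinct samples as the initial mean vectors
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # d[j, i] = ||x_j - mu_i||_2 for every sample / center pair
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # lambda_j = argmin_i d_ji
        new_mu = mu.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:   # an empty cluster keeps its old center
                new_mu[i] = members.mean(axis=0)
        if np.allclose(new_mu, mu):  # no mean vector changed: stop
            break
        mu = new_mu
    return mu, labels


# two well-separated toy groups (made-up data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu, labels = kmeans(X, k=2)
print(mu)
print(labels)
```

Computing all distances in one broadcasted expression replaces the two nested loops over samples and centers, at the cost of an (m, k, n) temporary array.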

Python code

Watermelon book dataset


0.697,0.460
0.774,0.376
0.634,0.264
0.608,0.318
0.556,0.215
0.403,0.237
0.481,0.149
0.437,0.211
0.666,0.091
0.243,0.267
0.245,0.057
0.343,0.099
0.639,0.161
0.657,0.198
0.360,0.370
0.593,0.042
0.719,0.103
0.359,0.188
0.339,0.241
0.282,0.257
0.748,0.232
0.714,0.346
0.483,0.312
0.478,0.437
0.525,0.369
0.751,0.489
0.532,0.472
0.473,0.376
0.725,0.445
0.446,0.459
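
The file has no header line, which matters when loading it with pandas. The sketch below uses only the first three rows of the data, held in an in-memory string (the variable names are made up for this example), to show how `pd.read_csv` behaves with and without `header=None`.

```python
import io

import pandas as pd

# first three rows of the dataset, no header line
sample = "0.697,0.460\n0.774,0.376\n0.634,0.264\n"

# default: the first line is consumed as column names, leaving 2 data rows
with_header = pd.read_csv(io.StringIO(sample))

# header=None: every line is treated as data, giving 3 rows
no_header = pd.read_csv(io.StringIO(sample), header=None)

print(with_header.shape, no_header.shape)
```

With the default setting the first sample silently disappears from the data, so `header=None` is the safer choice for a file like this one.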

Implementation


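import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(98765)


def getIntRandom(high, n):
    '''
    Draw n distinct random non-negative integers.
    :param high: exclusive upper bound; values are drawn from 0 .. high - 1
    :param n: how many distinct integers to draw (must not exceed high)
    :return: list of n distinct integers
    '''
    ret = set()
    while len(ret) < n:
        # np.random.randint excludes the upper bound, so pass high itself
        rdm = np.random.randint(0, high)
        ret.add(rdm)
    return list(ret)


def dist_minkowski(x, mu, p=2):
    '''
    Minkowski distance between a sample and a center.
    :param x: sample point
    :param mu: center point
    :param p: order of the distance; p=2 gives the Euclidean distance
    :return: the distance
    '''
    # the absolute value keeps the result valid for odd p as well
    return np.sum(np.abs(x - mu) ** p) ** (1 / p)


def gen_train_data():
    # X_train = np.array(pd.read_excel("./datasets/kmeans数据.xls", header=0))  # samples
    # header=None: the file has no header line, so the first row is data too
    X_train = np.array(pd.read_csv("./datasets/waterlemon.txt", header=None))  # samples
    return X_train


if __name__ == "__main__":
    X_train = gen_train_data()
    print(X_train)

    max_iter = 10000  # maximum number of iterations

    M = X_train.shape[1]  # number of features
    N = X_train.shape[0]  # number of samples
    K = 3  # hyperparameter: number of clusters

    # initialize the cluster centers mu with K randomly chosen samples
    int_rdm = getIntRandom(N, K)
    mu = [X_train[i] for i in int_rdm]

    loop = 0
    while True:
        print("iteration = ", loop, " cluster centers = ", mu)
        # clusters C_i; a label array would allow vectorized updates instead
        C = [[] for _ in range(K)]
        for j in range(N):
            min_dist = None
            min_k = 0
            for k in range(K):
                # distance between sample x_j and mean vector mu_k
                dist = dist_minkowski(X_train[j], mu[k])
                # track the nearest mean vector: lambda_j = argmin_k d_jk
                if min_dist is None or dist < min_dist:
                    min_dist = dist
                    min_k = k
            # put sample x_j into its cluster C_{lambda_j}
            C[min_k].append(X_train[j])

        stable_k = 0
        for k in range(K):
            if len(C[k]) == 0:
                # empty cluster: keep the current center, avoid dividing by zero
                stable_k += 1
                continue
            # new mean vector mu'_k = (1 / |C_k|) * sum of the samples in C_k
            mu_new = np.sum(np.array(C[k]), axis=0) / len(C[k])
            if not np.allclose(mu_new, mu[k]):
                mu[k] = mu_new
            else:
                stable_k += 1

        loop += 1
        if loop >= max_iter or stable_k == K:
            break

    print("final cluster centers = ", mu)

    colors = np.array(["red", "green", "black"])
    print(C)
    for i in range(len(C)):
        if len(C[i]) == 0:
            continue
        pts = np.array(C[i])
        # one scatter call per cluster instead of one per point
        plt.scatter(pts[:, 0], pts[:, 1], c=colors[i])
    plt.show()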


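Output

The log below comes from a run that clustered 29 samples: the first data row (0.697, 0.460) is absent from the printed clusters, which is what `pd.read_csv` produces when it is called without `header=None` and consumes the first line as a header. With all 30 rows loaded, the centers and the iteration count may differ.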


iteration =  0  cluster centers =  [array([0.748, 0.232]), array([0.657, 0.198]), array([0.483, 0.312])]
iteration =  1  cluster centers =  [array([0.7424, 0.3776]), array([0.63771429, 0.15342857]), array([0.41394118, 0.28347059])]
iteration =  2  cluster centers =  [array([0.7144, 0.3948]), array([0.6515 , 0.16325]), array([0.4018125, 0.2813125])]
iteration =  3  cluster centers =  [array([0.684     , 0.40766667]), array([0.6515 , 0.16325]), array([0.39313333, 0.2686    ])]
iteration =  4  cluster centers =  [array([0.66128571, 0.40214286]), array([0.6515 , 0.16325]), array([0.38371429, 0.26142857])]
iteration =  5  cluster centers =  [array([0.638375, 0.4065  ]), array([0.6515 , 0.16325]), array([0.37646154, 0.24792308])]
iteration =  6  cluster centers =  [array([0.617     , 0.41233333]), array([0.6515 , 0.16325]), array([0.37066667, 0.23033333])]
iteration =  7  cluster centers =  [array([0.6026, 0.4087]), array([0.6515 , 0.16325]), array([0.36136364, 0.21709091])]
iteration =  8  cluster centers =  [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
final cluster centers =  [array([0.59172727, 0.39990909]), array([0.6515 , 0.16325]), array([0.3492, 0.2076])]
[[array([0.774, 0.376]), array([0.608, 0.318]), array([0.714, 0.346]), array([0.483, 0.312]), array([0.478, 0.437]), array([0.525, 0.369]), array([0.751, 0.489]), array([0.532, 0.472]), array([0.473, 0.376]), array([0.725, 0.445]), array([0.446, 0.459])], [array([0.634, 0.264]), array([0.556, 0.215]), array([0.666, 0.091]), array([0.639, 0.161]), array([0.657, 0.198]), array([0.593, 0.042]), array([0.719, 0.103]), array([0.748, 0.232])], [array([0.403, 0.237]), array([0.481, 0.149]), array([0.437, 0.211]), array([0.243, 0.267]), array([0.245, 0.057]), array([0.343, 0.099]), array([0.36, 0.37]), array([0.359, 0.188]), array([0.339, 0.241]), array([0.282, 0.257])]]
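
As a sanity check on the log, the reported center of a cluster should equal the mean of the points listed for it. The sketch below copies the second cluster from the printed output and recomputes its mean; the variable names are made up for this check.

```python
import numpy as np

# second cluster exactly as printed in the output
cluster = np.array([
    [0.634, 0.264], [0.556, 0.215], [0.666, 0.091], [0.639, 0.161],
    [0.657, 0.198], [0.593, 0.042], [0.719, 0.103], [0.748, 0.232],
])

# mean vector of the cluster: (1 / |C|) * sum of its points
center = cluster.mean(axis=0)
print(center)  # approximately [0.6515, 0.16325]
```

The coordinates sum to 5.212 and 1.306, and dividing by 8 gives the second center reported in the final line of the log.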


Result

[Figure: scatter plot of the three final clusters, drawn in red, green, and black]

Posted by 筷点雪糕侠