聚类算法

聚类算法

​ 聚类算法是机器学习中涉及对数据进行分组的一种算法。在给定的数据集中,我们可以通过聚类算法将其分成一些不同的组。在理论上,相同的组的数据之间有相同的属性或者是特征,不同组数据之间的属性或者特征相差就会比较大。聚类算法是一种非监督学习算法,并且作为一种常用的数据分析算法在很多领域上得到应用。雷达信号处理中具有广泛的应用。雷达信号聚类可以帮助识别和分类不同类型的目标、检测异常或故障,并提取有用的信息。本文章用来总结一些常用的聚类算法。

K-means聚类

K-means的实现步骤

K-means算法是一种常见的聚类算法,其原理可以简单描述为以下几个步骤:

  1. 初始化:选择要分成的簇的数量k,并随机选择k个中心点作为初始簇中心。
  2. 分配数据点:对于每个数据点,计算其与各个簇中心点之间的距离,将数据点分配给最近的簇。
  3. 更新簇中心:对于每个簇,计算簇中所有数据点的平均值,更新该簇的中心点位置。
  4. 重复步骤2和步骤3,直到达到终止条件(例如,簇中心不再变化或达到最大迭代次数)。
  5. 输出结果:返回簇的分配情况和最终的簇中心。

​ K-means算法的目标是最小化数据点与其所属簇中心之间的总平方距离(也称为误差平方和)。算法通过迭代优化来找到最优的簇中心位置,使得每个数据点都被分配到与其最近的簇。

​ K-means算法的关键参数是要分成的簇的数量k。选择合适的k值对于聚类结果的准确性和可解释性非常重要。通常会使用启发式方法(如肘部法则、轮廓系数等)来帮助选择最佳的k值。

​ 需要注意的是,K-means算法对初始簇中心的选择敏感。不同的初始簇中心可能导致不同的聚类结果。为了获得更稳定和可靠的结果,可以多次运行算法,并从多个随机初始点开始进行聚类,并选择具有最小误差的结果。

K-means的优缺点

​ 根据以上的实现原理,我们不难分析K-means算法具有以下优点和缺点:

​ 优点:

  1. 简单而有效:K-means算法是一种简单而直观的聚类算法,易于实现和理解。
  2. 可扩展性:K-means算法对大规模数据集的处理能力较强,可以有效地处理大量数据。
  3. 高效性:K-means算法的计算速度通常比其他聚类算法快,尤其是在高维数据集上。
  4. 聚类结果可解释性强:K-means算法的聚类结果较为直观,每个数据点都被分配到最近的簇中心,并可以根据簇中心来解释和理解聚类结果。

​ 缺点:

  1. 对初始中心点敏感:K-means算法对初始中心点的选择非常敏感,不同的初始中心可能导致不同的聚类结果。因此,需要进行多次运行以获取稳定的结果。
  2. 需要预先确定簇的数量:K-means算法需要预先指定要分成的簇的数量k,这对于某些情况下可能并不容易确定,选择不合适的k值可能会导致不佳的聚类结果。
  3. 不适用于非凸形状的聚类:K-means算法假设簇为凸形状,即每个簇可以用一个中心点来表示。因此,对于非凸形状的聚类问题,K-means算法可能会产生不准确的结果。
  4. 噪声和异常值敏感:K-means算法对噪声和异常值较为敏感,这些离群点可能会显著影响簇的分配结果。

​ 综合考虑,K-means算法是一种简单而有效的聚类算法,适用于大规模数据集和对可解释性要求较高的场景。但是在应用时需要注意初始中心点的选择和确定合适的簇数量,以及对噪声和异常值的处理。对于非凸形状的聚类问题,可能需要考虑其他更适合的聚类算法。

代码实现

matlab代码实现

% 生成随机数据
rng(1); % 设置随机种子,以便结果可复现
data = randn(100, 2); % 生成100个二维随机数据点

% 设置参数
k = 3; % 簇的数量
maxIterations = 100; % 最大迭代次数

% 初始化簇中心
centerIndices = randperm(size(data, 1), k); % 随机选择k个数据作为初始簇中心
centers = data(centerIndices, :);

% 迭代更新簇中心和分配数据点
for iter = 1:maxIterations
    % 计算每个数据点与各个簇中心的距离
    distances = pdist2(data, centers);
    
    % 分配数据点到最近的簇
    [~, clusterIndices] = min(distances, [], 2);
    
    % 更新簇中心
    for i = 1:k
        centers(i, :) = mean(data(clusterIndices == i, :));
    end
end

% 绘制聚类结果
colors = {'r', 'g', 'b'}; % 簇的颜色
figure;
hold on;
for i = 1:k
    clusterData = data(clusterIndices == i, :);
    scatter(clusterData(:, 1), clusterData(:, 2), colors{i});
end
scatter(centers(:, 1), centers(:, 2), 'kx', 'LineWidth', 2);
title('K-means Clustering');

​ k-means效果

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <stdbool.h>
#include <float.h>

#define MAX_ITERATIONS                (100U)
#define INFINITY                      (FLT_MAX)

/* return the Euclidean distance between two points */
float point_distance(float x1, float y1, float x2, float y2)
{
    return sqrt(pow(x2 - x1, 2) + pow(y2 - y1, 2));
}

void kmeans(float data[][2], const int data_size, const int clusters_num, float centroids[][2])
{
    int nearest_cluster = 0;
    float* cluster_counts = (float*)malloc(clusters_num * sizeof(float));
    float* cluster_sums = (float*)malloc(2 * clusters_num * sizeof(float));
    float* new_centroids = (float*)malloc(2 * clusters_num *sizeof(float));

    int iterations = 0;

    /* Initialize the clusters for clustering */
    for (int i = 0; i < clusters_num; i++)
    {
        *(cluster_counts+i) = 0;
        *(cluster_sums+(2*i)) = 0.0f;
        *(cluster_sums+(2*i)+1) = 0.0f;
    }
    /* Iteratively update the cluster centers until convergence or the maximum number of iterations is reached */
    while(iterations < MAX_ITERATIONS)
    {
        /* 1 Clear the count and sum of new cluster centers */
        for (int i = 0; i < clusters_num; i++)
        {
            *(new_centroids + (2 * i)) = 0.0f;
            *(new_centroids + (2 * i) + 1) = 0.0f;
            *(cluster_counts + i) = 0;
            *(cluster_sums + (2 * i)) = 0.0f;
            *(cluster_sums + (2 * i) + 1) = 0.0f;
        }

        /* 2 Assign each point to the nearest cluster */
        for (int i = 0; i < data_size; i++)
        {
            float min_dist = INFINITY;
            nearest_cluster = -1;
            for (int j = 0; j < clusters_num; j++)
            {
                float dist = point_distance(data[i][0], data[i][1], centroids[j][0], centroids[j][1]);
                if(dist < min_dist)
                {
                    min_dist = dist;
                    nearest_cluster = j;
                }
            }
            /* 3 Update the count and sum of the cluster to which this point belongs. */
            cluster_counts[nearest_cluster]++;
            *(cluster_sums + (2 * nearest_cluster)) += data[i][0];
            *(cluster_sums + (2 * nearest_cluster) + 1) += data[i][1];
        }

        /* 4 Calculate new cluster centers. */
        for (int k = 0; k < clusters_num; k++) 
        {
            *(new_centroids + (2 * k)) = ((*(cluster_sums+2*k))/cluster_counts[k]);
            *(new_centroids + (2 * k) + 1) = ((*(cluster_sums + 2*k+1)) / cluster_counts[k]);
        }

        /* 5 Check if the cluster centers have changed, and if not, stop iterating. */
        bool centroids_changed = 0;
        for (int i = 0; i < clusters_num; i++) 
        {
            if (point_distance(centroids[i][0], centroids[i][1], *(new_centroids + (2 * i)), *(new_centroids + (2 * i) + 1)) > 0.001)
            {
                centroids_changed = true;
                break;
            }
        }
        if (!centroids_changed) 
        {
            break;
        }
        /* 6 Update cluster centers */
        for (int i = 0; i < clusters_num; i++) 
        {
            centroids[i][0] = *(new_centroids + (2 * i));
            centroids[i][1] = *(new_centroids + (2 * i) + 1);
        }

        iterations++;
    }
    for (int i = 0; i < clusters_num; i++) 
    {
        printf("i %d,centroids[i][0]:%f, centroids[i][1]:%f\r\n", i, centroids[i][0], centroids[i][1]);
    }


float data[][2] =
{
    -0.649013765191241,0.365489884844483,
    1.18116604196553,-1.09706118485406,
    -0.758453297283692,1.93021303957797,
    -1.10961303850152,0.622936146753982,
    -0.845551240007797,0.657284334157919,
    -0.572664866457950,-1.46338342995195,
    -0.558680764473972,0.853935439880633,
    0.178380225849766,0.580489017953016,
    -0.196861446475943,-0.918600977803510,
    0.586442621667069,0.794865099567556,
    -0.851886969622469,0.517535108228584,
    0.800320709801823,0.494614479368152,
    -1.50940472473439,0.663929749577619,
    0.875874147834533,-0.710171982852718,
    -0.242789536333340,-1.30683763858013,
    0.166813439453503,-0.741588798931374,
    -1.96541870928278,-1.46765924999935,
    -1.27007139263854,-0.391674702884971,
    1.17517126546302,0.841658856202464,
    2.02916018474976,0.0827838447500025,
    -0.275157240675694,0.314671414095507,
    0.603658445825815,0.789805048123364,
    1.78125189324250,-0.801223905939208,
    1.77365832632615,-0.325654259400935,
    -1.86512257453063,0.284676319250753,
    -1.05110705924059,1.30961815303393,
    -0.417382047996795,0.160373337872621,
    1.40216228633781,-2.11818831683966,
    -1.36774699097611,0.707080597764518,
    -0.292534999151874,-1.04341356999711,
    1.27084843418894,1.06820660928836,
    0.0660093412882059,-0.317233553000261,
    0.451290213630776,1.47967667627648,
    -0.322209718011896,0.699087988758093,
    0.788409216227425,0.159099278102292,
    0.928736046813314,-0.945480508446840,
    -0.490790376269763,-0.793006508081058,
    1.79720058425494,-2.04923899287211,
    0.590696551205452,-2.35883505694189,
    -0.635785737847226,-1.65926925495582,
    0.603346612845761,-0.958123762757588,
    -0.535247967775900,0.225729866287135,
    -0.155080385492789,0.217665351270991,
    0.612122370772160,-0.823239284502371,
    -1.04434349451734,-1.01276803434617,
    -0.345631908307050,1.21525791551081,
    -1.17140482049761,0.156275375341672,
    -0.685586780437283,-0.400257312164849,
    0.926216394168962,-0.441779076742656,
    -1.48167521167231,0.448102274787604,
    -0.558057808685045,-1.66459080006458,
    -0.0284531115706568,0.214890241522152,
    -1.47629235201010,0.549563133368974,
    0.258899957160403,1.39233785356506,
    -2.01869095243834,-0.619227629415020,
    0.199740262298379,-0.0126012833273179,
    0.425864319131210,0.773612390473314,
    -1.27004345059705,1.62921206030486,
    -0.485218835743043,-1.40997503421309,
    0.594307616829848,-1.74728260275247,
    -0.276464906639256,-0.472246143478651,
    -1.85758288592737,-0.0600882414964135,
    0.0407308117494288,0.438878570591521,
    0.282970177161990,0.201222206924178,
    0.0635612193024994,-0.583298376637002,
    0.433430065111595,0.764796690674390,
    0.422860364487685,0.140769815626260,
    1.29952829655200,-0.372936894896390,
    -1.04979323447507,0.105466931719111,
    -1.78641172211092,1.27062430143441,
    0.816043081031918,0.499129927163972,
    -0.328208543142512,-0.397024986016533,
    -1.21456561358767,-1.78996819380666,
    1.11183287253465,-0.266894331854867,
    -0.507496954829846,0.178431074753012,
    0.898730486034072,-0.434192174480561,
    0.377215659958544,0.464513248320177,
    1.45239164558790,-1.12144527588722,
    0.446945073178942,-0.359075138759723,
    0.645824788453030,0.532266690356129,
    -0.623677409296163,-1.64350780294275,
    -0.595236431548712,0.466899224814047,
    1.61132368718055,0.112107258838170,
    -0.348998045314167,1.49654440857240,
    0.164167484938754,-0.586502255515034,
    -1.63657708517891,-1.71893202713396,
    0.581365555343623,0.741040148219291,
    -0.128905996910632,1.08769551523333,
    0.432858634222399,0.756670133381815,
    -0.245109040039237,1.62969395971859,
    -1.08543038934632,-1.37499337757164,
    1.68080151955536,-1.05201055264159,
    0.176411940863882,0.477517893688704,
    -2.07143962693628,1.22217693430612,
    0.211089334851037,2.37073224069742,
    -0.582847822547194,0.114585880363316,
    0.0181688430923922,0.279069293453606,
    1.49477799287395,0.752079823603291,
    -0.424796733441211,-0.260256948894717,
    1.68624315536028,-0.0259932042401371
};

float centroids[3][2] =
{
    -0.535247967775900,0.225729866287135,
    -0.328208543142512,-0.397024986016533,
    -0.649013765191241,0.365489884844483,
};

int main(void)
{
    kmeans(data, sizeof(data)/sizeof(data[0]), 3, centroids);
    return 0;
}


C语言算法和MATLAB对比后,运行结果一致。仿真实现完成

posted @ 2023-07-29 23:14  Kroner  阅读(161)  评论(0编辑  收藏  举报