The k-means Algorithm - Data Mining Algorithms (5)

(2017-05-02 银河统计)

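The k-means algorithm, also called k-averages, is a clustering algorithm widely used in data mining. It represents each cluster by the mean of all data samples in that cluster. The main idea is to partition the data set into clusters through an iterative process so that the criterion function measuring clustering quality reaches an optimum (in practice usually a local one), which makes each resulting cluster internally compact and the clusters well separated from one another.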

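I. Computation Steps

Given n samples, each an m-dimensional vector \((X_{i1},X_{i2},\dots,X_{im}), i=1,2,\dots,n\):

1. Randomly choose k of the n samples as the initial cluster centers;
2. Choose a distance (e.g. Euclidean distance) as the similarity measure between samples, compute the distance from every sample to each of the k cluster centers, and assign each sample to the cluster whose center is nearest;
3. Following the error criterion, compute the within-cluster sum of squared errors (the squared distances from each cluster center to the points assigned to it) and the total over all clusters;
4. Compute the centroid (mean) of the samples in each of the k clusters and use these centroids as the new cluster centers.

Repeat steps 2 to 4 until the cluster centers no longer change.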
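The loop described above can be sketched in plain JavaScript. This is a minimal illustration, not the webTJ implementation: the function names (`squaredDistance`, `nearestCenter`, `centroid`, `kMeans`) are hypothetical, and the initial centers are passed in explicitly instead of being drawn at random so that a run is reproducible.

```javascript
// Squared Euclidean distance between two points of equal dimension.
function squaredDistance(a, b) {
  let sum = 0;
  for (let j = 0; j < a.length; j++) sum += (a[j] - b[j]) ** 2;
  return sum;
}

// Index of the center closest to the point (ties go to the lower index).
function nearestCenter(point, centers) {
  let best = 0;
  for (let c = 1; c < centers.length; c++) {
    if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) best = c;
  }
  return best;
}

// Component-wise mean of a non-empty list of points.
function centroid(points) {
  const mean = new Array(points[0].length).fill(0);
  for (const p of points) {
    for (let j = 0; j < mean.length; j++) mean[j] += p[j] / points.length;
  }
  return mean;
}

// Alternate between assignment and centroid update until the centers stop moving.
function kMeans(points, initialCenters, maxIterations = 100) {
  let centers = initialCenters.map(c => c.slice());
  let labels = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    labels = points.map(p => nearestCenter(p, centers));
    const updated = centers.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      return members.length > 0 ? centroid(members) : old; // an empty cluster keeps its center
    });
    const moved = updated.some((u, c) => squaredDistance(u, centers[c]) > 1e-12);
    centers = updated;
    if (!moved) break;
  }
  return { centers, labels };
}

// Five 2-D samples, with the first two samples as initial centers.
const result = kMeans([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], [[0, 2], [0, 0]]);
console.log(result.labels); // [0, 1, 1, 1, 0]
```

Choosing the k initial centers at random, as step 1 prescribes, is left to the caller here; different starting centers can lead to different final clusters.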

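II. Worked Example

A small sample data set is given in the table below:

S    X    Y
1    0    2
2    0    0
3    1.5  0
4    5    0
5    5    2

Cluster the data with the k-means algorithm (\(k=2\)).

Solution:

1. Choose \(S_1(0,2)\) and \(S_2(0,0)\) as the initial cluster centers, i.e. \(M_1=S_1(0,2)\), \(M_2=S_2(0,0)\).

2. Compute the Euclidean distance from each remaining sample to the 2 cluster centers;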

\(S_3\)

\[D(S_3,M_1)=\sqrt{(1.5-0)^2+(0-2)^2}=2.5,\hspace{0.5cm}D(S_3,M_2)=\sqrt{(1.5-0)^2+(0-0)^2}=1.5 \]

\(D(S_3,M_2)<D(S_3,M_1)\), so \(S_3\) is assigned to the second cluster \(C_2\).

\(S_4\)

\[D(S_4,M_1)=\sqrt{(5-0)^2+(0-2)^2}=5.385,\hspace{0.5cm}D(S_4,M_2)=\sqrt{(5-0)^2+(0-0)^2}=5 \]

\(D(S_4,M_2)<D(S_4,M_1)\), so \(S_4\) is assigned to the second cluster \(C_2\).

\(S_5\)

\[D(S_5,M_1)=\sqrt{(5-0)^2+(2-2)^2}=5,\hspace{0.5cm}D(S_5,M_2)=\sqrt{(5-0)^2+(2-0)^2}=5.385 \]

\(D(S_5,M_2)>D(S_5,M_1)\), so \(S_5\) is assigned to the first cluster \(C_1\).

This gives the new clusters \(C_1=\{S_1,S_5\}\) and \(C_2=\{S_2,S_3,S_4\}\).
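The three distance comparisons above can be reproduced with a few lines of JavaScript. This is only a check of the arithmetic; the helper name `euclidean` and the variable names are illustrative.

```javascript
// Euclidean distance between two points of equal dimension.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, v, j) => sum + (v - b[j]) ** 2, 0));
}

const M1 = [0, 2], M2 = [0, 0]; // initial centers S1 and S2
const samples = { S3: [1.5, 0], S4: [5, 0], S5: [5, 2] };
const assignment = {};
for (const [name, p] of Object.entries(samples)) {
  const d1 = euclidean(p, M1), d2 = euclidean(p, M2);
  assignment[name] = d1 < d2 ? "C1" : "C2"; // the nearer center wins
  console.log(name, d1.toFixed(3), d2.toFixed(3), assignment[name]);
}
// S3 2.500 1.500 C2
// S4 5.385 5.000 C2
// S5 5.000 5.385 C1
```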

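3. Compute the within-cluster sum of squared errors (the squared distances from each cluster center to the other points in its cluster) and the total;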

\[E_1=[(5-0)^2+(2-2)^2]=25\hspace{0.5cm}E_2=[(1.5-0)^2+(0-0)^2]+[(5-0)^2+(0-0)^2]=27.25 \]

The total squared error is: \(E=E_1+E_2=25+27.25=52.25\)

4. Compute the centroid of the samples in each of the 2 clusters to obtain the new cluster centers

\[M_1=(\frac{0+5}{2},\frac{2+2}{2})=(2.5,2),\hspace{0.5cm}M_2=(\frac{0+1.5+5}{3},\frac{0+0+0}{3})=(2.17,0) \]
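Steps 3 and 4 can be verified numerically as well. The sketch below uses hypothetical helper names (`sumSquaredError`, `centroid`); the errors are measured from the old centers \(S_1\) and \(S_2\), exactly as in the formulas above.

```javascript
// Sum of squared distances from each cluster member to a reference center.
function sumSquaredError(members, center) {
  return members.reduce(
    (total, p) => total + p.reduce((s, v, j) => s + (v - center[j]) ** 2, 0), 0);
}

// Component-wise mean of the cluster members.
function centroid(members) {
  return members[0].map((_, j) =>
    members.reduce((s, p) => s + p[j], 0) / members.length);
}

const C1 = [[0, 2], [5, 2]];            // S1, S5
const C2 = [[0, 0], [1.5, 0], [5, 0]];  // S2, S3, S4

const E1 = sumSquaredError(C1, [0, 2]); // 25, measured from the old center S1
const E2 = sumSquaredError(C2, [0, 0]); // 27.25, measured from the old center S2
const newM1 = centroid(C1);             // [2.5, 2]
const newM2 = centroid(C2);             // [13/6, 0], about (2.17, 0)
console.log(E1 + E2, newM1, newM2);
```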

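With \(M_1=(2.5,2)\) and \(M_2=(2.17,0)\) as the new cluster centers, repeat steps 2 and 3. The clusters are again \(C_1=\{S_1,S_5\}\) and \(C_2=\{S_2,S_3,S_4\}\).

The centroids of the 2 clusters are therefore still \(M_1=(2.5,2)\) and \(M_2=(2.17,0)\), and the within-cluster sums of squared errors are: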

\[E_1=[(0-2.5)^2+(2-2)^2]+[(5-2.5)^2+(2-2)^2]=12.5 \]

\[E_2=[(0-2.17)^2+(0-0)^2]+[(1.5-2.17)^2+(0-0)^2]+[(5-2.17)^2+(0-0)^2]=13.17 \]

The total squared error is: \(E=E_1+E_2=12.5+13.17=25.67\)

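After the first iteration the total squared error drops from 52.25 to 25.67, a substantial reduction. Because the cluster centers are the same in two consecutive iterations, the iteration stops and the algorithm terminates.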
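The size of this drop can be cross-checked with a standard identity, added here as a check and not part of the original derivation. For a cluster of \(n\) points \(x_i\) with centroid \(\mu\) and an arbitrary reference point \(c\):

\[ \sum_{i=1}^{n}\lVert x_i-c\rVert^2=\sum_{i=1}^{n}\lVert x_i-\mu\rVert^2+n\lVert\mu-c\rVert^2 \]

Taking \(c\) to be the old centers \(S_1\) and \(S_2\), and using the exact centroid \(M_2=(13/6,0)\):

\[ C_1:\ 12.5+2\times 2.5^2=25,\hspace{0.5cm}C_2:\ 13.17+3\times\left(\tfrac{13}{6}\right)^2\approx 13.17+14.08=27.25 \]

Both values agree with the errors computed before the centers were updated. Since the second term of the identity is never negative, moving a center to the centroid of its cluster cannot increase that cluster's squared error.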

III. Sample Code

The example uses the iris data set:

Iris data (a classic clustering and classification data set in R)

ID   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    4.6           3.1          1.5           0.2          setosa
5    5.0           3.6          1.4           0.2          setosa
6    5.4           3.9          1.7           0.4          setosa
7    4.6           3.4          1.4           0.3          setosa
8    5.0           3.4          1.5           0.2          setosa
9    4.4           2.9          1.4           0.2          setosa
10   4.9           3.1          1.5           0.1          setosa
11   5.4           3.7          1.5           0.2          setosa
12   4.8           3.4          1.6           0.2          setosa
13   4.8           3.0          1.4           0.1          setosa
14   4.3           3.0          1.1           0.1          setosa
15   5.8           4.0          1.2           0.2          setosa
16   5.7           4.4          1.5           0.4          setosa
17   5.4           3.9          1.3           0.4          setosa
18   5.1           3.5          1.4           0.3          setosa
19   5.7           3.8          1.7           0.3          setosa
20   5.1           3.8          1.5           0.3          setosa
21   5.4           3.4          1.7           0.2          setosa
22   5.1           3.7          1.5           0.4          setosa
23   4.6           3.6          1.0           0.2          setosa
24   5.1           3.3          1.7           0.5          setosa
25   4.8           3.4          1.9           0.2          setosa
26   5.0           3.0          1.6           0.2          setosa
27   5.0           3.4          1.6           0.4          setosa
28   5.2           3.5          1.5           0.2          setosa
29   5.2           3.4          1.4           0.2          setosa
30   4.7           3.2          1.6           0.2          setosa
31   4.8           3.1          1.6           0.2          setosa
32   5.4           3.4          1.5           0.4          setosa
33   5.2           4.1          1.5           0.1          setosa
34   5.5           4.2          1.4           0.2          setosa
35   4.9           3.1          1.5           0.2          setosa
36   5.0           3.2          1.2           0.2          setosa
37   5.5           3.5          1.3           0.2          setosa
38   4.9           3.6          1.4           0.1          setosa
39   4.4           3.0          1.3           0.2          setosa
40   5.1           3.4          1.5           0.2          setosa
41   5.0           3.5          1.3           0.3          setosa
42   4.5           2.3          1.3           0.3          setosa
43   4.4           3.2          1.3           0.2          setosa
44   5.0           3.5          1.6           0.6          setosa
45   5.1           3.8          1.9           0.4          setosa
46   4.8           3.0          1.4           0.3          setosa
47   5.1           3.8          1.6           0.2          setosa
48   4.6           3.2          1.4           0.2          setosa
49   5.3           3.7          1.5           0.2          setosa
50   5.0           3.3          1.4           0.2          setosa
51   7.0           3.2          4.7           1.4          versicolor
52   6.4           3.2          4.5           1.5          versicolor
53   6.9           3.1          4.9           1.5          versicolor
54   5.5           2.3          4.0           1.3          versicolor
55   6.5           2.8          4.6           1.5          versicolor
56   5.7           2.8          4.5           1.3          versicolor
57   6.3           3.3          4.7           1.6          versicolor
58   4.9           2.4          3.3           1.0          versicolor
59   6.6           2.9          4.6           1.3          versicolor
60   5.2           2.7          3.9           1.4          versicolor
61   5.0           2.0          3.5           1.0          versicolor
62   5.9           3.0          4.2           1.5          versicolor
63   6.0           2.2          4.0           1.0          versicolor
64   6.1           2.9          4.7           1.4          versicolor
65   5.6           2.9          3.6           1.3          versicolor
66   6.7           3.1          4.4           1.4          versicolor
67   5.6           3.0          4.5           1.5          versicolor
68   5.8           2.7          4.1           1.0          versicolor
69   6.2           2.2          4.5           1.5          versicolor
70   5.6           2.5          3.9           1.1          versicolor
71   5.9           3.2          4.8           1.8          versicolor
72   6.1           2.8          4.0           1.3          versicolor
73   6.3           2.5          4.9           1.5          versicolor
74   6.1           2.8          4.7           1.2          versicolor
75   6.4           2.9          4.3           1.3          versicolor
76   6.6           3.0          4.4           1.4          versicolor
77   6.8           2.8          4.8           1.4          versicolor
78   6.7           3.0          5.0           1.7          versicolor
79   6.0           2.9          4.5           1.5          versicolor
80   5.7           2.6          3.5           1.0          versicolor
81   5.5           2.4          3.8           1.1          versicolor
82   5.5           2.4          3.7           1.0          versicolor
83   5.8           2.7          3.9           1.2          versicolor
84   6.0           2.7          5.1           1.6          versicolor
85   5.4           3.0          4.5           1.5          versicolor
86   6.0           3.4          4.5           1.6          versicolor
87   6.7           3.1          4.7           1.5          versicolor
88   6.3           2.3          4.4           1.3          versicolor
89   5.6           3.0          4.1           1.3          versicolor
90   5.5           2.5          4.0           1.3          versicolor
91   5.5           2.6          4.4           1.2          versicolor
92   6.1           3.0          4.6           1.4          versicolor
93   5.8           2.6          4.0           1.2          versicolor
94   5.0           2.3          3.3           1.0          versicolor
95   5.6           2.7          4.2           1.3          versicolor
96   5.7           3.0          4.2           1.2          versicolor
97   5.7           2.9          4.2           1.3          versicolor
98   6.2           2.9          4.3           1.3          versicolor
99   5.1           2.5          3.0           1.1          versicolor
100  5.7           2.8          4.1           1.3          versicolor
101  6.3           3.3          6.0           2.5          virginica
102  5.8           2.7          5.1           1.9          virginica
103  7.1           3.0          5.9           2.1          virginica
104  6.3           2.9          5.6           1.8          virginica
105  6.5           3.0          5.8           2.2          virginica
106  7.6           3.0          6.6           2.1          virginica
107  4.9           2.5          4.5           1.7          virginica
108  7.3           2.9          6.3           1.8          virginica
109  6.7           2.5          5.8           1.8          virginica
110  7.2           3.6          6.1           2.5          virginica
111  6.5           3.2          5.1           2.0          virginica
112  6.4           2.7          5.3           1.9          virginica
113  6.8           3.0          5.5           2.1          virginica
114  5.7           2.5          5.0           2.0          virginica
115  5.8           2.8          5.1           2.4          virginica
116  6.4           3.2          5.3           2.3          virginica
117  6.5           3.0          5.5           1.8          virginica
118  7.7           3.8          6.7           2.2          virginica
119  7.7           2.6          6.9           2.3          virginica
120  6.0           2.2          5.0           1.5          virginica
121  6.9           3.2          5.7           2.3          virginica
122  5.6           2.8          4.9           2.0          virginica
123  7.7           2.8          6.7           2.0          virginica
124  6.3           2.7          4.9           1.8          virginica
125  6.7           3.3          5.7           2.1          virginica
126  7.2           3.2          6.0           1.8          virginica
127  6.2           2.8          4.8           1.8          virginica
128  6.1           3.0          4.9           1.8          virginica
129  6.4           2.8          5.6           2.1          virginica
130  7.2           3.0          5.8           1.6          virginica
131  7.4           2.8          6.1           1.9          virginica
132  7.9           3.8          6.4           2.0          virginica
133  6.4           2.8          5.6           2.2          virginica
134  6.3           2.8          5.1           1.5          virginica
135  6.1           2.6          5.6           1.4          virginica
136  7.7           3.0          6.1           2.3          virginica
137  6.3           3.4          5.6           2.4          virginica
138  6.4           3.1          5.5           1.8          virginica
139  6.0           3.0          4.8           1.8          virginica
140  6.9           3.1          5.4           2.1          virginica
141  6.7           3.1          5.6           2.4          virginica
142  6.9           3.1          5.1           2.3          virginica
143  5.8           2.7          5.1           1.9          virginica
144  6.8           3.2          5.9           2.3          virginica
145  6.7           3.3          5.7           2.5          virginica
146  6.7           3.0          5.2           2.3          virginica
147  6.3           2.5          5.0           1.9          virginica
148  6.5           3.0          5.2           2.0          virginica
149  6.2           3.4          5.4           2.3          virginica
150  5.9           3.0          5.1           1.8          virginica

## Function - k-means algorithm
    webTJ.Datamining.setKmeans(arrs,k);
## Parameters
    【arrs,k】
    【sample array, number of clusters】

Code sample

var oTxt="5.1,3.5,1.4,0.2|4.9,3,1.4,0.2|4.7,3.2,1.3,0.2|4.6,3.1,1.5,0.2|5,3.6,1.4,0.2|5.4,3.9,1.7,0.4|4.6,3.4,1.4,0.3|5,3.4,1.5,0.2|4.4,2.9,1.4,0.2|4.9,3.1,1.5,0.1|5.4,3.7,1.5,0.2|4.8,3.4,1.6,0.2|4.8,3,1.4,0.1|4.3,3,1.1,0.1|5.8,4,1.2,0.2|5.7,4.4,1.5,0.4|5.4,3.9,1.3,0.4|5.1,3.5,1.4,0.3|5.7,3.8,1.7,0.3|5.1,3.8,1.5,0.3|5.4,3.4,1.7,0.2|5.1,3.7,1.5,0.4|4.6,3.6,1,0.2|5.1,3.3,1.7,0.5|4.8,3.4,1.9,0.2|5,3,1.6,0.2|5,3.4,1.6,0.4|5.2,3.5,1.5,0.2|5.2,3.4,1.4,0.2|4.7,3.2,1.6,0.2|4.8,3.1,1.6,0.2|5.4,3.4,1.5,0.4|5.2,4.1,1.5,0.1|5.5,4.2,1.4,0.2|4.9,3.1,1.5,0.2|5,3.2,1.2,0.2|5.5,3.5,1.3,0.2|4.9,3.6,1.4,0.1|4.4,3,1.3,0.2|5.1,3.4,1.5,0.2|5,3.5,1.3,0.3|4.5,2.3,1.3,0.3|4.4,3.2,1.3,0.2|5,3.5,1.6,0.6|5.1,3.8,1.9,0.4|4.8,3,1.4,0.3|5.1,3.8,1.6,0.2|4.6,3.2,1.4,0.2|5.3,3.7,1.5,0.2|5,3.3,1.4,0.2|7,3.2,4.7,1.4|6.4,3.2,4.5,1.5|6.9,3.1,4.9,1.5|5.5,2.3,4,1.3|6.5,2.8,4.6,1.5|5.7,2.8,4.5,1.3|6.3,3.3,4.7,1.6|4.9,2.4,3.3,1|6.6,2.9,4.6,1.3|5.2,2.7,3.9,1.4|5,2,3.5,1|5.9,3,4.2,1.5|6,2.2,4,1|6.1,2.9,4.7,1.4|5.6,2.9,3.6,1.3|6.7,3.1,4.4,1.4|5.6,3,4.5,1.5|5.8,2.7,4.1,1|6.2,2.2,4.5,1.5|5.6,2.5,3.9,1.1|5.9,3.2,4.8,1.8|6.1,2.8,4,1.3|6.3,2.5,4.9,1.5|6.1,2.8,4.7,1.2|6.4,2.9,4.3,1.3|6.6,3,4.4,1.4|6.8,2.8,4.8,1.4|6.7,3,5,1.7|6,2.9,4.5,1.5|5.7,2.6,3.5,1|5.5,2.4,3.8,1.1|5.5,2.4,3.7,1|5.8,2.7,3.9,1.2|6,2.7,5.1,1.6|5.4,3,4.5,1.5|6,3.4,4.5,1.6|6.7,3.1,4.7,1.5|6.3,2.3,4.4,1.3|5.6,3,4.1,1.3|5.5,2.5,4,1.3|5.5,2.6,4.4,1.2|6.1,3,4.6,1.4|5.8,2.6,4,1.2|5,2.3,3.3,1|5.6,2.7,4.2,1.3|5.7,3,4.2,1.2|5.7,2.9,4.2,1.3|6.2,2.9,4.3,1.3|5.1,2.5,3,1.1|5.7,2.8,4.1,1.3|6.3,3.3,6,2.5|5.8,2.7,5.1,1.9|7.1,3,5.9,2.1|6.3,2.9,5.6,1.8|6.5,3,5.8,2.2|7.6,3,6.6,2.1|4.9,2.5,4.5,1.7|7.3,2.9,6.3,1.8|6.7,2.5,5.8,1.8|7.2,3.6,6.1,2.5|6.5,3.2,5.1,2|6.4,2.7,5.3,1.9|6.8,3,5.5,2.1|5.7,2.5,5,2|5.8,2.8,5.1,2.4|6.4,3.2,5.3,2.3|6.5,3,5.5,1.8|7.7,3.8,6.7,2.2|7.7,2.6,6.9,2.3|6,2.2,5,1.5|6.9,3.2,5.7,2.3|5.6,2.8,4.9,2|7.7,2.8,6.7,2|6.3,2.7,4.9,1.8|6.7,3.3,5.7,2.1|7.2,3.2,6,1.8|6.2,2.8,4.8,1.8|6.1,3,4.9,1.8|6.4,2.8,5.6,2.1|7.2,3,5.8,1.6|7.4,2.8,6.1,1.9|7.9,3.8,6.4,2|6.4,2.8,5.6,2.2|6.3,2.8,5.1,1.5|6.1,2.6,5.6,1.4|7.7,3,6.1,2.3|6.3,3.4,5.6,2.4|6.4,3.1,5.5,1.8|6,3,4.8,1.8|6.9,3.1,5.4,2.1|6.7,3.1,5.6,2.4|6.9,3.1,5.1,2.3|5.8,2.7,5.1,1.9|6.8,3.2,5.9,2.3|6.7,3.3,5.7,2.5|6.7,3,5.2,2.3|6.3,2.5,5,1.9|6.5,3,5.2,2|6.2,3.4,5.4,2.3|5.9,3,5.1,1.8";
var oArrs=webTJ.getArrs(oTxt,"|",",");
oArrs=webTJ.Array.getQuantify(oArrs); // convert the sample values to numbers
webTJ.Datamining.setKmeans(oArrs,3); // cluster the samples into 3 groups

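Note: in the code, the iris data has been converted to a formatted string that excludes the ID column and the class column (the last column, Species).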
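For readers without the webTJ library, the string-to-array step can be approximated in plain JavaScript. The internals of `webTJ.getArrs` and `webTJ.Array.getQuantify` are not shown in this post, so the helper below (`parseSamples`, a hypothetical name) is only an assumption about what the two calls achieve together: splitting rows on "|", columns on ",", and converting every cell to a number.

```javascript
// Split a formatted string into rows and columns, converting each cell to a number.
function parseSamples(text, rowSeparator = "|", columnSeparator = ",") {
  return text
    .split(rowSeparator)
    .map(row => row.split(columnSeparator).map(cell => Number(cell.trim())));
}

const firstThree = parseSamples("5.1,3.5,1.4,0.2|4.9,3,1.4,0.2|4.7,3.2,1.3,0.2");
console.log(firstThree[1]); // [4.9, 3, 1.4, 0.2]
```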

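IV. Case Studies

Case 1: Cluster analysis of population education levels

To gain a deeper understanding of education levels across China's population, data from the 1990 national census are used to cluster the 30 provinces, municipalities, and autonomous regions. Three indicators are used: (1) the share of the total population with a university education or higher (DXBZ); (2) the share with a junior secondary education (CZBZ); (3) the share that is illiterate or semi-literate (WMBZ). They reflect the higher, middle, and lower education levels respectively. The raw data are shown in the table below: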

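Share of population by education level, 1990 national census (%)
Region          No.  DXBZ  CZBZ   WMBZ
Beijing         1    9.3   30.55  8.7
Tianjin         2    4.67  29.38  8.92
Hebei           3    0.96  24.69  15.21
Shanxi          4    1.38  29.24  11.3
Inner Mongolia  5    1.48  25.47  15.39
Liaoning        6    2.6   32.32  8.81
Jilin           7    2.15  26.31  10.49
Heilongjiang    8    2.14  28.46  10.87
Shanghai        9    6.53  31.59  11.04
Jiangsu         10   1.47  26.43  17.23
Zhejiang        11   1.17  23.74  17.46
Anhui           12   0.88  19.97  24.43
Fujian          13   1.23  16.87  15.63
Jiangxi         14   0.99  18.84  16.22
Shandong        15   0.98  25.18  16.87
Henan           16   0.85  26.55  16.15
Hubei           17   1.57  23.16  15.79
Hunan           18   1.14  22.57  12.1
Guangdong       19   1.34  23.04  10.45
Guangxi         20   0.79  19.14  10.61
Hainan          21   1.24  22.53  13.97
Sichuan         22   0.96  21.65  16.24
Guizhou         23   0.78  14.65  24.27
Yunnan          24   0.81  13.85  25.44
Tibet           25   0.57  3.85   44.43
Shaanxi         26   1.67  24.36  17.62
Gansu           27   1.1   16.85  27.93
Qinghai         28   1.49  17.76  27.7
Ningxia         29   1.61  20.27  22.06
Xinjiang        30   1.85  20.66  12.75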

Convert the data portion of the table to a formatted string (columns separated by ",", rows separated by "|"):

9.3,30.55,8.7|4.67,29.38,8.92|0.96,24.69,15.21|1.38,29.24,11.3|1.48,25.47,15.39|2.6,32.32,8.81|2.15,26.31,10.49|2.14,28.46,10.87|6.53,31.59,11.04|1.47,26.43,17.23|1.17,23.74,17.46|0.88,19.97,24.43|1.23,16.87,15.63|0.99,18.84,16.22|0.98,25.18,16.87|0.85,26.55,16.15|1.57,23.16,15.79|1.14,22.57,12.1|1.34,23.04,10.45|0.79,19.14,10.61|1.24,22.53,13.97|0.96,21.65,16.24|0.78,14.65,24.27|0.81,13.85,25.44|0.57,3.85,44.43|1.67,24.36,17.62|1.1,16.85,27.93|1.49,17.76,27.7|1.61,20.27,22.06|1.85,20.66,12.75

Code sample

var oTxt="9.3,30.55,8.7|4.67,29.38,8.92|0.96,24.69,15.21|1.38,29.24,11.3|1.48,25.47,15.39|2.6,32.32,8.81|2.15,26.31,10.49|2.14,28.46,10.87|6.53,31.59,11.04|1.47,26.43,17.23|1.17,23.74,17.46|0.88,19.97,24.43|1.23,16.87,15.63|0.99,18.84,16.22|0.98,25.18,16.87|0.85,26.55,16.15|1.57,23.16,15.79|1.14,22.57,12.1|1.34,23.04,10.45|0.79,19.14,10.61|1.24,22.53,13.97|0.96,21.65,16.24|0.78,14.65,24.27|0.81,13.85,25.44|0.57,3.85,44.43|1.67,24.36,17.62|1.1,16.85,27.93|1.49,17.76,27.7|1.61,20.27,22.06|1.85,20.66,12.75";
var oArrs=webTJ.getArrs(oTxt,"|",",");
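oArrs=webTJ.Array.getQuantify(oArrs);         // convert the sample values to numbers
//oArrs=webTJ.Datamining.getYZarrs(oArrs,1);  // standardize the data using the mean and standard deviation
webTJ.Datamining.setKmeans(oArrs,2);          // cluster the samples into 2 groups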

Note: in webTJ.Datamining.setKmeans(oArrs,2), the 2 can be changed to 3, 4, or 5 to observe how the between-group error ratio changes.
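The post does not define the between-group error ratio or show what `setKmeans` reports. One common reading, taken here purely as an assumption, is the between-cluster sum of squares divided by the total sum of squares; it lies between 0 and 1 and tends to rise as k grows. A minimal sketch with a hypothetical helper name, `betweenToTotalRatio`:

```javascript
// Share of the total sum of squares explained by a clustering (between / total).
function betweenToTotalRatio(points, labels) {
  const mean = rows =>
    rows[0].map((_, j) => rows.reduce((s, p) => s + p[j], 0) / rows.length);
  const sse = (rows, center) =>
    rows.reduce((t, p) => t + p.reduce((s, v, j) => s + (v - center[j]) ** 2, 0), 0);

  const total = sse(points, mean(points)); // spread around the overall mean
  let within = 0;                          // spread around each cluster's own mean
  for (const label of new Set(labels)) {
    const members = points.filter((_, i) => labels[i] === label);
    within += sse(members, mean(members));
  }
  return (total - within) / total;
}

// The five-sample example from section II with its final clusters {S1,S5} and {S2,S3,S4}.
const ratio = betweenToTotalRatio(
  [[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], [0, 1, 1, 1, 0]);
console.log(ratio.toFixed(3)); // 0.161
```

A low ratio like this one indicates that the two clusters explain little of the overall spread, which is the kind of signal the note suggests watching while varying k.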

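Case 2: Cluster analysis of 20 countries and regions by the state of their information infrastructure

Twenty countries and regions of different types were selected for a Q-type cluster analysis: developed countries, newly industrialized economies, Latin American countries, developing Asian countries, and transition economies. Six variables describe the information infrastructure:

I. call: telephone lines per thousand people,
II. movecall: cellular mobile phones per thousand households,
III. fee: cost of a three-minute international call at peak time,
IV. computer: computers per thousand people,
V. mips: computing power per thousand people (millions of instructions per second),
VI. net: Internet hosts per thousand people.

The data are taken from the World Competitiveness Report 1997.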

Information infrastructure of the 20 countries and regions
ID  country         call   movecall  fee   computer  mips   net
1   United States   631.6  161.9     0.36  403       26073  35.34
2   Japan           498.4  143.2     3.57  176       10223  6.26
3   Germany         557.6  70.6      2.18  199       11571  9.48
4   Sweden          684.1  281.8     1.4   286       16660  29.39
5   Switzerland     644    93.5      1.98  234       13621  22.68
6   Denmark         620.3  248.6     2.56  296       17210  21.84
7   Singapore       498.4  147.5     2.5   284       13578  13.49
8   Taiwan, China   469.4  56.1      3.68  119       6911   1.72
9   South Korea     434.5  73        3.36  99        5795   1.68
10  Brazil          81.9   16.3      3.02  19        876    0.52
11  Chile           138.6  8.2       1.4   31        1411   1.28
12  Mexico          92.2   9.8       2.61  31        1751   0.35
13  Russia          174.9  5         5.12  24        1101   0.48
14  Poland          169    6.5       3.68  40        1796   1.45
15  Hungary         262.2  49.4      2.66  68        3067   3.09
16  Malaysia        195.5  88.4      4.19  53        2734   1.25
17  Thailand        78.6   27.8      4.95  22        1662   0.11
18  India           13.6   0.3       6.28  2         101    0.01
19  France          559.1  42.9      1.27  201       11702  4.76
20  United Kingdom  521.1  122.5     0.98  248       14461  11.91

Code sample

var oTxt="631.6,161.9,0.36,403,26073,35.34|498.4,143.2,3.57,176,10223,6.26|557.6,70.6,2.18,199,11571,9.48|684.1,281.8,1.4,286,16660,29.39|644,93.5,1.98,234,13621,22.68|620.3,248.6,2.56,296,17210,21.84|498.4,147.5,2.5,284,13578,13.49|469.4,56.1,3.68,119,6911,1.72|434.5,73,3.36,99,5795,1.68|81.9,16.3,3.02,19,876,0.52|138.6,8.2,1.4,31,1411,1.28|92.2,9.8,2.61,31,1751,0.35|174.9,5,5.12,24,1101,0.48|169,6.5,3.68,40,1796,1.45|262.2,49.4,2.66,68,3067,3.09|195.5,88.4,4.19,53,2734,1.25|78.6,27.8,4.95,22,1662,0.11|13.6,0.3,6.28,2,101,0.01|559.1,42.9,1.27,201,11702,4.76|521.1,122.5,0.98,248,14461,11.91";
var oArrs=webTJ.getArrs(oTxt,"|",",");
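oArrs=webTJ.Array.getQuantify(oArrs);        // convert the sample values to numbers
//oArrs=webTJ.Datamining.getYZarrs(oArrs,1); // standardize the data using the mean and standard deviation
webTJ.Datamining.setKmeans(oArrs,3);         // cluster the samples into 3 groups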
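In this data set mips ranges from 101 to 26073 while fee ranges from 0.36 to 6.28, so Euclidean distances on the raw values are dominated by mips. That is presumably why the commented-out `getYZarrs` call is provided. Its internals are not shown in the post; the sketch below is an assumed plain-JavaScript equivalent (column-wise z-scores using the sample standard deviation with divisor n-1), and `standardizeColumns` is a hypothetical name.

```javascript
// Column-wise z-score: subtract the column mean, divide by the column standard deviation.
// Requires at least two rows, because the sample variance divides by n - 1.
function standardizeColumns(rows) {
  const n = rows.length;
  const means = [], deviations = [];
  for (let j = 0; j < rows[0].length; j++) {
    const mean = rows.reduce((s, r) => s + r[j], 0) / n;
    const variance = rows.reduce((s, r) => s + (r[j] - mean) ** 2, 0) / (n - 1);
    means.push(mean);
    deviations.push(Math.sqrt(variance));
  }
  // A constant column has zero deviation; map it to 0 instead of dividing by zero.
  return rows.map(r =>
    r.map((v, j) => (deviations[j] === 0 ? 0 : (v - means[j]) / deviations[j])));
}

const scaled = standardizeColumns([[1, 10], [2, 10], [3, 10]]);
console.log(scaled); // first column becomes -1, 0, 1; the constant second column becomes 0, 0, 0
```

After standardization every variable contributes on a comparable scale, so the resulting clusters may differ from those obtained on the raw values.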