Task: using the supplied driver-behaviour data (trip.csv), cluster each driver's trips from different time periods into three driving styles: normal, aggressive, and ultra-calm.
We practise applying the KMeans algorithm from Python's scikit-learn package for the clustering,
use scikit-learn's PCA algorithm to reduce the data to two dimensions, and then plot the clustering result.
Finally, by adjusting the clustering algorithm's parameters we observe how the clustering changes and practise parameter tuning (a sketch of that step is given at the end of this post).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第5次实训\\trip.csv")
df.head()
| | driver | trip | v_avg | a_avg | r_avg | v_var | a_var | r_var | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41030402427 | 1 | 6 | 0.218219 | 1209.078947 | 33.465922 | 0.154504 | 242766.4531 | 0.564121 | 0.224947 | 0.163280 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 |
| 1 | 41030402427 | 2 | 3 | 0.305416 | 1064.181818 | 24.574448 | 0.283866 | 185456.3409 | 0.575369 | 0.291626 | 0.133005 | 0.000000 | 0.577340 | 0.210837 | 0.211823 | 0.577340 | 0.365517 | 0.057143 |
| 2 | 41030402427 | 3 | 5 | 0.121377 | 1168.500000 | 24.310541 | 0.012078 | 224469.1400 | 0.574566 | 0.269364 | 0.156069 | 0.000000 | 0.531792 | 0.393064 | 0.075145 | 0.567630 | 0.354913 | 0.077457 |
| 3 | 41030402427 | 4 | 7 | 0.185244 | 1175.392593 | 41.511023 | 0.323999 | 260512.1507 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 |
| 4 | 41030402427 | 5 | 9 | 0.255851 | 1311.179487 | 53.369580 | 0.440556 | 309291.7347 | 0.397380 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.323690 | 0.607260 | 0.069050 |
(Note: the driver and trip (trip number) columns do not take part in the clustering.)
df=df.iloc[:,2:]
df.head()
| | v_avg | a_avg | r_avg | v_var | a_var | r_var | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 0.218219 | 1209.078947 | 33.465922 | 0.154504 | 242766.4531 | 0.564121 | 0.224947 | 0.163280 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 |
| 1 | 3 | 0.305416 | 1064.181818 | 24.574448 | 0.283866 | 185456.3409 | 0.575369 | 0.291626 | 0.133005 | 0.000000 | 0.577340 | 0.210837 | 0.211823 | 0.577340 | 0.365517 | 0.057143 |
| 2 | 5 | 0.121377 | 1168.500000 | 24.310541 | 0.012078 | 224469.1400 | 0.574566 | 0.269364 | 0.156069 | 0.000000 | 0.531792 | 0.393064 | 0.075145 | 0.567630 | 0.354913 | 0.077457 |
| 3 | 7 | 0.185244 | 1175.392593 | 41.511023 | 0.323999 | 260512.1507 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 |
| 4 | 9 | 0.255851 | 1311.179487 | 53.369580 | 0.440556 | 309291.7347 | 0.397380 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.323690 | 0.607260 | 0.069050 |
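With the identifier columns dropped, a quick sanity check of the clustering input is worthwhile. This is a minimal sketch of my own (not part of the original notebook), assuming df is the 16-column frame shown above:

```python
# Quick sanity check on the clustering input
print(df.shape)                          # number of trips x number of features
print(df.isna().sum().sum())             # total number of missing values (should be 0)
print(df.describe().T[['mean', 'std']])  # per-feature scale: note how different the magnitudes are
```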
1. The clustering algorithm
Build and train the model with all parameters left at their defaults.
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
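With everything at its default, KMeans uses plain Euclidean distance on the raw features, so the largest-scale columns dominate the clustering (r_var is on the order of 10^5 and r_avg in the thousands, while most of the other features lie between 0 and 1). A standardized variant is worth comparing against; the following is a sketch of my own, not part of the original exercise:

```python
from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance so that no single
# large-scale column dominates the Euclidean distance used by KMeans.
scaled = StandardScaler().fit_transform(df)
kmeans_scaled = KMeans(n_clusters=3).fit(scaled)
print(pd.Series(kmeans_scaled.labels_).value_counts())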
(1) Count how many trips fall into each cluster
label_count= pd.Series(kmeans.labels_).value_counts()
label_count
0 54
1 33
2 4
dtype: int64
(2) Find the cluster centroids
centroids =pd.DataFrame(kmeans.cluster_centers_)
centroids
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.537037 | 0.277437 | 1168.842229 | 25.205092 | 0.314492 | 169896.171576 | 0.537669 | 0.248422 | 0.175849 | 0.038060 | 0.565810 | 0.286437 | 0.147753 | 0.510406 | 0.447028 | 0.042566 |
| 1 | 6.030303 | 0.311931 | 1178.091475 | 36.235528 | 0.540281 | 302047.587585 | 0.530479 | 0.200719 | 0.204316 | 0.064485 | 0.531772 | 0.296036 | 0.172192 | 0.517129 | 0.410236 | 0.072636 |
| 2 | 5.250000 | 1.073941 | 1068.375000 | 38.914793 | 1.753758 | 639407.213100 | 0.788223 | 0.059943 | 0.143262 | 0.008572 | 0.241813 | 0.536941 | 0.221247 | 0.803339 | 0.074829 | 0.121832 |
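The integer column headers above make the centroids hard to read back against the original features; a small sketch that reattaches the feature names (same numbers, only relabelled) makes the comparison easier:

```python
# Attach the original feature names to the centroid table for readability
centroids_named = pd.DataFrame(kmeans.cluster_centers_, columns=df.columns)
centroids_named.round(3)
```

In the run shown here, for example, the smallest cluster (4 trips) has by far the largest a_avg and a_var, which makes it the natural candidate for the "aggressive" driving type.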
(3) Append each row's cluster label (in a column named jllable) to the clustering dataset to form a new DataFrame named new_df, and save it locally as new_df.csv.
labels =pd.DataFrame({"jllable": kmeans.labels_})
new_df=pd.concat([df,labels],axis=1)
new_df.head()
| | v_avg | a_avg | r_avg | v_var | a_var | r_var | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c | jllable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 0.218219 | 1209.078947 | 33.465922 | 0.154504 | 242766.4531 | 0.564121 | 0.224947 | 0.163280 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 | 1 |
| 1 | 3 | 0.305416 | 1064.181818 | 24.574448 | 0.283866 | 185456.3409 | 0.575369 | 0.291626 | 0.133005 | 0.000000 | 0.577340 | 0.210837 | 0.211823 | 0.577340 | 0.365517 | 0.057143 | 0 |
| 2 | 5 | 0.121377 | 1168.500000 | 24.310541 | 0.012078 | 224469.1400 | 0.574566 | 0.269364 | 0.156069 | 0.000000 | 0.531792 | 0.393064 | 0.075145 | 0.567630 | 0.354913 | 0.077457 | 0 |
| 3 | 7 | 0.185244 | 1175.392593 | 41.511023 | 0.323999 | 260512.1507 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 | 1 |
| 4 | 9 | 0.255851 | 1311.179487 | 53.369580 | 0.440556 | 309291.7347 | 0.397380 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.323690 | 0.607260 | 0.069050 | 1 |
new_df.to_csv("new_df.csv",index=False)
2. The PCA algorithm
(1) Reduce the clustering data to two dimensions and store the reduced data in a DataFrame named new_pca, then plot the reduced data together with the clustering result and save the figure.
pca = PCA(n_components=2)
new_pca = pd.DataFrame(pca.fit_transform(df))
new_pca.head()
| | 0 | 1 |
|---|---|---|
| 0 | 4309.394730 | 41.169810 |
| 1 | -53000.724767 | -101.095608 |
| 2 | -13987.920711 | 1.256170 |
| 3 | 22055.091027 | 6.848376 |
| 4 | 70834.682035 | 140.500470 |
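How much of the original variance this 2-D view preserves can be read off the fitted PCA object; because the features were not scaled, the large-variance columns are expected to dominate the first component:

```python
# Fraction of total variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```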
# Plot each cluster's points in the 2-D PCA space with its own colour/marker
plt.figure(figsize=(9, 8))
cluster0 = new_pca[new_df['jllable'] == 0]
plt.plot(cluster0[0], cluster0[1], 'rs')   # cluster 0: red squares
cluster1 = new_pca[new_df['jllable'] == 1]
plt.plot(cluster1[0], cluster1[1], 'go')   # cluster 1: green circles
cluster2 = new_pca[new_df['jllable'] == 2]
plt.plot(cluster2[0], cluster2[1], 'b*')   # cluster 2: blue stars
plt.legend(['cluster0', 'cluster1', 'cluster2'])
plt.savefig('kmeans.png')
plt.show()
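Finally, for the parameter-tuning part of the task, here is a minimal sketch of my own that sweeps n_clusters and compares the inertia and silhouette score of each fit; other parameters worth varying in the same way are n_init, max_iter and init:

```python
from sklearn.metrics import silhouette_score

# Refit KMeans for several cluster counts and record the within-cluster
# sum of squares (inertia_) and the silhouette score of each result.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df)
    sil = silhouette_score(df, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```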
Hi everyone, I'm [爱做梦的子浩](https://blog.csdn.net/weixin_43124279), a third-year rookie in the big-data experimental class at Northeastern University. I look up to and admire excellent people, have already received two summer-internship offers, and you are welcome to reach out to me to chat 😂😂😂
This is my blog: [子浩的博客](https://blog.csdn.net/weixin_43124279)