|NO.Z.00018|——————————|BigDataEnd|——|Arithmetic&Machine.v18|——|Machine: Unsupervised Learning Algorithms.v03|

I. Implementing K-Means with sklearn

### --- Implementing K-Means with sklearn

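class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
precompute_distances='auto', verbose=0, random_state=None,
copy_x=True,
n_jobs=None,
algorithm='auto')

~~~     Note: this is the signature from older scikit-learn releases.
~~~     Releases from 1.0 onward no longer accept precompute_distances or n_jobs.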

### --- Key parameter: n_clusters

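~~~     n_clusters is the k in KMeans: it tells the model how many clusters to form.
~~~     It is the parameter you will nearly always set yourself. It is not strictly required, since it defaults to 8,
~~~     but in many practical problems the appropriate number of clusters is smaller than 8.
~~~     Before clustering we usually do not know what n_clusters should be, so we have to explore it.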

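~~~     When we get a dataset, if possible,
~~~     we would like to plot it first and look at how the data is distributed, as a reference for choosing n_clusters.

~~~     First, let's create a dataset ourselves. Because we generate it, it comes with labels.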

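import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create our own dataset: 500 samples, 2 features, 4 centers
X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# Plot all points without using the labels
plt.scatter(X[:, 0], X[:, 1]
            ,marker='o' # marker shape
            ,s=8 # marker size
            )
plt.show()

# View the distribution, colored by the true label
color = ["red", "pink", "orange", "gray"]
for i in range(4):
    plt.scatter(X[y == i, 0], X[y == i, 1]
                ,marker='o' # marker shape
                ,s=8 # marker size
                ,c=color[i]
                )
plt.show()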

~~~     Based on this distribution, let's cluster with KMeans. First we have to guess: how many clusters does this data contain?

### --- Key attribute: cluster.labels_
~~~     The labels_ attribute holds the clustering result: the cluster index assigned to each sample.

from sklearn.cluster import KMeans
n_clusters = 3
cluster = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
y_pred = cluster.labels_
y_pred

array([0, 0, 2, 1, 2, 1, 2, 2, 2, 2, 0, 0, 2, 1, 2, 0, 2, 0, 1, 2, 2, 2,
2, 1, 2, 2, 1, 1, 2, 2, 0, 1, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 1, 2,
2, 0, 2, 2, 1, 1, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2,
0, 2, 2, 2, 0, 2, 2, 0, 2…])

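~~~     KMeans is unsupervised, so there are no target labels to predict: calling fit alone already produces the clustering result, stored in labels_.
~~~     KMeans also provides the predict and fit_predict interfaces:
~~~     predict assigns each sample in X to its nearest learned centroid; it does not learn anything, so it must be called after fit().
~~~     fit_predict fits the model and returns the labels in a single call, so no separate fit() is needed.
~~~     On the full dataset, cluster.fit(X).predict(X), cluster.fit_predict(X), and cluster.labels_ give the same result.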

fit_pre = KMeans(n_clusters=3, random_state=0).fit_predict(X)
(cluster.predict(X)==fit_pre).sum()
(fit_pre== cluster.labels_).sum()

~~~     # Output
500
500

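~~~     When do we need predict? When the dataset is too large!
~~~     With a very large dataset, to make training more efficient,
~~~     we can fit the centroids on a subset of the data and then call predict to assign the remaining samples to clusters.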

cluster_smallsub = KMeans(n_clusters=3, random_state=0).fit(X[:200])
sample_pred = cluster_smallsub.predict(X)
y_pred == sample_pred

array([False, False, True, False, True, False, True, True, True,
True, False, False, True, False, True, False, True, False,
False, True, True, True, True, False…])

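~~~     The result will generally not match a fit on the full data. Comparing label values directly can also overstate the difference,
~~~     because cluster indices are arbitrary: the same group may be numbered 0 in one model and 2 in another.
~~~     Still, when we do not need that much precision, or the dataset is simply too large, this approach is a reasonable option.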
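
~~~     Element-wise equality of label arrays depends on how each model happens to number its clusters.
~~~     A minimal sketch of a numbering-independent comparison, assuming adjusted_rand_score from sklearn.metrics is an acceptable agreement measure
~~~     (it is not used in the original walkthrough) and passing n_init=10 explicitly, since the default differs across scikit-learn versions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same synthetic data as in the walkthrough
X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# One model fitted on all samples, one fitted on the first 200 only
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
subset = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[:200])

full_labels = full.labels_
subset_labels = subset.predict(X)  # assign all 500 samples using the subset's centroids

# Fraction of identical label values: sensitive to cluster numbering
raw_match = (full_labels == subset_labels).mean()

# Adjusted Rand index: 1.0 means identical partitions, regardless of numbering
ari = adjusted_rand_score(full_labels, subset_labels)

print("raw label match:", raw_match)
print("adjusted Rand index:", ari)
```

~~~     A low raw match combined with a high adjusted Rand index would suggest that the two models found similar groups but numbered them differently.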

### --- Key attribute: cluster.cluster_centers_

~~~     # View the centroids
centroid = cluster.cluster_centers_
centroid
centroid.shape
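
~~~     To make the role of the centroids concrete, the sketch below recomputes the cluster assignment by hand:
~~~     each sample goes to the centroid with the smallest squared Euclidean distance, which is what predict is expected to return.
~~~     The names dists and nearest are illustrative, and n_init=10 is passed explicitly as an assumption about the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centroid = cluster.cluster_centers_
print(centroid.shape)  # (3, 2): one row per cluster, one column per feature

# Squared distance from every sample to every centroid, shape (500, 3)
dists = ((X[:, None, :] - centroid[None, :, :]) ** 2).sum(axis=2)

# Index of the closest centroid for each sample
nearest = dists.argmin(axis=1)

# Fraction of samples where the manual assignment agrees with predict
print((nearest == cluster.predict(X)).mean())
```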

### --- Key attribute: cluster.inertia_

~~~     # View the total within-cluster sum of squared distances
inertia = cluster.inertia_
inertia
~~~     # Output
1903.4503741659223
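
~~~     Inertia is the sum, over all samples, of the squared distance to the centroid of the assigned cluster.
~~~     A minimal sketch that recomputes it from labels_ and cluster_centers_ and compares it with inertia_
~~~     (manual_inertia is an illustrative name; n_init=10 is passed explicitly as an assumption about the scikit-learn version):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid of the cluster each sample was assigned to, shape (500, 2)
assigned_centers = cluster.cluster_centers_[cluster.labels_]

# Sum of squared distances between each sample and its assigned centroid
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia)
print(cluster.inertia_)
```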

~~~     # What happens to inertia if we change our guessed number of clusters to 4?
n_clusters = 4
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_

~~~     # Output
908.3855684760613

n_clusters = 5
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_

~~~     # Output
811.0841324482415

n_clusters = 6
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_

~~~     # Output
733.153835008308
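
~~~     The separate runs for different values of n_clusters can be wrapped in a loop.
~~~     A minimal sketch, assuming a range of 1 to 8 clusters chosen only for illustration and n_init=10 passed explicitly
~~~     (the default differs across scikit-learn versions, so the printed values may not match the outputs shown in this post exactly):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# Map each candidate number of clusters to the inertia of the fitted model
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

~~~     Inertia keeps shrinking as n_clusters grows, so the smallest inertia is not by itself a reason to choose a given k.
~~~     In the outputs recorded in this post, the drop from 3 to 4 clusters is large and the later drops are much smaller, which is the kind of bend to look for.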
