Hierarchical clustering (reposted from http://blog.sina.com.cn/s/blog_62f3c4ef01014uhe.html)
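MATLAB provides a family of functions for cluster analysis. The available approaches fall into three groups:
Method 1: direct clustering. The clusterdata function clusters the sample data in a single call. The user does not need to understand the principles or the process of clustering, but has little to choose from: in the simple form clusterdata(X,cutoff) the distance measure and the linkage method are fixed, so the quality of the result is limited.
Method 2: hierarchical clustering. This approach is more flexible but requires understanding the details of how clustering works. It consists of the following steps: (1) measure the similarity or dissimilarity between every pair of objects in the data set by computing their distances with pdist; (2) link the objects into a cluster tree with linkage; (3) evaluate the clustering with cophenet; (4) create the clusters with cluster.
Method 3: partitioning clustering, which includes K-means and K-medoids clustering. It also involves a series of steps and requires a reasonably clear understanding of the principles and process of clustering.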
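As a rough illustration of Method 3, the sketch below runs K-means on made-up data. It assumes the Statistics and Machine Learning Toolbox function kmeans is available; the data, the random seed, and the choice of two clusters are assumptions made only for this example.

```matlab
% Hypothetical data: two well-separated groups of 20 points each.
rng(4);                                   % fix the seed for repeatability
X = [randn(20,2); [randn(20,1)+8, randn(20,1)]];

% Partition the points into k = 2 clusters. 'Replicates' reruns the
% algorithm from several random starts and keeps the best solution.
[idx, C] = kmeans(X, 2, 'Replicates', 5);

disp(size(C));                            % one centroid per row
disp(accumarray(idx, 1)');                % number of points in each cluster
```

Unlike the hierarchical approach, the number of clusters has to be chosen before the call.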
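The relevant MATLAB functions and clustering methods are described next.
1. Relevant MATLAB functions
1.1 The pdist function
Syntax: Y=pdist(X,'metric')
Description: computes the distances between the objects in the data matrix X using the method specified by 'metric'.
X: an m-by-n matrix representing a data set of m objects, each described by n feature values.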
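'metric' can take the following values:
'euclidean': Euclidean distance (default);
'seuclidean': standardized Euclidean distance;
'mahalanobis': Mahalanobis distance;
'cityblock': city block distance;
'minkowski': Minkowski distance;
'cosine': one minus the cosine of the angle between two objects treated as vectors;
'correlation': one minus the sample correlation between two objects;
'jaccard': one minus the Jaccard coefficient;
'chebychev': Chebychev distance (maximum coordinate difference).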
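1.2 The squareform function
Syntax: Z=squareform(Y,..)
For a data set X of M points, the output Y of pdist is a row vector with M*(M-1)/2 elements.
This form saves memory, but it is hard to read and awkward to index when specific distances are needed. squareform converts Y into a square matrix whose (i,j) entry is the distance between the i-th and j-th points of X; the matrix is symmetric with zeros on the diagonal.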
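The relation between the pdist output and its square form can be checked on a small example. The four points below are made up for illustration.

```matlab
% Hypothetical toy data: 4 points in the plane, one row per object.
X = [0 0; 3 4; 6 8; 0 1];

Y = pdist(X);        % row vector of 4*(4-1)/2 = 6 pairwise distances
D = squareform(Y);   % 4-by-4 symmetric matrix with a zero diagonal

disp(numel(Y));      % 6
disp(D(1,2));        % distance between points 1 and 2: sqrt(3^2+4^2) = 5
disp(isequal(D, D'));% 1, the matrix is symmetric
```

Calling squareform on D converts it back to the vector form, so the two representations are interchangeable.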
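1.3 The linkage function
Syntax: Z=linkage(Y,'method')
Input: Y is the row vector of M*(M-1)/2 elements returned by pdist. The hierarchical cluster tree is built with the algorithm specified by 'method'.
method can take the following values:
'single': shortest distance (default);
'complete': farthest distance;
'average': unweighted average distance;
'weighted': weighted average distance;
'centroid': centroid distance;
'median': weighted centroid distance;
'ward': inner squared distance (minimum variance algorithm).
Output: Z is an (M-1)-by-3 matrix describing the cluster tree. The first two columns hold the indices of the two objects or clusters merged at each step, and the third column holds the distance between them. The M original samples keep the indices 1 to M, and each newly formed cluster is labelled M+1, M+2, … in order of creation.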
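To make the layout of Z concrete, here is a small sketch on five made-up points on a line, chosen so that every merge distance is distinct.

```matlab
% Hypothetical data: five points on a line (M = 5).
X = [0; 1; 5; 7; 20];

Y = pdist(X);
Z = linkage(Y, 'single');   % (M-1)-by-3 = 4-by-3 matrix

disp(Z);
% Third column (merge heights) under single linkage: 1, 2, 4, 13.
%   height 1 : points 1 and 2 merge into new cluster 6
%   height 2 : points 3 and 4 merge into new cluster 7
%   height 4 : clusters 6 and 7 merge into new cluster 8
%   height 13: point 5 joins cluster 8
```

Indices above M in the first two columns refer to clusters created in earlier rows, which is how the whole tree is encoded in a single matrix.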
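1.4 The dendrogram function
Syntax: [H,T,…]=dendrogram(Z,p,…)
Description: draws a dendrogram (tree diagram) with no more than p leaf nodes.
The matrix Z can be displayed more intuitively as a cluster tree by calling dendrogram(Z).
The tree is drawn from inverted-U-shaped links: the samples are at the bottom, clusters are merged level by level going up, and everything ends in a single cluster at the top. The height of each link on the vertical axis is the distance in the third column of Z.
The number of leaf nodes shown at the bottom of the tree defaults to 30 and can be changed through the parameter n in dendrogram(Z,n), with 1<n<M. dendrogram(Z,0) corresponds to n=M and shows all leaf nodes.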
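The effect of the leaf-count parameter can be seen by drawing the same tree twice. The random data and the choice of average linkage below are assumptions for illustration.

```matlab
% Hypothetical data: 50 random 2-D points, more than the default 30 leaves.
rng(0);                            % fix the seed so the figure is repeatable
X = rand(50, 2);
Z = linkage(pdist(X), 'average');

figure;
dendrogram(Z);                     % default: at most 30 leaf nodes are drawn
title('Default (at most 30 leaves)');

figure;
dendrogram(Z, 0);                  % 0 means: show all 50 original points
title('All leaf nodes');
```

In the first figure some leaves stand for several original points that were collapsed together; the second figure shows every point individually.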
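1.5 The cophenet function
Syntax: c=cophenet(Z,Y)
Description: computes the cophenetic correlation coefficient from the Y produced by pdist and the Z produced by linkage.
cophenet checks how faithfully the binary cluster tree produced by a given algorithm represents the data: it measures the correlation between the distances at which objects are joined in the tree and the actual distances computed by pdist. The closer the value is to 1, the better the tree reflects the data. In addition, inconsistent can be used to quantify how much the nodes at a given level of the tree differ.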
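One common use of this coefficient is to compare linkage methods on the same distances. The sketch below does so on made-up data; the data and the three methods tried are assumptions for illustration, and no particular method is guaranteed to win.

```matlab
% Hypothetical data: two groups of 10 points each.
rng(1);
X = [randn(10,2); randn(10,2) + 5];
Y = pdist(X);

linkMethods = {'single', 'complete', 'average'};
c = zeros(1, numel(linkMethods));
for k = 1:numel(linkMethods)
    Z = linkage(Y, linkMethods{k});
    c(k) = cophenet(Z, Y);         % cophenetic correlation for this tree
    fprintf('%-8s  c = %.4f\n', linkMethods{k}, c(k));
end
```

The method with the largest c produces the tree whose merge heights agree best with the original pairwise distances.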
1.6 The cluster function
Syntax: T=cluster(Z,…)
Description: creates clusters from the output Z of the linkage function.
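Two common ways of cutting the tree are sketched below on five made-up points: by a maximum number of clusters, and by a distance threshold. The data and the threshold value are assumptions for illustration.

```matlab
% Hypothetical data: five points on a line.
X = [0; 1; 5; 7; 20];
Z = linkage(pdist(X), 'single');   % merge heights: 1, 2, 4, 13

% Cut so that at most 2 clusters remain: {1,2,3,4} and {5}.
T1 = cluster(Z, 'maxclust', 2);

% Cut at height 3 using the merge distance as the criterion:
% only the merges at heights 1 and 2 survive, giving {1,2}, {3,4}, {5}.
T2 = cluster(Z, 'cutoff', 3, 'criterion', 'distance');

disp([T1 T2]);   % cluster numbers are labels only; their order is arbitrary
```

Without the 'criterion','distance' option, the 'cutoff' value is compared against the inconsistency coefficient rather than the merge height.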
1.7 The clusterdata function
Syntax: T=clusterdata(X,…)
Description: creates clusters directly from the data. The MATLAB help text reads:
CLUSTERDATA Construct clusters from data.
T = CLUSTERDATA(X, CUTOFF) constructs clusters from data X.
X is a matrix of size M by N, treated as M observations of N
variables. CUTOFF is a threshold for cutting the hierarchical
tree generated by LINKAGE into clusters. When 0 < CUTOFF < 2,
clusters are formed when inconsistent values are greater than
CUTOFF (see INCONSISTENT). When CUTOFF is an integer and CUTOFF >= 2,
then CUTOFF is considered as the maximum number of clusters to
keep in the hierarchical tree generated by LINKAGE. The output T is
a vector of size M containing a cluster number for each observation.
When 0 < CUTOFF < 2, T = CLUSTERDATA(X,CUTOFF) is equivalent to:
Y = pdist(X, 'euclid');
Z = linkage(Y, 'single');
T = cluster(Z, 'cutoff', CUTOFF);
When CUTOFF is an integer >= 2, T = CLUSTERDATA(X,CUTOFF) is equivalent
to:
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'maxclust',CUTOFF)
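The equivalence stated in the help text can be checked on made-up data. The sketch assumes the defaults described above (Euclidean distance, single linkage); the data and the seed are assumptions for illustration.

```matlab
% Hypothetical data: two groups of 10 points each.
rng(2);
X = [randn(10,2); randn(10,2) + 8];

% One-call version: an integer cutoff >= 2 is read as a maximum cluster count.
T1 = clusterdata(X, 2);

% Step-by-step version described in the help text.
Y  = pdist(X, 'euclidean');
Z  = linkage(Y, 'single');
T2 = cluster(Z, 'maxclust', 2);

% According to the help text the two label vectors should agree.
disp(isequal(T1, T2));
```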
1.8 The inconsistent function
INCONSISTENT Inconsistent values of a cluster tree.
Y = INCONSISTENT(Z) computes the inconsistent value of each non-leaf
node in the hierarchical cluster tree Z. Z is a (M-1)-by-3 matrix
generated by the function LINKAGE. Each inconsistent value is a
measure of separation between the two clusters whose merge is
represented by that node, compared to the separation between
subclusters merged within those clusters.
Y = INCONSISTENT(Z, DEPTH) computes inconsistent values by looking
to a depth DEPTH below each node.
Y is a (M-1)-by-4 matrix, with rows corresponding to each of the
non-leaf nodes represented in Z. INCONSISTENT computes the
inconsistent value for node (M+i) using S_i, the set of nodes less than
DEPTH branches below node (M+i), excluding any leaf nodes.
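That is, S_i contains every non-leaf node that lies fewer than DEPTH levels below node (M+i), including node (M+i) itself. The statistics below are computed over the heights (merge distances) of the nodes in S_i.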
Then
Y(i,1) = mean(Z(S_i,3)), the mean height of nodes in S_i
Y(i,2) = std(Z(S_i,3)), the standard deviation of node heights in S_i
Y(i,3) = length(S_i), the number of nodes in S_i
Y(i,4) = (Z(i,3) - Y(i,1))/Y(i,2), the inconsistent value
The default value for DEPTH is 2.
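The depth affects the resulting inconsistency coefficients. With a large depth, the change in the inconsistency coefficient reflects how far the sample introduced at the current step is from the centre of the cluster (which involves all samples in that cluster); with a small depth, it only reflects how far the new sample is from the centre of the samples involved in the last few merge steps.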
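The four columns returned by inconsistent can be checked against the formulas in the help text. The five points below are made up for illustration, and the default depth of 2 is used.

```matlab
% Hypothetical data: five points on a line.
X = [0; 1; 5; 7; 20];
Z = linkage(pdist(X), 'single');   % merge heights: 1, 2, 4, 13

I = inconsistent(Z);               % default DEPTH is 2
disp(I);                           % columns: mean, std, count, coefficient

% Recompute column 4 from columns 1 and 2 wherever the std is nonzero.
k = I(:,2) > 0;
recomputed = (Z(k,3) - I(k,1)) ./ I(k,2);
disp(max(abs(I(k,4) - recomputed)));   % close to 0

% A merge of two leaves has no links below it, so S_i contains only the
% node itself: the std is 0 and the coefficient is reported as 0.
disp(I(1,:));
```

Links with a large coefficient join clusters that are far apart relative to the merges beneath them, which is why the coefficient is a natural place to cut the tree.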
2. Designing MATLAB clustering programs
2.1 Method 1: one-step clustering (using clusterdata directly)
X=[11978 12.5 93.5 31908;…;57500 67.6 238.0 15900];
T=clusterdata(X,0.9)
Result:
ans =
1
1
This shows that the two observations belong to the same cluster.
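Another example:
x1=randn(10,1);        % three groups of x values centred at 0, 10 and 20
x2=randn(10,1)+10;
x3=randn(10,1)+20;
x=[x1;x2;x3];
y=randn(30,1);
T=clusterdata([x,y],3) % cluster the 30 points into at most 3 clusters
temp1=find(T==1);      % indices of the points in cluster 1
plot(x(temp1),y(temp1),'rd','markersize',10,'markerfacecolor','r')
hold on
temp1=find(T==2);      % cluster 2
plot(x(temp1),y(temp1),'yd','markersize',10,'markerfacecolor','y')
temp1=find(T==3);      % cluster 3
plot(x(temp1),y(temp1),'kd','markersize',10,'markerfacecolor','k')
legend('cluster 1','cluster 2','cluster 3')
The resulting figure shows the three clusters as red, yellow, and black diamonds.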
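2.2 Methods 2 and 3: step-by-step clustering
Step1
Compute the distance matrix with pdist; several distance measures are available. The data are sometimes standardized with zscore before the distances are computed.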
---------------------------------------------------------------------
ZSCORE Standardized z score.
Z = ZSCORE(X) returns a centered, scaled version of X, the same size as X.
For vector input X, Z is the vector of z-scores (X-MEAN(X)) ./ STD(X). For
matrix X, z-scores are computed using the mean and standard deviation
along each column of X. For higher-dimensional arrays, z-scores are
computed using the mean and standard deviation along the first
non-singleton dimension.
The columns of Z have sample mean zero and sample standard deviation one
(unless a column of X is constant, in which case that column of Z is
constant at 0).
---------------------------------------------------------------------
X2=zscore(X);
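Y2=pdist(X2); % compute the pairwise distances
Step2
Z2=linkage(Y2);
Step3
C2=cophenet(Z2,Y2); % 0.94698
Step4 Create the clusters and draw the dendrogram
T=cluster(Z2,'maxclust',6); % keep at most 6 clusters
dendrogram(Z2);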
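MATLAB provides functions such as cophenet and inconsistent for computing coefficients that describe a cluster tree. cophenet checks how well the binary cluster tree produced by a given algorithm matches the data, that is, how strongly the distances between elements in the tree correlate with the actual distances computed by pdist. inconsistent quantifies how much the nodes at a given level of the tree differ, and can serve as the cutting criterion for cluster.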
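The four steps can be put together as in the following sketch. The data, the seed, the average linkage, and the choice of three clusters are all assumptions for illustration, not values from the example above.

```matlab
% Hypothetical data: three groups of 15 points along the x axis.
rng(3);
X = [randn(15,2);
     [randn(15,1)+6,  randn(15,1)];
     [randn(15,1)+12, randn(15,1)]];

% Step 1: standardize each column, then compute pairwise distances.
X2 = zscore(X);
Y2 = pdist(X2);

% Step 2: build the cluster tree.
Z2 = linkage(Y2, 'average');

% Step 3: check how well the tree represents the distances.
C2 = cophenet(Z2, Y2);
fprintf('cophenetic correlation: %.4f\n', C2);

% Step 4: cut the tree into at most 3 clusters and draw the dendrogram.
T = cluster(Z2, 'maxclust', 3);
disp(accumarray(T, 1)');           % number of points in each cluster
dendrogram(Z2, 0);
```

If the cophenetic correlation is low, trying a different distance measure in pdist or a different method in linkage before cutting the tree is a reasonable next step.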