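The ill-conditioned covariance problem in Gaussian mixture clustering
I have recently been writing a Gaussian mixture clustering (Mixture of Gaussians) implementation in Java. To verify the results, I ran some data through both my own program and Matlab and compared the output. Here is a brief account of the debugging process.
I. Data preparation
I prepared two data sets. The first consists of random samples drawn in Matlab from two 2-D Gaussian distributions, taken directly from the official Matlab documentation: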
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X = [mvnrnd(mu1,sigma1,200);mvnrnd(mu2,sigma2,100)];
scatter(X(:,1),X(:,2),10,'ko')
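This generates 200 samples with mean mu1 and covariance matrix sigma1, plus 100 samples with mean mu2 and covariance matrix sigma2, and stacks them into X, so X holds 300 two-dimensional points.
The second data set is real water-quality sampling data. It consists mainly of water-quality indicators such as coliform count and total bacterial count, 11 dimensions in all, each normalized to the range -1 to 1 so that the indicators share a common scale.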
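The normalization step is not shown in the post. A minimal sketch of one common way to do it, assuming a linear min-max rescaling per column, could look like this in Java. The class name MinMaxScaler and the method scaleToUnitRange are hypothetical, and mapping a constant column to 0 is a choice made for this sketch only.

```java
final class MinMaxScaler {
    /** Linearly rescales each column so its minimum maps to -1 and its maximum to 1. */
    static double[][] scaleToUnitRange(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[][] scaled = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            // Find the observed range of column j.
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows; i++) {
                // Assumption of this sketch: a constant column is mapped to 0
                // instead of dividing by zero.
                scaled[i][j] = range == 0.0 ? 0.0 : 2.0 * (data[i][j] - min) / range - 1.0;
            }
        }
        return scaled;
    }
}
```

A column that is constant has zero variance, so it is worth checking for such columns before clustering: on its own it already makes a full covariance matrix singular.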
II. Training the model
Next I fitted a two-component Gaussian mixture model in Matlab:
options = statset('Display','final');
gm = gmdistribution.fit(X,2,'Options',options);
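Matlab converged after 33 iterations and estimated the following means and covariance matrices:
mu1[1.07579392892986 2.04257931151116]
mu2[-0.829222490608191 -1.84815127578882]
sigma1[3.6619 0.1834;0.1834 1.5133]
sigma2[1.6663 0.1349;0.1349 0.9793]
My own code produced the following (after plenty of errors along the way, and two afternoons of debugging before the results looked reasonable):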
mu1[1.0820056998422323, 2.0566580446915914]
mu2[-0.8192719366513878, -1.830152540943631]
sigma1[3.667380264809185, 0.1716386749846842;0.1716386749846842, 1.4882695982781156]
sigma2[1.6775332037685982, 0.14863643273576532;0.14863643273576532, 1.0047714450108307]
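The results are largely satisfactory. The test on the real data that followed was not: my own code ended with every value turned into NaN, while Matlab, being the more professional tool, reported an ill-conditioned covariance matrix at iteration 3: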
??? Error using ==> gmcluster at 181
Ill-conditioned covariance created at iteration 3.
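A hand-written implementation can avoid drifting silently into NaN by checking every covariance matrix after each M-step and stopping with a clear message. The sketch below is only an assumption about how such a guard could look, not a description of what Matlab does internally: the class name CovarianceCheck, the method isPositiveDefinite, and the tolerance parameter are hypothetical. It attempts a Cholesky factorization and reports failure as soon as a pivot is not safely positive.

```java
final class CovarianceCheck {
    /**
     * Attempts a Cholesky factorization of the symmetric matrix a.
     * Returns false as soon as a pivot is not greater than tol, which
     * happens for singular, indefinite, or NaN-contaminated matrices.
     */
    static boolean isPositiveDefinite(double[][] a, double tol) {
        int n = a.length;
        double[][] l = new double[n][n]; // lower-triangular factor
        for (int j = 0; j < n; j++) {
            double pivot = a[j][j];
            for (int k = 0; k < j; k++) {
                pivot -= l[j][k] * l[j][k];
            }
            // Written as !(pivot > tol) so that a NaN pivot is rejected too.
            if (!(pivot > tol)) {
                return false;
            }
            l[j][j] = Math.sqrt(pivot);
            for (int i = j + 1; i < n; i++) {
                double sum = a[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                l[i][j] = sum / l[j][j];
            }
        }
        return true;
    }
}
```

Calling this on each component's covariance matrix after the M-step, with a small tolerance such as 1e-10 (an illustrative value), lets the program report the iteration at which the matrix broke down instead of printing NaN at the end.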
Searching online, I found the following in the Matlab documentation center:
In some cases, gmdistribution may converge to a solution where one or more of the components has an ill-conditioned or singular covariance matrix.
The following issues may result in an ill-conditioned covariance matrix:
The number of dimension of your data is relatively high and there are not enough observations.
Some of the features (variables) of your data are highly correlated.
Some or all the features are discrete.
You tried to fit the data to too many components.
In general, you can avoid getting ill-conditioned covariance matrices by using one of the following precautions:
Pre-process your data to remove correlated features.
Set 'SharedCov' to true to use an equal covariance matrix for every component.
Set 'CovType' to 'diagonal'.
Use 'Regularize' to add a very small positive number to the diagonal of every covariance matrix.
Try another set of initial values.
In other cases gmdistribution may pass through an intermediate step where one or more of the components has an
ill-conditioned covariance matrix. Trying another set of initial values may avoid this issue without altering your data or model.
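In short, the covariance matrix may become ill-conditioned or singular (determinant 0, so no inverse exists, while evaluating the multivariate Gaussian density requires the inverse) in the following situations:
1. The data dimensionality is too high relative to the number of available observations.
2. The features are too highly correlated with one another.
3. Some or all of the features are discrete.
4. Too many clusters are being fitted.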
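To make the second cause concrete, here is a small sketch with made-up data (three points whose second feature is exactly twice the first). The class SingularCovarianceDemo and its methods are hypothetical, and the density function is written for the 2-D case only, using the closed-form 2x2 inverse. With perfectly correlated features the sample covariance has determinant 0, and evaluating the Gaussian density with it yields NaN, which is the kind of value that then spreads through every later EM step.

```java
final class SingularCovarianceDemo {
    /** Sample covariance matrix (dividing by n - 1) of the rows of x. */
    static double[][] covariance(double[][] x) {
        int n = x.length;
        int d = x[0].length;
        double[] mean = new double[d];
        for (double[] row : x) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < d; j++) {
            mean[j] /= n;
        }
        double[][] cov = new double[d][d];
        for (double[] row : x) {
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n - 1);
                }
            }
        }
        return cov;
    }

    /** Determinant of a 2x2 matrix. */
    static double det2(double[][] s) {
        return s[0][0] * s[1][1] - s[0][1] * s[1][0];
    }

    /** Density of a 2-D Gaussian with mean mu and covariance s, evaluated at x. */
    static double density2d(double[] x, double[] mu, double[][] s) {
        double det = det2(s);
        double dx = x[0] - mu[0];
        double dy = x[1] - mu[1];
        // Quadratic form (x - mu)^T * inverse(s) * (x - mu) via the 2x2 inverse formula.
        double quad = (s[1][1] * dx * dx - (s[0][1] + s[1][0]) * dx * dy + s[0][0] * dy * dy) / det;
        return Math.exp(-0.5 * quad) / (2.0 * Math.PI * Math.sqrt(det));
    }
}
```

For the data {{1, 2}, {2, 4}, {3, 6}} the covariance comes out as [1 2; 2 4], whose determinant is 0, and density2d returns NaN both for points on the line y = 2x and for points off it.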
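The suggested remedies are roughly:
1. Pre-process the data to remove highly correlated features (typically done with mutual information, the chi-squared test, or the Pearson product-moment correlation coefficient).
2. Let all clusters share a single covariance matrix.
3. Constrain the covariance matrices to be diagonal.
4. Add a very small positive number to the diagonal of each covariance matrix. (I tried this in my program. What it does in practice: after some iteration the covariance matrix had become all zeros, and adding a small positive number lets the process continue.)
I tried the last three methods. Method 4 does run to completion, but the final result is not very satisfactory. For now all I can do is drop some dimensions and try again, or reduce the dimensionality with PCA. I will write up a summary once those results are in.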
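For reference, the three covariance adjustments tried above can be sketched as small helpers applied to the covariance estimates after each M-step. This is an illustration under assumptions, not the code from my program and not Matlab's implementation: the class CovarianceFixes and its method names are hypothetical, and pooled assumes the weights are the mixing proportions and sum to 1.

```java
final class CovarianceFixes {
    /** Method 4: returns a copy of sigma with epsilon added to every diagonal entry. */
    static double[][] regularize(double[][] sigma, double epsilon) {
        double[][] out = copy(sigma);
        for (int i = 0; i < out.length; i++) {
            out[i][i] += epsilon;
        }
        return out;
    }

    /** Method 3: returns a copy of sigma with all off-diagonal entries set to zero. */
    static double[][] toDiagonal(double[][] sigma) {
        int n = sigma.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            out[i][i] = sigma[i][i];
        }
        return out;
    }

    /** Method 2: weighted average of the per-cluster covariances, shared by all clusters. */
    static double[][] pooled(double[][][] sigmas, double[] weights) {
        int n = sigmas[0].length;
        double[][] out = new double[n][n];
        for (int k = 0; k < sigmas.length; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    out[i][j] += weights[k] * sigmas[k][i][j];
                }
            }
        }
        return out;
    }

    private static double[][] copy(double[][] m) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            out[i] = m[i].clone();
        }
        return out;
    }
}
```

Note that a tiny epsilon makes a singular matrix invertible but can leave it very badly conditioned: for [1 2; 2 4] the determinant after regularization is only about 5 times epsilon. That may be part of why method 4 runs but gives unconvincing clusters, though I have not verified this on the water-quality data.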