1、高斯混合模型
sklearn.mixture是一个能够学习高斯混合模型、抽样高斯模型和从数据中估计模型的包。同样,也提供了帮助决定正确组件数量的方法。
一个高斯混合模型是一个概率模型,它假设所有的数据点是从有限未知参数的高斯分布的混合生成的。可以将混合模型当作泛化的k均值聚类,以融合关于数据协方差和潜在高斯中心的信息。
高斯混合
GaussianMixture对象实现了expection-maximization算法来拟合高斯混合模型。它也能够得到多元模型的置信椭圆,计算贝叶斯信息准则来确定数据中聚集类别的数量。GaussianMixture.fit方法从训练数据中学习一个高斯混合模型,GaussianMixture.predict能够分配给每个样本最大可能属于的高斯分布。
GaussianMixture提供了不同的选项来限制不同类别估计的方差,包括,spherical、diagonal、tied或full方差。
2、变分贝叶斯高斯混合
BayesianGaussianMixture对象实现了一系列考虑不同推断算法的高斯混合模型。
估计算法:变分推断
变分推断(Variational Inference)是最大期望的扩展,它最大化模型证据的下界,而不是数据似然。其背后的原理与最大期望方法相同。但是变分推断方法通过集成先验分布的信息添加正则项。这可以避免在最大期望中经常发生的奇异性,但会引入偏差到模型中。
BayesianGaussianMixture类提供了两类权重的先验:使用Dirichlet分布的有限混合模型和使用Dirichlet过程的无限混合模型。
3、ITK中的GMM、EM
使用 ITK中的GMM、EM进行分布式采样
1 #include "itkVector.h"
2 #include "itkListSample.h"
3 #include "itkGaussianMixtureModelComponent.h"
4 #include "itkExpectationMaximizationMixtureModelEstimator.h"
5 #include "itkNormalVariateGenerator.h"
6
7 int
8 main(int, char *[])
9 {
10 unsigned int numberOfClasses = 2;
11 using MeasurementVectorType = itk::Vector<double, 1>;
12 using SampleType = itk::Statistics::ListSample<MeasurementVectorType>;
13 SampleType::Pointer sample = SampleType::New();
14
15 using NormalGeneratorType = itk::Statistics::NormalVariateGenerator;
16 NormalGeneratorType::Pointer normalGenerator = NormalGeneratorType::New();
17
18 normalGenerator->Initialize(101);
19
20 MeasurementVectorType mv;
21 double mean = 100;
22 double standardDeviation = 30;
23 for (unsigned int i = 0; i < 10; ++i)
24 {
25 mv[0] = (normalGenerator->GetVariate() * standardDeviation) + mean;
26 std::cout << "m[" << i << "] = " << mv[0] << std::endl;
27 sample->PushBack(mv);
28 }
29
30 normalGenerator->Initialize(3024);
31 mean = 200;
32 standardDeviation = 30;
33 for (unsigned int i = 0; i < 10; ++i)
34 {
35 mv[0] = (normalGenerator->GetVariate() * standardDeviation) + mean;
36 std::cout << "m[" << i << "] = " << mv[0] << std::endl;
37 sample->PushBack(mv);
38 }
39
40 using ParametersType = itk::Array<double>;
41 ParametersType params1(2);
42
43 std::vector<ParametersType> initialParameters(numberOfClasses);
44 params1[0] = 110.0;
45 params1[1] = 50.0;
46 initialParameters[0] = params1;
47
48 ParametersType params2(2);
49 params2[0] = 210.0;
50 params2[1] = 50.0;
51 initialParameters[1] = params2;
52
53 using ComponentType = itk::Statistics::GaussianMixtureModelComponent<SampleType>;
54
55 std::vector<ComponentType::Pointer> components;
56 for (unsigned int i = 0; i < numberOfClasses; i++)
57 {
58 components.push_back(ComponentType::New());
59 components[i]->SetSample(sample);
60 components[i]->SetParameters(initialParameters[i]);
61 }
62
63 using EstimatorType = itk::Statistics::ExpectationMaximizationMixtureModelEstimator<SampleType>;
64 EstimatorType::Pointer estimator = EstimatorType::New();
65
66 estimator->SetSample(sample);
67 estimator->SetMaximumIteration(500);
68
69 itk::Array<double> initialProportions(numberOfClasses);
70 initialProportions[0] = 0.5;
71 initialProportions[1] = 0.5;
72
73 estimator->SetInitialProportions(initialProportions);
74
75 for (unsigned int i = 0; i < numberOfClasses; i++)
76 {
77 estimator->AddComponent((ComponentType::Superclass *)components[i].GetPointer());
78 }
79
80 estimator->Update();
81
82 for (unsigned int i = 0; i < numberOfClasses; i++)
83 {
84 std::cout << "Cluster[" << i << "]" << std::endl;
85 std::cout << " Parameters:" << std::endl;
86 std::cout << " " << components[i]->GetFullParameters() << std::endl;
87 std::cout << " Proportion: ";
88 std::cout << " " << estimator->GetProportions()[i] << std::endl;
89 }
90
91 return EXIT_SUCCESS;
92 }
运行输出结果:
前面20行分别是生成的2类,各10个样本,1类样本均值100,方差30;2类样本均值200,方差30.
21~24行是分类结果的1类参数;
25~28行是分类结果的2类参数;
占比之和为1;
参数应该是均值和方差,均值比较接近生成时的值 100和200,方差与原来的30差别巨大,还不知道怎么理解。
1 m[0] = 156.311
2 m[1] = 205.464
3 m[2] = 80.8426
4 m[3] = 136.952
5 m[4] = 86.6091
6 m[5] = 80.3185
7 m[6] = 107.911
8 m[7] = 63.1748
9 m[8] = 107.082
10 m[9] = 112.343
11 m[0] = 189.946
12 m[1] = 174.951
13 m[2] = 243.387
14 m[3] = 169.488
15 m[4] = 261.163
16 m[5] = 215.278
17 m[6] = 212.506
18 m[7] = 150.613
19 m[8] = 186.132
20 m[9] = 213.155
21 Cluster[0]
22 Parameters:
23 [91.04822175454494, 385.98395103056583]
24 Proportion: 0.325826
25 Cluster[1]
26 Parameters:
27 [189.88473393439773, 1626.5175226586712]
28 Proportion: 0.674174
后来在代码中添加了打印信息:
estimator->Print(std::cout);
打印输出:
1 ExpectationMaximizationMixtureModelEstimator (0000015D6BE586B0)
2 RTTI typeinfo: class itk::Statistics::ExpectationMaximizationMixtureModelEstimator<class itk::Statistics::ListSample<class itk::Vector<double,1> > >
3 Reference Count: 1
4 Modified Time: 74
5 Debug: Off
6 Object Name:
7 Observers:
8 none
9 Maximum Iteration: 100
10 Sample: 0000015D6BE2B790
11 Number Of Components: 2
12 Component Membership Function[0]: 0000015D6BE53FC0
13 Component Membership Function[1]: 0000015D6BE58540
14 Termination Code: itk::Statistics::ExpectationMaximizationMixtureModelEstimatorEnums::TERMINATION_CODE::NOT_CONVERGED
15 Initial Proportions: [0.5, 0.5]
16 Proportions: [0.3258255562923341, 0.6741744437076659]
17 Calculated Expectation: -15.0301
从打印结果来看,EM算法更新了74次参数,设置的最大迭代次数是100,因此迭代次数是够用的,最终收敛期望是 -15.03, 退出代码说明 TERMINATION_CODE::NOT_CONVERGED 未收敛。
这是官方示例程序,所以不知道说啥。
正常理解应该是收敛到0附近。