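K-means notes

I recently needed kmeans from OpenCV for work. There is not much material online about OpenCV's kmeans, and what exists is hard to follow, so I had to work it out myself. I took the official demo, printed its points matrix with cout, compared the output before and after each step, and ended up with a rough understanding. The code comes first, followed by a step-by-step analysis.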

 1 int main( int /*argc*/, char** /*argv*/ )
 2 {
 3     const int MAX_CLUSTERS = 5;
 4     Scalar colorTab[] =
 5     {
 6         Scalar(0, 0, 255),
 7         Scalar(0,255,0),
 8         Scalar(255,100,100),
 9         Scalar(255,0,255),
10         Scalar(0,255,255)
11     };
12 
13     Mat img(500, 500, CV_8UC3);
14     RNG rng(12345);
15 
16     for(;;)
17     {
18         int k, clusterCount = rng.uniform(2, MAX_CLUSTERS+1);
19         int i, sampleCount = rng.uniform(1, 1001);
20         Mat points(sampleCount, 1, CV_32FC2), labels;
21 
22         clusterCount = MIN(clusterCount, sampleCount);
23         Mat centers;
24 
25         /* generate random sample from multigaussian distribution */
26         for( k = 0; k < clusterCount; k++ )
27         {
28             Point center;
29             center.x = rng.uniform(0, img.cols);
30             center.y = rng.uniform(0, img.rows);
31             Mat pointChunk = points.rowRange(k*sampleCount/clusterCount,
32                                              k == clusterCount - 1 ? sampleCount :
33                                              (k+1)*sampleCount/clusterCount);
34             rng.fill(pointChunk, RNG::NORMAL, Scalar(center.x, center.y), Scalar(img.cols*0.05, img.rows*0.05));
35         }
36 
37         randShuffle(points, 1, &rng);
38 
39         kmeans(points, clusterCount, labels,
40             TermCriteria( TermCriteria::EPS+TermCriteria::COUNT, 10, 1.0),
41                3, KMEANS_PP_CENTERS, centers);
42 
43         img = Scalar::all(0);
44 
45         for( i = 0; i < sampleCount; i++ )
46         {
47             int clusterIdx = labels.at<int>(i);
48             Point ipt = points.at<Point2f>(i);
49             circle( img, ipt, 2, colorTab[clusterIdx], FILLED, LINE_AA );
50         }
51 
52         imshow("clusters", img);
53 
54         char key = (char)waitKey();
55         if( key == 27 || key == 'q' || key == 'Q' ) // 'ESC'
56             break;
57     }
58 
59     return 0;
60 }

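Line 20 creates a Mat instance named points. It has sampleCount rows (a random value from 1 to 1000) and a single column, and its type is CV_32FC2, i.e. two-channel 32-bit float, so each row holds one 2D point. You can print it with the statement below. To keep the output short, lower the upper bound on line 19, for example to 101.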

cout << "points = " << endl << points << endl;

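The output looks like this. In my run every channel value was 0. The Mat constructor does not initialize its data, so other values are possible at this point.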

points =
[0, 0;
 0, 0;
 0, 0;
 0, 0;]

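The for loop on lines 26-35 generates clusterCount random centers. For each one it takes the corresponding chunk of points (points is split into clusterCount consecutive chunks) and fills it with normally distributed random values around that center, with a standard deviation of 5% of the image size.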
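
The chunk boundaries are easy to get wrong, so here is a small sketch of the arithmetic behind the rowRange call on lines 31-33. The helper name chunkRange is mine and does not exist in OpenCV. It only mirrors the integer expressions from the demo, treating the range as half-open, as rowRange does.

```cpp
#include <utility>

// Half-open row range [begin, end) of chunk k, mirroring the two
// arguments passed to points.rowRange() on lines 31-33 of the demo.
// chunkRange is a hypothetical helper, not an OpenCV function.
std::pair<int, int> chunkRange(int k, int sampleCount, int clusterCount)
{
    int begin = k * sampleCount / clusterCount;
    // The last chunk always ends at sampleCount, so no row is left out.
    int end = (k == clusterCount - 1)
                  ? sampleCount
                  : (k + 1) * sampleCount / clusterCount;
    return std::make_pair(begin, end);
}
```

With sampleCount = 10 and clusterCount = 3, the chunks cover rows 0-2, 3-5 and 6-9, so the last chunk absorbs the remainder of the integer division.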

Line 37 shuffles the rows of points. To see what randShuffle does, print points with the statement below both before and after line 37.

cout << "points = " << endl << points << endl;

Line 39 calls kmeans to cluster the samples. The C++ prototype is:

double kmeans(InputArray data, int K, InputOutputArray bestLabels, TermCriteria criteria, int attempts, int flags, OutputArray centers=noArray() )

where:

data – Data for clustering. An array of N-Dimensional points with float coordinates is needed. Examples of this array can be:

       – Mat points(count, 2, CV_32F);
       – Mat points(count, 1, CV_32FC2);
       – Mat points(1, count, CV_32FC2);
       – std::vector<cv::Point2f> points(sampleCount);

The input is a set of N-dimensional points with float coordinates. N can also be 1.
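
As far as I can tell, the layouts listed above all describe the same thing: count samples with two float coordinates each, stored as x0, y0, x1, y1, and so on. The sketch below shows that interleaved layout in plain C++. Pt is a hypothetical stand-in for cv::Point2f and flatten is my own helper, neither comes from OpenCV.

```cpp
#include <vector>

// Hypothetical stand-in for cv::Point2f: two packed float coordinates.
struct Pt
{
    float x;
    float y;
};

// Flatten N points into a row-major N x 2 float buffer
// (x0, y0, x1, y1, ...), one row per sample.
std::vector<float> flatten(const std::vector<Pt>& pts)
{
    std::vector<float> out;
    out.reserve(pts.size() * 2);
    for (const Pt& p : pts)
    {
        out.push_back(p.x);
        out.push_back(p.y);
    }
    return out;
}
```

If your samples are in some other form, such as pixel colors or feature vectors, the idea is the same: one row per sample, one float per coordinate.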

K           – Number of clusters to split the set by.

The number of clusters has to be chosen by hand.

labels    – Input/output integer array that stores the cluster indices for every sample.

Stores the cluster index of each sample (this is the bestLabels parameter in the prototype above).
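
To make the meaning of a label concrete, here is a minimal sketch of the assignment step of textbook k-means: each sample gets the index of its nearest center. This is not OpenCV's implementation. Sample2f and assignLabels are names I made up for the illustration.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 2D sample type, standing in for cv::Point2f.
struct Sample2f
{
    float x;
    float y;
};

// For every point, return the index of the nearest center
// (squared Euclidean distance). Assumes centers is not empty.
std::vector<int> assignLabels(const std::vector<Sample2f>& points,
                              const std::vector<Sample2f>& centers)
{
    std::vector<int> labels(points.size(), -1);
    for (std::size_t i = 0; i < points.size(); ++i)
    {
        float best = std::numeric_limits<float>::max();
        for (std::size_t c = 0; c < centers.size(); ++c)
        {
            float dx = points[i].x - centers[c].x;
            float dy = points[i].y - centers[c].y;
            float d2 = dx * dx + dy * dy;
            if (d2 < best)
            {
                best = d2;
                labels[i] = static_cast<int>(c);
            }
        }
    }
    return labels;
}
```

In the demo, the same kind of index is read back with labels.at<int>(i) on line 47 and used to pick a color from colorTab.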

criteria  – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster centers moves by less than criteria.epsilon on some iteration, the algorithm stops.

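The termination criteria. In the demo this is at most 10 iterations, or an accuracy (epsilon) of 1.0, whichever is reached first.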
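
Read literally, the description above amounts to the check sketched below when both COUNT and EPS are set. shouldStop and its parameters are hypothetical names of mine, and OpenCV's internal bookkeeping may differ in detail.

```cpp
// Sketch of a combined COUNT + EPS stop test, following the wording
// of the criteria description. Not OpenCV code.
//   iterationsDone - iterations completed so far
//   maxCenterShift - largest distance any center moved in the last iteration
//   maxCount       - iteration limit (10 in the demo)
//   epsilon        - required accuracy (1.0 in the demo)
bool shouldStop(int iterationsDone, double maxCenterShift,
                int maxCount, double epsilon)
{
    return iterationsDone >= maxCount || maxCenterShift < epsilon;
}
```

With the demo's values, a run stops after 10 iterations at the latest, and earlier if every center moved by less than 1.0 in some iteration.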

attempts – Flag to specify the number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness (see the last function parameter).

The number of times the algorithm is run. It works together with flags, which decides how each run is initialized.
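
The "best compactness" mentioned above is the value kmeans returns: the sum of squared distances from each sample to the center of its cluster. The sketch below computes it for 2D points and picks the attempt with the smallest value. Point2d, compactness and bestAttempt are hypothetical names for illustration, not OpenCV API.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical 2D point type used only in this sketch.
struct Point2d
{
    double x;
    double y;
};

// Sum of squared distances from each point to its assigned center.
// Assumes labels.size() == points.size() and every label is a valid index.
double compactness(const std::vector<Point2d>& points,
                   const std::vector<int>& labels,
                   const std::vector<Point2d>& centers)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < points.size(); ++i)
    {
        const Point2d& c = centers[static_cast<std::size_t>(labels[i])];
        double dx = points[i].x - c.x;
        double dy = points[i].y - c.y;
        sum += dx * dx + dy * dy;
    }
    return sum;
}

// Index of the attempt with the smallest compactness.
// Assumes scores is not empty.
std::size_t bestAttempt(const std::vector<double>& scores)
{
    return static_cast<std::size_t>(
        std::distance(scores.begin(),
                      std::min_element(scores.begin(), scores.end())));
}
```

In the demo attempts is 3, so three runs are made and the labels of the most compact one are kept.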

flags – Flag that can take the following values:

  – KMEANS_RANDOM_CENTERS Select random initial centers in each attempt.

Random initial centers.

  – KMEANS_PP_CENTERS Use kmeans++ center initialization by Arthur and Vassilvitskii [Arthur2007].

  – KMEANS_USE_INITIAL_LABELS During the first (and possibly the only) attempt,use the user-supplied labels instead of computing them from the initial centers. For the second and further attempts, use the random or semi-random centers. Use one of KMEANS_*_CENTERS flag to specify the exact method.

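User-supplied initial labels. Note that the user provides labels here, not centers, and they are only used for the first attempt.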

centers – Output matrix of the cluster centers, one row per each cluster center.

The cluster centers. Printing centers with cout shows values close to the center values generated in the for loop on lines 26-35.

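That is about all there is to it. Convert your samples into a points matrix in one of the accepted layouts, choose the number of clusters, and declare the two variables labels and centers. Everything else can stay as it is in the demo.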

posted @ 2016-01-23 18:16  hanfengcan