EX7：K-均值聚类和PCA

EX7：K-均值聚类和PCA
- - 1、K-均值聚类
  - 2.PCA-中心成分分析法

EX7：K-均值聚类和PCA

前言：本练习中，我们将利用K-均值算法压缩一张图片，第二部分中，将使用PCA为面部图片找寻低维描述。

1、K-均值聚类

在第一个练习中，主要实现K-means算法并将其用于图像压缩。首先从2D数据集样本开始，目的了解K-means算法如何工作的直观性。之后，将使用K-means算法进行图像压缩，将图像中发生的颜色数量减少为仅图像中最常见的颜色。

K-means算法是一种自动将类似数据样本聚类在一起的方法。具体来说，给定一个训练集{\(x^{(1)},x^{(2)},...,x^{(m)}\)}（其中\(x^{(i)} \in \mathbb{R}^n\)），希望将数据分组为几个集群。K-means的直观体现是一个迭代过程，它通过猜测初始聚类中心点（centroids）开始，然后重复地将样本分配给最接近的中心，基于分配来重新计算中心点。

K-均值算法流程如下：

%初始化中心点
centroids = kMeansInitCentroids(x,K);
for iter = 1:iterations
	%聚类分配步，将数据集分配到最邻近的中心点.
	%idx(i)对应于c^(i),为样本点所属中心点的索引
	idx = findClosestCentroids(X,centroids);
	
	%中心点移动步，计算所属样本下的均值点
	centroids = computeMeans(X,idx,K);
end

算法的内循环重复执行两个步骤：（1）将每个训练样本x（i）分配到其最近的中心点，（2）使用分配给它的点重新计算每个中心点的平均值。注意，融合解决方案可能并不总是理想的，并且取决于质心的初始设置。因此，实际上，K-means算法通常用不同的随机初始化运行几次，从不同的随机初始化中选择这些不同解决方案的一种方法是选择具有最低成本函数值（失真）的解决方案。

1.1找寻最近的中心点

在K-均值算法的聚类分配中，每个训练样本\(x^{(i)}\)被分配到最近的中心点，对于每一个样本i，我们设置\(c^{(i)}\):= j (其中j是 \(||x{(i)}-\mu_j||^2\)取得最小值时)。即\(c^{(i)}\)为最接近点\(x^{(i)}\)处中心点的索引，程序中以idx(i)表述。程序名为：findClosestCentroids.m

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1)

% ====================== YOUR CODE HERE ======================
m = size(X,1);
temp = zeros(K, 1);
temp_p = [];

for i = 1:m
    for j = 1:K
        temp(j) = (X(i,:)-centroids(j,:))*(X(i,:)-centroids(j,:))';
    end
    temp_p = find(temp==min(temp));
    idx(i) = temp_p(1);%避免出现同时两个最小值    
end

end

1.2计算中心点-centroid

对于中心点所属的点，第二步就是通过对所属的点计算均值重新定义中心点，对于每个中心点k，我们有：，这里\(C_k\)为属于中心点k的样本集。

如：样本\(x^{(3)}、x^{(5)}\)都属于中心点k=2，这时我们应该更新样本点：\(\mu_2=\frac{1}{2}(x^{(3)}+y^{(5)})\)。

程序名为：computeCentroids.m

function centroids = computeCentroids(X, idx, K)

% Useful variables
[m n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
temp = [];
for i = 1:K
    temp = X(find(idx==i),:);
    centroids(i,:) = sum(temp)/(size(temp,1));
end

end

1.3样本数据下的K-均值算法表现

完成上述两个函数（findClosestCentroids和computeCentroids）后，下一步将在二维数据集上运行K-means算法，以了解K-means如何工作。函数从runKmeans.m脚本中调用，具体如下：

function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)

% Set default value for plot progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
end

% Plot the data if we are plotting progress
if plot_progress
    figure;
    hold on;
end

% Initialize values
[m n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);

% Run K-Means
for i=1:max_iters
    
    % Output progress
    fprintf('K-Means iteration %d/%d...\n', i, max_iters);
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
    
    % 步一
    idx = findClosestCentroids(X, centroids);
    
    % Optionally, plot progress here
    if plot_progress
        plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
        previous_centroids = centroids;
        fprintf('Press enter to continue.\n');
        pause;
    end
    
    % 步二
    centroids = computeCentroids(X, idx, K);
end

% Hold off if we are plotting progress
if plot_progress
    hold off;
end

end

此外在本脚本中，对k-均值计算的每一步可视化中都调用了函数plotProgressMeans.m，code如下：

function plotProgresskMeans(X, centroids, previous, idx, K, i)
%It is intended for use only with 2D data.

% Plot the examples
plotDataPoints(X, idx, K);

%plotDataPoints如下：
%function plotDataPoints(X, idx, K)
%% Create palette-创建调色板
%palette = hsv(K + 1);
%colors = palette(idx, :);
%
%% Plot the data
%scatter(X(:,1), X(:,2), 15, colors);%散点图，又称气泡图

% Plot the centroids as black x's
plot(centroids(:,1), centroids(:,2), 'x', ...
     'MarkerEdgeColor','k', ...
     'MarkerSize', 10, 'LineWidth', 3);

% Plot the history of the centroids with lines
for j=1:size(centroids,1)
    drawLine(centroids(j, :), previous(j, :));%plot([p1(1) p2(1)], [p1(2) p2(2)], varargin{:});
end

% Title
title(sprintf('Iteration number %d', i))

end

在运行下一步时，K-means代码将产生可视化，对每次迭代的进度和进度可视化表现。按多次输入以查看K-means算法的每个步骤如何更改质心和集群分配。设置步长为10步，运行结果如下图所示：

![2.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fhplfeflb1j20cw0b0dg8.jpg)

#### 1.4随机初始化

以上程序中的中心点的初始赋值是给定好的。实际上，初始化中心点的良好策略是从训练集的样本中随机挑选。

函数kMensInitCentroids.m给出挑选过程，具体如下：

% Initialize the centroids to be random examples

% 重新随机排序样本的索引
randidx = randperm(size(X, 1));%randperm函数

% Take the first K examples as centroids
centroids = X(randidx(1:K), :);

上面的代码首先随机排列了样本的索引（使用randperm）。然后，它基于索引的随机排列来选择前K个样本。保证了随机选择这些样本的同时，不会选择相同的样本两次的风险。

1.5基于k-均值的图片压缩

在下面的24位真彩色图中（如，0x000000为黑色，0xFFFFFF为白色），每个像素可表示为指定红色，绿色和蓝色强度值的三个8位无符号整数（范围从0到255）。这种编码通常被称为RGB编码。我们的图像包含数千种颜色，在这部分练习中，将减少为16种颜色的颜色数量。

![3.jpg](http://wx4.sinaimg.cn/mw690/7b8d2108gy1fhplffzrosj208y08rn2z.jpg)

通过这种减少，可以有效地表示（压缩）照片。具体来说，只需要存储16个所选颜色的RGB值，并且对于图像中的每个像素，现在只需要在该位置存储颜色的索引（只需要4位来表示16种可能性）。
练习中利用K-means算法来选择将用于表示压缩图像的16种颜色。具体来说，将原始图像中的每个像素视为数据示例，并使用K-means算法找到最佳组（聚类）三维RGB空间中的像素的16种颜色。一旦计算了图像上的聚类中心点，将使用16种颜色来替换原始图像中的像素。

单个像素下的K-均值运用

首先，在Octave/MATLAB中，图片可以通过以下的命令进行读取：

% Load 128x128 color image (bird small.png)
A  = imread('bird small.png');

% If you do not have the image package installed, 
% you should instead change the following line to
%	load('bird small.mat'); % Loads the image into the variable A

上述代码所创建的一个三维矩阵A，其前两个索引标识像素位置，其最后一个索引表示红色，绿色或蓝色。例如，A（50,33,3）给出行50和列33处的像素的蓝色强度。
ex7.m中的代码首先加载图像，然后重新构图，创建一个m×3像素颜色矩阵（其中m = 16384 = 128×128），并在其上调用K-means函数。

在寻找到前K=16个颜色代表图片后，现在可以使用函数findClosetCentroids将每个像素位置最近的中心点做替换。即使用每个像素的中心点来表示原始图像。

请注意，此时已经显着减少了描述图像所需的位数。原始图像对于128×128像素位置中的每一个像素的描述需要24位，总大小为128×128×24 = 393216比特。新的表示形式需要以字典形式的形式存储16种颜色，每种颜色需要24位，但是图像本身只需要每像素位置4位。因此，使用的最终位数为16×24 + 128×128×4 = 65920比特，其对应于将原始图像压缩约6倍。运行程序，我们可以看到以下图像，左侧为原图像，右图为压缩后的图像。

![4.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fhplfhozkej20hl07vdls.jpg)

代码如下：

%  Load an image of a bird
A = double(imread('bird_small.png'));
A = A / 255; % Divide by 255 so that all values are in the range 0 - 1

% Size of the image
img_size = size(A);

% Reshape the image into an Nx3 matrix where N = number of pixels.
% Each row will contain the Red, Green and Blue pixel values
% This gives us our dataset matrix X that we will use K-Means on.
X = reshape(A, img_size(1) * img_size(2), 3);%学习之处

% Run your K-Means algorithm on this data
% You should try different values of K and max_iters here
K = 16; 
max_iters = 10;

% When using K-Means, it is important the initialize the centroids
% randomly. 
% You should complete the code in kMeansInitCentroids.m before proceeding
initial_centroids = kMeansInitCentroids(X, K);

% Run K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);

% Find closest cluster members
idx = findClosestCentroids(X, centroids);

% Essentially, now we have represented the image X as in terms of the
% indices in idx. 

% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by its index in idx) to the centroid value
X_recovered = centroids(idx,:);%直接写：centroids(idx,:)

% Reshape the recovered image into proper dimensions
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);%值得学习

% Display the original image 
subplot(1, 2, 1);
imagesc(A); 
title('Original');

%imagesc(A) 将矩阵A中的元素数值按大小转化为不同颜色，并在坐标轴对应位置处以这种颜色染色

% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));

2.PCA-中心成分分析法

本节练习中，将使用PCA来完成维度缩减，将首先从二维的数据集开始，从而对PCA怎样工作形成直观的认识，接着对多达5000份的面部图像进行处理。

2.1数据集样本

为了从直观上了解PCA的工作原理，首先从2D数据集开始，该数据集具有一个大的变化方向和一个较小的变化。如下图为脚本ex7 pca.m所绘制训练数据。在这部分练习中，将可以看出使用PCA将数据从二维缩小到一维时会发生什么。

![5.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fhplfibni7j206p06lt8m.jpg)

2.2实施PCA

PCA由两个计算步骤组成：首先，计算数据的协方差矩阵，然后使用Octave / MATLAB的SVD函数计算特征向量U1，U2，...，Un。

在使用PCA之前，首先通过从数据集中减去每个要素的平均值，并缩放每个维度使其处于相同的范围来首先对数据进行归一化。在脚本ex7 pca.m中，使用featureNormalize函数为此进行了归一化。归一化数据之后，可以运行PCA来计算主要成分。

可以由下式计算数据的协方差矩阵，\(\Sigma = \frac{1}{m}X^TX\)：其中X是以行为单位的矩阵，m是样本的数目，注意Σ是n×n矩阵，而不是求和运算符。

在计算过协方差矩阵之后，可以使用SVD函数：[U,S,V] = svd(Sigma)，这里U为主要成分，S为对角阵，之后运行可以得到下图：

![6.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fhplfir6p1j206k06ma9y.jpg)

pca.m函数代码如下：

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

Sigma = X'*X./m;
[U,S,~] = svd(Sigma);

2.3使用PCA维度缩减

在计算中心成分之后，可以使用它们通过将每个样本投影到较低维度空间，\(x^{(i)}->z^{(i)}\)（例如，将数据从2D投影到1D）。在本节中将使用PCA返回的特征向量，并将示例数据集投影到一维空间中。

实际上，如果正在使用线性回归或神经网络等学习算法，则可以使用投影数据而不是原始数据。通过使用投影数据，可以更快地训练模型速度，因为输入中的size较小。

将数据投影到主成分上

function Z = projectData(X, U, K)

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

Z = X*U(:,1:K);

重构数据

function X_rec = recoverData(Z, U, K)

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

X_rec = Z*U(:,1:K)';

在完成projectData和recoverData函数之后，ex7 pca.m将会把投影和近似值重建，以显示投影如何影响数据。原始数据点用蓝色圆圈表示，而投影数据点用红色圆圈表示。

![7.jpg](http://wx4.sinaimg.cn/mw690/7b8d2108gy1fhplfivf97j207t07pgln.jpg)

2.4面部图片数据集

ex7faces.mat中已经给出了32× 32的灰度面部图片数据集X，其中X矩阵的每一行表示一张面部图片，将数据集X的前100行即前100张面部图片可视化，如下图：

![1.jpg](http://wx1.sinaimg.cn/mw690/7b8d2108gy1fhqcrr6jcej209r09sacu.jpg)

2.4.1 PCA on Faces

为了在面部图片的数据集上运行PCA，首先应该先通过从数据矩阵X中减去每个特征的平均值来对数据集进行归一化处理。

运行PCA后，将获得数据集的中心成分。请注意，U（每行）中的每个主成分是长度为n的矢量（面部图片数据集的n = 1024）。这里我们可以通过将它们中的每一个重新形成-reshape与原始数据集中的像素对应的32×32矩阵来可视化这些计算出来的中心成分，可视化结果如下：(对比原图，眼眶、鼻子和嘴巴突出的很明显)

![2.jpg](http://wx3.sinaimg.cn/mw690/7b8d2108gy1fhqcrrrbkrj209u09sgmt.jpg)

%  Before running PCA, it is important to first normalize X by subtracting 
%  the mean value from each feature
[X_norm, mu, sigma] = featureNormalize(X);%

%  Run PCA
[U, S] = pca(X_norm);

%  绘出前36个特征向量
displayData(U(:, 1:36)');

2.4.2维度缩减

现在可以使用上面所计算的面部中心成分来应用与面部数据集上，这样就可以使用较小尺寸的输入（例如，100维）来学习算法，而不是原始的1024维度，以加快学习算法的速度。

ex7 pca.m脚本中将面部数据集投影到前100个中心成分上。具体地，现在每个脸部图像为向量\(z^{(i))} \in \mathbb{R}^{100}\)。

K = 100;
Z = projectData(X_norm, U, K);

要了解维度降低中丢失的内容，可以使用投影数据集来恢复数据。在脚本中，执行数据的近似恢复，原始和投影的脸部图像显示如下：

![3.jpg](http://wx3.sinaimg.cn/mw690/7b8d2108gy1fhqcrtj59ij20jj08ttde.jpg)

从重建--reconstructed中可以看出，在细节丢失的同时保留了脸部的一般结构和外观！ -->数据集大小显着降低（超过10倍），可以显着提高学习算法的速度，如果在训练神经网络中执行人物识别（gven面部图像，预测人物的身份识别），则可以使用只有100个维度的维度缩小输入而不是原始像素。

K = 100;
X_rec  = recoverData(Z, U, K);

% Display normalized data
subplot(1, 2, 1);
displayData(X_norm(1:100,:));
title('Original faces');
axis square;

% Display reconstructed data from only k eigenfaces
subplot(1, 2, 2);
displayData(X_rec(1:100,:));
title('Recovered faces');
axis square;

2.5 PCA在绘图的应用

在第一节中的K-means图像压缩练习中，已经通过对三维RGB空间中使用了K-means算法，这里在的脚本，我们使用scatter3函数（三维散点图函数）可视化此3D空间中的最终像素分配。每个数据点根据已分配给它的集群进行着色，如上图所示。

% Reload the image from the previous exercise and run K-Means on it
% For this to work, you need to complete the K-Means assignment first
A = double(imread('bird_small.png'));

% If imread does not work for you, you can try instead
%   load ('bird_small.mat');

A = A / 255;
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3);
K = 16; 
max_iters = 10;
initial_centroids = kMeansInitCentroids(X, K);
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);

%  Sample 1000 random indexes (since working with all the data is
%  too expensive. If you have a fast computer, you may increase this.
sel = floor(rand(1000, 1) * size(X, 1)) + 1;

%  Setup Color Palette
palette = hsv(K);
colors = palette(idx(sel), :);

%  Visualize the data and centroid memberships in 3D
figure;
scatter3(X(sel, 1), X(sel, 2), X(sel, 3), 10, colors);
title('Pixel dataset plotted in 3D. Color shows centroid memberships');

事实证明，可视化数据集在3维及更高的维度可能很麻烦。因此，一般以丢失一些信息为代价，来仅显示二维数据。在实践中，PCA通常用于降低数据的维度以进行可视化目的。在ex7 pca.m的脚本中利用PCA实现三维数据的二维可视化，并在2D散点图中显示结果。 PCA投影可以被认为是选择最大化数据扩展的视图的旋转，这通常对应于“最佳”视图。

![5.jpg](http://wx1.sinaimg.cn/mw690/7b8d2108gy1fhqcruj4oaj20d90abq3d.jpg)

posted @ 2017-07-19 23:10 SrtFrmGNU 阅读(1731) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

Luke's Blog.

欲穷千里目，事后拂衣归

EX7：K-均值聚类和PCA

EX7：K-均值聚类和PCA

1、K-均值聚类

1.1找寻最近的中心点

1.2计算中心点-centroid

1.3样本数据下的K-均值算法表现

1.5基于k-均值的图片压缩

单个像素下的K-均值运用

2.PCA-中心成分分析法

2.1数据集样本

2.2实施PCA

2.3使用PCA维度缩减

2.4面部图片数据集

2.4.1 PCA on Faces

2.4.2维度缩减

2.5 PCA在绘图的应用

公告