Exercise: Independent Component Analysis

Reference page: http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis

 

Experiment overview:

  Model: ICA

  Dataset: STL-10; samples are 8x8 RGB (3-channel) patches, so each sample has dimension 8*8*3 = 192

  Goal: learn an undercomplete set of orthonormal basis vectors for feature representation

Theory:

  ICA objective function: restated below in matrix notation.

      Partial derivative with respect to W: derived using the backpropagation idea; the final result on Ng's wiki page is wrong, and I have corrected it (see the derivation after orthonormalICACost.m below).
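The objective actually implemented in orthonormalICACost.m below (a reconstruction term plus a smoothed L1 sparsity penalty), restated in matrix notation:

J(W) \;=\; \frac{1}{m}\,\bigl\lVert W^{\top} W X - X \bigr\rVert_F^2
\;+\; \lambda \sum_{i,j} \sqrt{\bigl(W X\bigr)_{ij}^{2} + \epsilon},
\qquad \text{subject to } W W^{\top} = I,

where X is the 192 x 20000 matrix of whitened patches, W is the 121 x 192 weight matrix, m = 20000, and epsilon smooths the L1 penalty. The orthonormality constraint is not part of the cost function; it is enforced by projection during gradient descent.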

 

Notes and issues:

  1. The optimization uses a backtracking line search, which I don't fully understand in detail. The rough idea: the gradient gives the descent direction, but how large a step should be taken along it? The line search tries a step and checks a sufficient-decrease condition; in the provided code, if the new cost is not below lastCost - alpha * t * linearDelta, the tolerance t is shrunk and the step is retried, and t is grown again after each accepted step so later iterations can try for more progress (a minimal sketch of the standard formulation follows this list).
  2. Setting the learning rate for gradient descent: the original code uses 0.5, but the optimization diverges after a while (within 100 iterations; the learned bases are visualized every 50 iterations, and the iteration-100 result has already blown up). I therefore drop the learning rate to 0.1 after 50 iterations, after which it no longer diverges; 0.1 was chosen arbitrarily, without any careful justification.
  3. Setting lambda in the objective: following tornadomeet's blog post I first used 1/m (with m = 20,000 samples), but convergence was extremely slow; setting it to 1e-2, as in the sparse coding exercise, sped convergence up noticeably. This was my first real taste of how important hyperparameters are to a deep learning model.
  4. No regularization term is used in the ZCA whitening preprocessing: the ICA model learns an undercomplete orthonormal basis and the eigenvalues do not come out as zero, so regularization is unnecessary.
  5. Because an undercomplete basis is learned, the number of features must be smaller than the sample dimension; in this experiment the sample dimension is 192 and the number of features is 121.
  6. Once ICAExercise.m passes the gradient check, comment the verification code out.
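As a minimal sketch of the standard backtracking (Armijo) line search mentioned in note 1: start from a relatively large step and shrink it until the objective drops by at least a fraction of what the gradient predicts. The names backtrackingLineSearch, costFunc, beta and c are illustrative placeholders, not part of the exercise code; the exercise code below keeps the step alpha fixed and adapts the tolerance t instead, but the acceptance test is the same.

% Standard backtracking line search (sketch): shrink the step alpha
% until f(w - alpha*g) <= f(w) - c*alpha*||g||^2 (Armijo condition).
function alpha = backtrackingLineSearch(costFunc, w, g, alpha0, beta, c)
    % costFunc : handle returning the scalar objective, e.g. @(w) f(w)
    % w        : current parameters (vector or matrix)
    % g        : gradient of the objective at w
    % alpha0   : initial (largest) step size to try
    % beta     : shrink factor in (0,1), e.g. 0.5
    % c        : sufficient-decrease constant in (0,1), e.g. 1e-4
    alpha = alpha0;
    f0 = costFunc(w);
    gradNormSq = sum(g(:).^2);
    while costFunc(w - alpha * g) > f0 - c * alpha * gradNormSq
        alpha = beta * alpha;   % step too aggressive: shrink and retry
    end
end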

实验结果:

With the learning rate unchanged: results at iteration 50 and iteration 100. The iteration-100 result has clearly diverged, which shows the learning rate is too large.

Results after adjusting the step size: iterations 20 and 1000. My machine is slow and the run has not produced a final result yet, but the result at iteration 1000 already looks quite good. Ng suggests 20,000 iterations; tornadomeet uses a smaller lambda than I do and suggests 50,000; my feeling is that with my parameter settings roughly 10,000 iterations should give a fairly good result. I will post the figure once the run finishes.

 

Final result: after 3,000 iterations the result looks roughly the same as at 1,000 iterations.

 

Main experiment code:

ICAExercise.m

%% CS294A/CS294W Independent Component Analysis (ICA) Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  ICA exercise. In this exercise, you will need to modify
%  orthonormalICACost.m and a small part of this file, ICAExercise.m.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;
numFeatures = 121;
imageChannels = 3;
patchDim = 8;
visibleSize = patchDim * patchDim * imageChannels;

outputDir = '.';
epsilon = 1e-6; % L1-regularisation epsilon |Wx| ~ sqrt((Wx).^2 + epsilon)

%%======================================================================
%% STEP 1: Sample patches

patches = load('stlSampledPatches.mat');
patches = patches.patches(:, 1:numPatches);
displayColorNetwork(patches(:, 1:100));


%%======================================================================
%% STEP 2: ZCA whiten patches
%  In this step, we ZCA whiten the sampled patches. This is necessary for
%  orthonormal ICA to work.

patches = patches / 255;
meanPatch = mean(patches, 2);
patches = bsxfun(@minus, patches, meanPatch);

sigma = patches * patches';
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s))) * u';
patches = ZCAWhite * patches;
figure(2);
displayColorNetwork(patches(:, 1:100));
%%======================================================================
%% STEP 3: ICA cost functions
%  Implement the cost function for orthonormal ICA (you don't have to
%  enforce the orthonormality constraint in the cost function) 
%  in the function orthonormalICACost in orthonormalICACost.m.
%  Once you have implemented the function, check the gradient.

% %Use less features and smaller patches for speed
% numFeatures = 5;
% patches = patches(1:3, 1:5);
% visibleSize = 3;
% numPatches = 5;
% 
% weightMatrix = rand(numFeatures, visibleSize);
% 
% [cost, grad] = orthonormalICACost(weightMatrix, visibleSize, numFeatures, patches, epsilon);
% 
% numGrad = computeNumericalGradient( @(x) orthonormalICACost(x, visibleSize, numFeatures, patches, epsilon), weightMatrix(:) );
% % Uncomment to display the numeric and analytic gradients side-by-side
% % disp([numGrad grad]); 
% diff = norm(numGrad-grad)/norm(numGrad+grad);
% fprintf('Orthonormal ICA difference: %g\n', diff);
% assert(diff < 1e-7, 'Difference too large. Check your analytic gradients.');
% 
% fprintf('Congratulations! Your gradients seem okay.\n');
% 

%%======================================================================
%% STEP 4: Optimization for orthonormal ICA
%  Optimize for the orthonormal ICA objective, enforcing the orthonormality
%  constraint. Code has been provided to do the gradient descent with a
%  backtracking line search using the orthonormalICACost function 
%  (for more information about backtracking line search, you can read the 
%  appendix of the exercise).
%
%  However, you will need to write code to enforce the orthonormality 
%  constraint by projecting weightMatrix back into the space of matrices 
%  satisfying WW^T  = I.
%
%  Once you are done, you can run the code. 10000 iterations of gradient
%  descent will take around 2 hours, and only a few bases will be
%  completely learned within 10000 iterations. This highlights one of the
%  weaknesses of orthonormal ICA - it is difficult to optimize for the
%  objective function while enforcing the orthonormality constraint - 
%  convergence using gradient descent and projection is very slow.

weightMatrix = rand(numFeatures, visibleSize);%121*192
[cost, grad] = orthonormalICACost(weightMatrix(:), visibleSize, numFeatures, patches, epsilon);
fprintf('%11s%16s%10s\n','Iteration','Cost','t');
startTime = tic();

% Initialize some parameters for the backtracking line search
alpha = 0.5;
t = 0.02;
lastCost = 1e40;

% Do 10000 iterations of gradient descent
for iteration = 1:10000
                       
    grad = reshape(grad, size(weightMatrix));
    newCost = Inf;        
    linearDelta = sum(sum(grad .* grad));
    
    if iteration == 50
        alpha = 0.1;
    end
    % Perform the backtracking line search
    % Each iteration computes the gradient grad, then the backtracking loop below searches
    % for an acceptable step along that direction: if the sufficient-decrease test fails,
    % t is shrunk and the step retried; the search stops once the test passes.
    while 1
        considerWeightMatrix = weightMatrix - alpha * grad;
        % -------------------- YOUR CODE HERE --------------------
        % Instructions:
        %   Write code to project considerWeightMatrix back into the space
        %   of matrices satisfying WW^T = I.
        %   
        %   Once that is done, verify that your projection is correct by 
        %   using the checking code below. After you have verified your
        %   code, comment out the checking code before running the
        %   optimization.
        
        % Project considerWeightMatrix such that it satisfies WW^T = I
%         error('Fill in the code for the projection here');        
        % Symmetric orthogonalization: W <- (W*W')^(-1/2) * W projects the
        % candidate weight matrix back onto the set satisfying W*W' = I.
        considerWeightMatrix = (considerWeightMatrix*considerWeightMatrix')^(-0.5)*considerWeightMatrix;
        % Verify that the projection is correct
        temp = considerWeightMatrix * considerWeightMatrix';
        temp = temp - eye(numFeatures);
        assert(sum(temp(:).^2) < 1e-23, 'considerWeightMatrix does not satisfy WW^T = I. Check your projection again');
   %      error('Projection seems okay. Comment out verification code before running optimization.');
        
        % -------------------- YOUR CODE HERE --------------------                                        

        [newCost, newGrad] = orthonormalICACost(considerWeightMatrix(:), visibleSize, numFeatures, patches, epsilon);
        if newCost >= lastCost - alpha * t * linearDelta
            t = 0.9 * t;
        else
            break;
        end
    end
   
    lastCost = newCost;
    weightMatrix = considerWeightMatrix;
    
    fprintf('  %9d  %14.6f  %8.7g\n', iteration, newCost, t);
    
    t = 1.1 * t;
    
    cost = newCost;
    grad = newGrad;
           
    % Visualize the learned bases as we go along    
    if mod(iteration, 20) == 0
        duration = toc(startTime);
        % Visualize the learned bases over time in different figures so 
        % we can get a feel for the slow rate of convergence
        figure(floor(iteration /  20));
        displayColorNetwork(weightMatrix'); 
    end
                   
end

% Visualize the learned bases
displayColorNetwork(weightMatrix');

 

orthonormalICACost.m

function [cost, grad] = orthonormalICACost(theta, visibleSize, numFeatures, patches, epsilon)
%orthonormalICACost - compute the cost and gradients for orthonormal ICA
%                     (i.e. compute the cost ||Wx||_1 and its gradient)

    weightMatrix = reshape(theta, numFeatures, visibleSize);
    
    cost = 0;
    grad = zeros(numFeatures, visibleSize);
    num_samples = size(patches,2);
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.     
    % -------------------- YOUR CODE HERE -------------------- 
    lambda = 1e-2;            % weight of the smoothed L1 sparsity penalty
    m = size( patches , 2);   % number of training patches
    w = weightMatrix;
    x = patches;
    reconErr = w'*w*x - x;    % reconstruction error (renamed so as not to shadow MATLAB's error())
    cost = 1/m * sum( reconErr(:).^2 ) + lambda * sum( sum ( sqrt( (w*x).^2 + epsilon ) ) );
    grad = 1/m * ( 2*w*reconErr*x' + 2*(w*x)*reconErr' ) + ...
          lambda * ( ((w*x).^2 + epsilon).^(-0.5) .* (w*x) ) * x';
    grad = grad(:);
end
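For reference, here is one way to derive the gradient computed above using the matrix-calculus / backpropagation idea; this is my own derivation, checked against the code, not a transcription of the wiki page. Write E = W^{\top} W X - X. For the reconstruction term,

d\,\lVert E \rVert_F^2 = 2\,\mathrm{tr}\!\bigl(E^{\top}\,dE\bigr)
                       = 2\,\mathrm{tr}\!\bigl(E^{\top}(dW^{\top}\,W X + W^{\top}\,dW\,X)\bigr)
\;\;\Rightarrow\;\;
\nabla_W \lVert E \rVert_F^2 = 2\,(W X)\,E^{\top} + 2\,W E\,X^{\top}.

For the smoothed L1 term,

\nabla_W \sum_{i,j} \sqrt{(W X)_{ij}^{2} + \epsilon}
  = \Bigl( (W X) \oslash \sqrt{(W X)^{2} + \epsilon} \Bigr)\,X^{\top},

where the square and square root are element-wise and \oslash denotes element-wise division. Scaling the first result by 1/m and the second by \lambda gives exactly the two terms assembled into grad in the code above.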

 

References

http://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_idea

http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis

http://deeplearning.stanford.edu/wiki/index.php/Independent_Component_Analysis

http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation

http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding

 
