Exercise: Independent Component Analysis

Reference page: http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis

 

Experiment overview:

  Model: ICA

  Dataset: STL-10; samples are 8x8 RGB (3-channel) patches, so each sample has dimension 8*8*3 = 192

  Goal: learn an undercomplete set of orthonormal basis vectors for feature representation

Theory:

  ICA objective function: restated below in matrix notation.

      Partial derivative with respect to W: derived using the backpropagation idea; the final result on Ng's wiki page is wrong, and I have corrected it (see the derivation after orthonormalICACost.m below).
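The objective actually implemented in orthonormalICACost.m below (a reconstruction term plus a smoothed L1 sparsity penalty), restated in matrix notation:

J(W) \;=\; \frac{1}{m}\,\bigl\lVert W^{\top} W X - X \bigr\rVert_F^2
\;+\; \lambda \sum_{i,j} \sqrt{\bigl(W X\bigr)_{ij}^{2} + \epsilon},
\qquad \text{subject to } W W^{\top} = I,

where X is the 192 x 20000 matrix of whitened patches, W is the 121 x 192 weight matrix, m = 20000, and epsilon smooths the L1 penalty. The orthonormality constraint is not part of the cost function; it is enforced by projection during gradient descent.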

 

Notes and issues:

  1. The optimization uses a backtracking line search, which I don't fully understand in detail. The rough idea: the gradient gives the descent direction, but how large a step should be taken along it? The line search tries a step and checks a sufficient-decrease condition; in the provided code, if the new cost is not below lastCost - alpha * t * linearDelta, the tolerance t is shrunk and the step is retried, and t is grown again after each accepted step so later iterations can try for more progress (a minimal sketch of the standard formulation follows this list).
  2. Setting the learning rate for gradient descent: the original code uses 0.5, but the optimization diverges after a while (within 100 iterations; the learned bases are visualized every 50 iterations, and the iteration-100 result has already blown up). I therefore drop the learning rate to 0.1 after 50 iterations, after which it no longer diverges; 0.1 was chosen arbitrarily, without any careful justification.
  3. Setting lambda in the objective: following tornadomeet's blog post I first used 1/m (with m = 20,000 samples), but convergence was extremely slow; setting it to 1e-2, as in the sparse coding exercise, sped convergence up noticeably. This was my first real taste of how important hyperparameters are to a deep learning model.
  4. No regularization term is used in the ZCA whitening preprocessing: the ICA model learns an undercomplete orthonormal basis and the eigenvalues do not come out as zero, so regularization is unnecessary.
  5. Because an undercomplete basis is learned, the number of features must be smaller than the sample dimension; in this experiment the sample dimension is 192 and the number of features is 121.
  6. Once ICAExercise.m passes the gradient check, comment the verification code out.
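As a minimal sketch of the standard backtracking (Armijo) line search mentioned in note 1: start from a relatively large step and shrink it until the objective drops by at least a fraction of what the gradient predicts. The names backtrackingLineSearch, costFunc, beta and c are illustrative placeholders, not part of the exercise code; the exercise code below keeps the step alpha fixed and adapts the tolerance t instead, but the acceptance test is the same.

% Standard backtracking line search (sketch): shrink the step alpha
% until f(w - alpha*g) <= f(w) - c*alpha*||g||^2 (Armijo condition).
function alpha = backtrackingLineSearch(costFunc, w, g, alpha0, beta, c)
    % costFunc : handle returning the scalar objective, e.g. @(w) f(w)
    % w        : current parameters (vector or matrix)
    % g        : gradient of the objective at w
    % alpha0   : initial (largest) step size to try
    % beta     : shrink factor in (0,1), e.g. 0.5
    % c        : sufficient-decrease constant in (0,1), e.g. 1e-4
    alpha = alpha0;
    f0 = costFunc(w);
    gradNormSq = sum(g(:).^2);
    while costFunc(w - alpha * g) > f0 - c * alpha * gradNormSq
        alpha = beta * alpha;   % step too aggressive: shrink and retry
    end
end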

实验结果:

With the learning rate unchanged: results at iteration 50 and iteration 100. The iteration-100 result has clearly diverged, which shows the learning rate is too large.

Results after adjusting the step size: iterations 20 and 1000. My machine is slow and the run has not produced a final result yet, but the result at iteration 1000 already looks quite good. Ng suggests 20,000 iterations; tornadomeet uses a smaller lambda than I do and suggests 50,000; my feeling is that with my parameter settings roughly 10,000 iterations should give a fairly good result. I will post the figure once the run finishes.

 

Final result: after 3,000 iterations the result looks roughly the same as at 1,000 iterations.

 

Main experiment code:

ICAExercise.m

%% CS294A/CS294W Independent Component Analysis (ICA) Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  ICA exercise. In this exercise, you will need to modify
%  orthonormalICACost.m and a small part of this file, ICAExercise.m.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;
numFeatures = 121;
imageChannels = 3;
patchDim = 8;
visibleSize = patchDim * patchDim * imageChannels;

outputDir = '.';
epsilon = 1e-6; % L1-regularisation epsilon |Wx| ~ sqrt((Wx).^2 + epsilon)

%%======================================================================
%% STEP 1: Sample patches

patches = load('stlSampledPatches.mat');
patches = patches.patches(:, 1:numPatches);
displayColorNetwork(patches(:, 1:100));


%%======================================================================
%% STEP 2: ZCA whiten patches
%  In this step, we ZCA whiten the sampled patches. This is necessary for
%  orthonormal ICA to work.

patches = patches / 255;
meanPatch = mean(patches, 2);
patches = bsxfun(@minus, patches, meanPatch);

sigma = patches * patches';
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s))) * u';
patches = ZCAWhite * patches;
figure(2);
displayColorNetwork(patches(:, 1:100));
%%======================================================================
%% STEP 3: ICA cost functions
%  Implement the cost function for orthonormal ICA (you don't have to
%  enforce the orthonormality constraint in the cost function) 
%  in the function orthonormalICACost in orthonormalICACost.m.
%  Once you have implemented the function, check the gradient.

% %Use less features and smaller patches for speed
% numFeatures = 5;
% patches = patches(1:3, 1:5);
% visibleSize = 3;
% numPatches = 5;
% 
% weightMatrix = rand(numFeatures, visibleSize);
% 
% [cost, grad] = orthonormalICACost(weightMatrix, visibleSize, numFeatures, patches, epsilon);
% 
% numGrad = computeNumericalGradient( @(x) orthonormalICACost(x, visibleSize, numFeatures, patches, epsilon), weightMatrix(:) );
% % Uncomment to display the numeric and analytic gradients side-by-side
% % disp([numGrad grad]); 
% diff = norm(numGrad-grad)/norm(numGrad+grad);
% fprintf('Orthonormal ICA difference: %g\n', diff);
% assert(diff < 1e-7, 'Difference too large. Check your analytic gradients.');
% 
% fprintf('Congratulations! Your gradients seem okay.\n');
% 

%%======================================================================
%% STEP 4: Optimization for orthonormal ICA
%  Optimize for the orthonormal ICA objective, enforcing the orthonormality
%  constraint. Code has been provided to do the gradient descent with a
%  backtracking line search using the orthonormalICACost function 
%  (for more information about backtracking line search, you can read the 
%  appendix of the exercise).
%
%  However, you will need to write code to enforce the orthonormality 
%  constraint by projecting weightMatrix back into the space of matrices 
%  satisfying WW^T  = I.
%
%  Once you are done, you can run the code. 10000 iterations of gradient
%  descent will take around 2 hours, and only a few bases will be
%  completely learned within 10000 iterations. This highlights one of the
%  weaknesses of orthonormal ICA - it is difficult to optimize for the
%  objective function while enforcing the orthonormality constraint - 
%  convergence using gradient descent and projection is very slow.

weightMatrix = rand(numFeatures, visibleSize);%121*192
[cost, grad] = orthonormalICACost(weightMatrix(:), visibleSize, numFeatures, patches, epsilon);
fprintf('%11s%16s%10s\n','Iteration','Cost','t');
startTime = tic();

% Initialize some parameters for the backtracking line search
alpha = 0.5;
t = 0.02;
lastCost = 1e40;

% Do 10000 iterations of gradient descent
for iteration = 1:10000
                       
    grad = reshape(grad, size(weightMatrix));
    newCost = Inf;        
    linearDelta = sum(sum(grad .* grad));
    
    if iteration == 50
        alpha = 0.1;
    end
    % Perform the backtracking line search
    % Each iteration computes the gradient grad, then the backtracking loop below searches
    % for an acceptable step along that direction: if the sufficient-decrease test fails,
    % t is shrunk and the step retried; the search stops once the test passes.
    while 1
        considerWeightMatrix = weightMatrix - alpha * grad;
        % -------------------- YOUR CODE HERE --------------------
        % Instructions:
        %   Write code to project considerWeightMatrix back into the space
        %   of matrices satisfying WW^T = I.
        %   
        %   Once that is done, verify that your projection is correct by 
        %   using the checking code below. After you have verified your
        %   code, comment out the checking code before running the
        %   optimization.
        
        % Project considerWeightMatrix such that it satisfies WW^T = I
%         error('Fill in the code for the projection here');        
        % Symmetric orthogonalization: W <- (W*W')^(-1/2) * W projects the
        % candidate weight matrix back onto the set satisfying W*W' = I.
        considerWeightMatrix = (considerWeightMatrix*considerWeightMatrix')^(-0.5)*considerWeightMatrix;
        % Verify that the projection is correct
        temp = considerWeightMatrix * considerWeightMatrix';
        temp = temp - eye(numFeatures);
        assert(sum(temp(:).^2) < 1e-23, 'considerWeightMatrix does not satisfy WW^T = I. Check your projection again');
   %      error('Projection seems okay. Comment out verification code before running optimization.');
        
        % -------------------- YOUR CODE HERE --------------------                                        

        [newCost, newGrad] = orthonormalICACost(considerWeightMatrix(:), visibleSize, numFeatures, patches, epsilon);
        if newCost >= lastCost - alpha * t * linearDelta
            t = 0.9 * t;
        else
            break;
        end
    end
   
    lastCost = newCost;
    weightMatrix = considerWeightMatrix;
    
    fprintf('  %9d  %14.6f  %8.7g\n', iteration, newCost, t);
    
    t = 1.1 * t;
    
    cost = newCost;
    grad = newGrad;
           
    % Visualize the learned bases as we go along    
    if mod(iteration, 20) == 0
        duration = toc(startTime);
        % Visualize the learned bases over time in different figures so 
        % we can get a feel for the slow rate of convergence
        figure(floor(iteration /  20));
        displayColorNetwork(weightMatrix'); 
    end
                   
end

% Visualize the learned bases
displayColorNetwork(weightMatrix');

 

orthonormalICACost.m

function [cost, grad] = orthonormalICACost(theta, visibleSize, numFeatures, patches, epsilon)
%orthonormalICACost - compute the cost and gradients for orthonormal ICA
%                     (i.e. compute the cost ||Wx||_1 and its gradient)

    weightMatrix = reshape(theta, numFeatures, visibleSize);
    
    cost = 0;
    grad = zeros(numFeatures, visibleSize);
    num_samples = size(patches,2);
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.     
    % -------------------- YOUR CODE HERE -------------------- 
    lambda = 1e-2;            % weight of the smoothed L1 sparsity penalty
    m = size( patches , 2);   % number of training patches
    w = weightMatrix;
    x = patches;
    reconErr = w'*w*x - x;    % reconstruction error (renamed so as not to shadow MATLAB's error())
    cost = 1/m * sum( reconErr(:).^2 ) + lambda * sum( sum ( sqrt( (w*x).^2 + epsilon ) ) );
    grad = 1/m * ( 2*w*reconErr*x' + 2*(w*x)*reconErr' ) + ...
          lambda * ( ((w*x).^2 + epsilon).^(-0.5) .* (w*x) ) * x';
    grad = grad(:);
end
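For reference, here is one way to derive the gradient computed above using the matrix-calculus / backpropagation idea; this is my own derivation, checked against the code, not a transcription of the wiki page. Write E = W^{\top} W X - X. For the reconstruction term,

d\,\lVert E \rVert_F^2 = 2\,\mathrm{tr}\!\bigl(E^{\top}\,dE\bigr)
                       = 2\,\mathrm{tr}\!\bigl(E^{\top}(dW^{\top}\,W X + W^{\top}\,dW\,X)\bigr)
\;\;\Rightarrow\;\;
\nabla_W \lVert E \rVert_F^2 = 2\,(W X)\,E^{\top} + 2\,W E\,X^{\top}.

For the smoothed L1 term,

\nabla_W \sum_{i,j} \sqrt{(W X)_{ij}^{2} + \epsilon}
  = \Bigl( (W X) \oslash \sqrt{(W X)^{2} + \epsilon} \Bigr)\,X^{\top},

where the square and square root are element-wise and \oslash denotes element-wise division. Scaling the first result by 1/m and the second by \lambda gives exactly the two terms assembled into grad in the code above.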

 

References

http://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_idea

http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis

http://deeplearning.stanford.edu/wiki/index.php/Independent_Component_Analysis

http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation

http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding

 
