Exercise: Independent Component Analysis
Reference page: http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis
Experiment overview:
Model: ICA
Dataset: STL-10; 8×8 RGB patches (3 channels), so each sample has dimension 8*8*3 = 192
Goal: learn an undercomplete set of orthonormal basis vectors for feature representation
Theory:
ICA objective function:
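(The equation image from the original post is not reproduced here; writing the objective out so that it matches the cost actually implemented in orthonormalICACost.m below, i.e. a reconstruction term plus a smoothed L1 sparsity penalty, with the orthonormality constraint handled by projection during optimization:)

$$
J(W) = \frac{1}{m}\sum_{i=1}^{m}\left\lVert W^{\top}W x^{(i)} - x^{(i)}\right\rVert_2^2
+ \lambda \sum_{i=1}^{m}\sum_{k}\sqrt{\left(W x^{(i)}\right)_k^2 + \epsilon},
\qquad \text{s.t. } WW^{\top} = I
$$

where $W \in \mathbb{R}^{121 \times 192}$, $m = 20{,}000$ is the number of patches, and $\epsilon = 10^{-6}$ smooths the L1 term.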
Partial derivative with respect to W: the gradient is derived using the backpropagation idea; the final expression on Ng's wiki page is incorrect, and I have corrected it here.
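For reference, the expression that the implementation below uses (stacking the patches column-wise into $X$; the square root and division in the sparsity term are elementwise):

$$
\nabla_{W} J = \frac{1}{m}\Big( 2\,W\,(W^{\top}W X - X)\,X^{\top} + 2\,(W X)\,(W^{\top}W X - X)^{\top} \Big)
+ \lambda\,\frac{W X}{\sqrt{(W X)^{2} + \epsilon}}\,X^{\top}
$$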
Some notes:
- The optimization uses a backtracking line search. I do not fully understand the details, but the rough idea is to find a step size t so that the objective moves toward the optimum by as large a step as possible. The gradient gives the current descent direction, but how far should we move along it? Backtracking searches for that step, adjusting t until the change in the objective between successive evaluations satisfies the acceptance condition (see the sketch after this list).
- Learning-rate setting for gradient descent: the starter code uses 0.5, but training diverges after a while (before 100 iterations; since the results are visualized every 50 iterations, the divergence is already visible at iteration 100). I therefore drop the learning rate to 0.1 after 50 iterations, which avoids the divergence; 0.1 is an arbitrary choice, not the result of any careful derivation.
- Setting lambda in the objective: following tornadomeet's blog post I first used 1/m (m = number of samples = 20,000), but convergence was very slow. Setting it to 1e-2, as in the sparse coding exercise, speeds up convergence noticeably. This was my first real experience of how much hyperparameters matter in deep learning.
- When preprocessing with ZCA whitening, no regularization term is needed: because ICA learns an undercomplete set of orthonormal basis vectors, none of the eigenvalues come out as zero, so regularization is unnecessary.
- Because we are learning an undercomplete basis, the number of learned features must be smaller than the sample dimension; in this experiment the sample dimension is 192 and the number of features is 121.
- Once ICAExercise.m passes the gradient check, comment out the verification code.
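To make the backtracking idea from the first note above concrete, here is a minimal, generic Armijo-style sketch. This is my own illustration, not the exact loop in ICAExercise.m below, which keeps the step alpha fixed and instead adapts the acceptance parameter t across iterations.

```matlab
% Generic Armijo backtracking line search (illustrative sketch only).
% f is a handle returning the objective value, w the current point,
% g the gradient at w, and step0 an initial (optimistic) trial step.
function step = backtrackingStep(f, w, g, step0)
    step = step0;            % start from a large trial step
    beta = 0.5;              % factor by which the step is shrunk
    c    = 1e-4;             % sufficient-decrease constant
    fw   = f(w);
    gg   = g(:)' * g(:);     % squared gradient norm
    % Shrink the step until the objective has decreased "enough"
    while f(w - step * g) > fw - c * step * gg
        step = beta * step;
    end
end
```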
Results:
With the step size held fixed, the results at 50 and 100 iterations: at 100 iterations the learned bases have clearly diverged, which shows that the learning rate is too large.
Results after adjusting the step size, at 20 and 1000 iterations. My machine is slow and the final run is still going, but the result at 1000 iterations already looks quite good. Ng suggests 20,000 iterations; tornadomeet, who uses a smaller lambda than I do, suggests 50,000. My feeling is that with my parameter settings roughly 10,000 iterations should give a reasonably good result. I will post the figure once the run finishes.
Final result: after 3000 iterations, the result looks much the same as at 1000 iterations.
Main experiment code:
ICAExercise.m
```matlab
%% CS294A/CS294W Independent Component Analysis (ICA) Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  ICA exercise. In this exercise, you will need to modify
%  orthonormalICACost.m and a small part of this file, ICAExercise.m.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;
numFeatures = 121;
imageChannels = 3;
patchDim = 8;
visibleSize = patchDim * patchDim * imageChannels;

outputDir = '.';
epsilon = 1e-6; % L1-regularisation epsilon |Wx| ~ sqrt((Wx).^2 + epsilon)

%%======================================================================
%% STEP 1: Sample patches

patches = load('stlSampledPatches.mat');
patches = patches.patches(:, 1:numPatches);
displayColorNetwork(patches(:, 1:100));

%%======================================================================
%% STEP 2: ZCA whiten patches
%  In this step, we ZCA whiten the sampled patches. This is necessary for
%  orthonormal ICA to work.

patches = patches / 255;
meanPatch = mean(patches, 2);
patches = bsxfun(@minus, patches, meanPatch);

sigma = patches * patches';
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s))) * u';
patches = ZCAWhite * patches;
figure(2);
displayColorNetwork(patches(:, 1:100));

%%======================================================================
%% STEP 3: ICA cost functions
%  Implement the cost function for orthonormal ICA (you don't have to
%  enforce the orthonormality constraint in the cost function)
%  in the function orthonormalICACost in orthonormalICACost.m.
%  Once you have implemented the function, check the gradient.

% % Use less features and smaller patches for speed
% numFeatures = 5;
% patches = patches(1:3, 1:5);
% visibleSize = 3;
% numPatches = 5;
%
% weightMatrix = rand(numFeatures, visibleSize);
%
% [cost, grad] = orthonormalICACost(weightMatrix, visibleSize, numFeatures, patches, epsilon);
%
% numGrad = computeNumericalGradient( @(x) orthonormalICACost(x, visibleSize, numFeatures, patches, epsilon), weightMatrix(:) );
% % Uncomment to display the numeric and analytic gradients side-by-side
% % disp([numGrad grad]);
% diff = norm(numGrad-grad)/norm(numGrad+grad);
% fprintf('Orthonormal ICA difference: %g\n', diff);
% assert(diff < 1e-7, 'Difference too large. Check your analytic gradients.');
%
% fprintf('Congratulations! Your gradients seem okay.\n');

%%======================================================================
%% STEP 4: Optimization for orthonormal ICA
%  Optimize for the orthonormal ICA objective, enforcing the orthonormality
%  constraint. Code has been provided to do the gradient descent with a
%  backtracking line search using the orthonormalICACost function
%  (for more information about backtracking line search, you can read the
%  appendix of the exercise).
%
%  However, you will need to write code to enforce the orthonormality
%  constraint by projecting weightMatrix back into the space of matrices
%  satisfying WW^T = I.
%
%  Once you are done, you can run the code. 10000 iterations of gradient
%  descent will take around 2 hours, and only a few bases will be
%  completely learned within 10000 iterations. This highlights one of the
%  weaknesses of orthonormal ICA - it is difficult to optimize for the
%  objective function while enforcing the orthonormality constraint -
%  convergence using gradient descent and projection is very slow.

weightMatrix = rand(numFeatures, visibleSize); % 121*192

[cost, grad] = orthonormalICACost(weightMatrix(:), visibleSize, numFeatures, patches, epsilon);

fprintf('%11s%16s%10s\n', 'Iteration', 'Cost', 't');

startTime = tic();

% Initialize some parameters for the backtracking line search
alpha = 0.5;
t = 0.02;
lastCost = 1e40;

% Do 10000 iterations of gradient descent
for iteration = 1:10000

    grad = reshape(grad, size(weightMatrix));
    newCost = Inf;
    linearDelta = sum(sum(grad .* grad));

    if iteration == 50
        alpha = 0.1;
    end

    % Perform the backtracking line search
    % At each iteration we compute the gradient grad, and backtracking searches
    % for the largest usable learning rate, i.e. the largest descent step along
    % the current gradient direction; the search stops once the change between
    % successive objective values meets the acceptance condition below.
    while 1
        considerWeightMatrix = weightMatrix - alpha * grad;

        % -------------------- YOUR CODE HERE --------------------
        % Instructions:
        %   Write code to project considerWeightMatrix back into the space
        %   of matrices satisfying WW^T = I.
        %
        %   Once that is done, verify that your projection is correct by
        %   using the checking code below. After you have verified your
        %   code, comment out the checking code before running the
        %   optimization.

        % Project considerWeightMatrix such that it satisfies WW^T = I
        % error('Fill in the code for the projection here');
        considerWeightMatrix = (considerWeightMatrix * considerWeightMatrix')^(-0.5) * considerWeightMatrix;

        % Verify that the projection is correct
        temp = considerWeightMatrix * considerWeightMatrix';
        temp = temp - eye(numFeatures);
        assert(sum(temp(:).^2) < 1e-23, 'considerWeightMatrix does not satisfy WW^T = I. Check your projection again');
        % error('Projection seems okay. Comment out verification code before running optimization.');

        % -------------------- YOUR CODE HERE --------------------

        [newCost, newGrad] = orthonormalICACost(considerWeightMatrix(:), visibleSize, numFeatures, patches, epsilon);
        if newCost >= lastCost - alpha * t * linearDelta
            t = 0.9 * t;
        else
            break;
        end
    end

    lastCost = newCost;
    weightMatrix = considerWeightMatrix;

    fprintf('  %9d  %14.6f  %8.7g\n', iteration, newCost, t);

    t = 1.1 * t;

    cost = newCost;
    grad = newGrad;

    % Visualize the learned bases as we go along
    if mod(iteration, 20) == 0
        duration = toc(startTime);
        % Visualize the learned bases over time in different figures so
        % we can get a feel for the slow rate of convergence
        figure(floor(iteration / 20));
        displayColorNetwork(weightMatrix');
    end
end

% Visualize the learned bases
displayColorNetwork(weightMatrix');
```
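A note on the projection line above: `(considerWeightMatrix*considerWeightMatrix')^(-0.5)` uses MATLAB's matrix power operator, i.e. a true matrix inverse square root rather than an elementwise power. The same symmetric orthogonalization can be written out via an eigendecomposition; the following is my own rewrite, shown only to make the operation explicit.

```matlab
% Symmetric orthogonalization: project W onto { W : W*W' = I }.
% Equivalent to (W*W')^(-0.5) * W; W*W' is symmetric positive definite
% when W has full row rank, so its eigendecomposition is well behaved.
function W = projectOrthonormal(W)
    [V, D] = eig(W * W');
    W = V * diag(1 ./ sqrt(diag(D))) * V' * W;
end
```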
orthonormalICACost.m
```matlab
function [cost, grad] = orthonormalICACost(theta, visibleSize, numFeatures, patches, epsilon)
%orthonormalICACost - compute the cost and gradients for orthonormal ICA
% (i.e. compute the cost ||Wx||_1 and its gradient)

weightMatrix = reshape(theta, numFeatures, visibleSize);

cost = 0;
grad = zeros(numFeatures, visibleSize);
num_samples = size(patches, 2);

% -------------------- YOUR CODE HERE --------------------
% Instructions:
%   Write code to compute the cost and gradient with respect to the
%   weights given in weightMatrix.
% -------------------- YOUR CODE HERE --------------------

lambda = 1e-2;
m = size(patches, 2);
w = weightMatrix;
x = patches;

error = w'*w*x - x;
cost = 1/m * sum( error(:).^2 ) + ...
       lambda * sum( sum( sqrt( (w*x).^2 + epsilon ) ) );
grad = 1/m * ( 2*w*(w'*w*x - x)*x' + 2*(w*x)*(w'*w*x - x)' ) + ...
       lambda * ( (w*x).^2 + epsilon ).^(-0.5) .* (w*x) * x';
grad = grad(:);

end
```
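Before running the full optimization, the gradient of this implementation can be sanity-checked on the reduced problem from STEP 3. A minimal sketch, assuming computeNumericalGradient.m from the earlier UFLDL exercises is on the path:

```matlab
% Gradient check on a tiny problem (5 features, 3-dimensional inputs, 5 patches)
smallPatches  = patches(1:3, 1:5);
smallVisible  = 3;
smallFeatures = 5;
W0 = rand(smallFeatures, smallVisible);

[~, grad] = orthonormalICACost(W0(:), smallVisible, smallFeatures, smallPatches, epsilon);
numGrad = computeNumericalGradient( ...
    @(x) orthonormalICACost(x, smallVisible, smallFeatures, smallPatches, epsilon), W0(:));

% The relative difference should be tiny (the exercise uses 1e-7 as the threshold)
diff = norm(numGrad - grad) / norm(numGrad + grad);
fprintf('Orthonormal ICA gradient difference: %g\n', diff);
```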
References
http://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_idea
http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_Analysis
http://deeplearning.stanford.edu/wiki/index.php/Independent_Component_Analysis
http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation
http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding