Deep Learning 1: Sparse Autoencoder

 

I have been working through the Stanford course at http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial for a month now. My grasp of the algorithms is still shaky, and the Exercises were mostly done by copying other people's code, so here I want to summarize the relevant material.

1. Autoencoders and Sparsity

Sparsity: the sparsity parameter \textstyle \rho

The average activation of hidden unit \textstyle j over the training set is

\begin{align}
\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]
\end{align}

We would like to (approximately) enforce the constraint

\begin{align}
\hat\rho_j = \rho,
\end{align}

To achieve this, an extra penalty term is added to the cost function:

\begin{align}
\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.
\end{align}

This penalty term is the KL divergence \textstyle {\rm KL}(\rho || \hat\rho_j); it reaches its minimum value of 0 when \textstyle \hat\rho_j = \rho, and grows as \textstyle \hat\rho_j deviates from \textstyle \rho.

The cost function then becomes

\begin{align}
J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
\end{align}
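
To make this concrete, here is a minimal MATLAB sketch of the penalty term. The toy sizes and variable names (a2, rhohat, etc.) are my own, not the UFLDL starter code's; it assumes the hidden activations for all m training examples are already available from a forward pass.

    % Toy sizes and activations, just so the sketch runs on its own
    s2 = 25; m = 100;                      % number of hidden units, number of examples
    a2 = rand(s2, m);                      % hidden activations a^{(2)} for all m examples (assumed given)
    rho = 0.05; beta = 3;                  % sparsity target and penalty weight (illustrative values)

    rhohat = mean(a2, 2);                  % \hat\rho_j: average activation of each hidden unit
    kl = sum(rho * log(rho ./ rhohat) + (1 - rho) * log((1 - rho) ./ (1 - rhohat)));
    sparsity_penalty = beta * kl;          % the term added to J(W,b) to obtain J_sparse(W,b)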

Accordingly (and for implementation convenience), the backpropagation error term for the hidden layer also gains an extra term:

\begin{align}
\delta^{(2)}_i =
  \left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .
\end{align}
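
A matching sketch of the sparsity-adjusted error term for the hidden layer (again with assumed names): the per-unit sparsity gradient is replicated across the m example columns before the element-wise multiplication by f'(z^{(2)}) = a^{(2)}(1 - a^{(2)}).

    % Toy sizes so the sketch runs on its own
    s1 = 64; s2 = 25; m = 100;
    W2 = 0.1*randn(s1, s2);                % weights from the hidden layer to the output layer
    a2 = rand(s2, m);                      % hidden activations (assumed given)
    delta3 = randn(s1, m);                 % output-layer error terms (assumed given)
    rho = 0.05; beta = 3;
    rhohat = mean(a2, 2);                  % computed from a prior forward pass over all the data

    sparsity_delta = beta * (-rho ./ rhohat + (1 - rho) ./ (1 - rhohat));   % s2 x 1
    delta2 = (W2' * delta3 + repmat(sparsity_delta, 1, m)) .* a2 .* (1 - a2);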

To obtain the average activations \hat\rho_j needed by this term, you must first run a forward pass over all the training data to collect the hidden activations, and only then run forward propagation again together with backpropagation to update the parameters. In other words, each training iteration requires two forward passes over the whole training set.

2. Backpropagation Algorithm

Backpropagation greatly simplifies computing the gradients used below.
For a training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}, the cost function for a single example is


\begin{align}
J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.
\end{align}

This per-example cost contains only the squared-error term. Over the whole training set, the cost becomes


\begin{align}
J(W,b)
&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]
                       + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
 \\
&= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right]
                       + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
\end{align}

The first part is the average squared error; the second is a regularization term, also called the weight decay term. Together they form the overall cost function J(W,b).
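
As an illustration, here is a minimal MATLAB sketch of this overall cost for a three-layer autoencoder; the sizes, variable names, and the sigmoid activation are assumptions of the sketch, not the official exercise code.

    % Toy data and parameters, just so the sketch runs
    s1 = 64; s2 = 25; m = 100; lambda = 1e-4;
    X = rand(s1, m); Y = X;                          % autoencoder: the targets equal the inputs
    W1 = 0.1*randn(s2, s1); b1 = zeros(s2, 1);
    W2 = 0.1*randn(s1, s2); b2 = zeros(s1, 1);
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    a2 = sigmoid(W1*X + repmat(b1, 1, m));           % hidden activations
    h  = sigmoid(W2*a2 + repmat(b2, 1, m));          % network output h_{W,b}(x)

    squared_error = (1/m) * sum(0.5 * sum((h - Y).^2, 1));          % first term: average squared error
    weight_decay  = (lambda/2) * (sum(W1(:).^2) + sum(W2(:).^2));   % second term: weight decay (no bias terms)
    J = squared_error + weight_decay;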

The gradient descent updates for the parameters W and b are


\begin{align}
W_{ij}^{(l)} &= W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \\
b_{i}^{(l)} &= b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)
\end{align}

where α is the learning rate.

The backpropagation algorithm makes computing these partial derivatives much more efficient.

Goal: run gradient descent for many iterations to obtain optimized parameters.

Each iteration computes the cost function and its gradient, then proceeds to the next iteration.

After the forward pass, backpropagation defines an error term for each node: (1) for the output layer, it is the derivative of the cost function with respect to the network output; (2) for an intermediate layer, it is obtained by multiplying the next layer's error terms by the connecting weights, propagating the error backwards through the network.

The derivatives of the overall cost function with respect to W and b are


\begin{align}
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) &=
\left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)} \\
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b) &=
\frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(i)}, y^{(i)})
\end{align}

 

First, run a feedforward pass on the training example to compute all the activations, ending with the network output h_{W,b}(x).

Next, for each node i in layer l, compute an error term \delta^{(l)}_i that measures how much that node was responsible for the error in the output. For the output layer (layer n_l), \delta^{(n_l)}_i can be defined directly from the difference between the network's activation and the true target value; for a hidden layer, \delta^{(l)}_i is defined as a weighted combination of the error terms of the nodes in the next layer that use a^{(l)}_i as input.

The algorithm steps are as follows:

    • Perform a feedforward pass, computing the activations for layers L2, L3, and so on up to the output layer L_{n_l}.
    • For each output unit i in layer n_l (the output layer), set
      
\begin{align}
\delta^{(n_l)}_i
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;
        \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)
\end{align}
    • For l = n_l-1, n_l-2, n_l-3, \ldots, 2
      For each node i in layer l, set
      
                 \delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)
    • Compute the desired partial derivatives, which are given as:
      
\begin{align}
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.
\end{align}

In matrix-vector form, as used in the MATLAB implementation, the algorithm is as follows (a runnable sketch follows the list):

    • Perform a feedforward pass, computing the activations for layers \textstyle L_2, \textstyle L_3, up to the output layer \textstyle L_{n_l}, using the equations defining the forward propagation steps
    • For the output layer (layer \textstyle n_l), set
      \begin{align}
\delta^{(n_l)}
= - (y - a^{(n_l)}) \bullet f'(z^{(n_l)})
\end{align}
    • For \textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2
      Set
      \begin{align}
                 \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})
                 \end{align}
    • Compute the desired partial derivatives:
      \begin{align}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.
\end{align}

 

Note: if \textstyle f(z) is the sigmoid function, then \textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i).
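
Here is a minimal MATLAB sketch of these vectorized steps for a three-layer network, processing all m examples at once as columns of X. Weight decay and the sparsity term are omitted for clarity, and the variable names are my own, not the starter code's.

    % Toy network and data so the sketch is runnable
    s1 = 64; s2 = 25; m = 100;
    X = rand(s1, m); Y = X;                         % autoencoder: reproduce the input
    W1 = 0.1*randn(s2, s1); b1 = zeros(s2, 1);
    W2 = 0.1*randn(s1, s2); b2 = zeros(s1, 1);
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    % Forward pass
    a2 = sigmoid(W1*X + repmat(b1, 1, m));
    a3 = sigmoid(W2*a2 + repmat(b2, 1, m));

    % Error terms, using f'(z) = a .* (1 - a) for the sigmoid
    delta3 = -(Y - a3) .* a3 .* (1 - a3);
    delta2 = (W2' * delta3) .* a2 .* (1 - a2);

    % Gradients, i.e. the per-example derivatives averaged over the m examples
    W1grad = (1/m) * delta2 * X';                   % \nabla_{W^{(1)}} averaged, size s2 x s1
    b1grad = (1/m) * sum(delta2, 2);
    W2grad = (1/m) * delta3 * a2';                  % \nabla_{W^{(2)}} averaged, size s1 x s2
    b2grad = (1/m) * sum(delta3, 2);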

Building on this, one iteration of the batch gradient descent algorithm proceeds as follows (a MATLAB sketch of the update step follows the list):

    • Set \textstyle \Delta W^{(l)} := 0, \textstyle \Delta b^{(l)} := 0 (matrix/vector of zeros) for all \textstyle l.
    • For \textstyle i = 1 to \textstyle m,
      1. Use backpropagation to compute \textstyle \nabla_{W^{(l)}} J(W,b;x,y) and \textstyle \nabla_{b^{(l)}} J(W,b;x,y).
      2. Set \textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y).
      3. Set \textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y).
    • Update the parameters:
      \begin{align}
W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\
b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
\end{align}
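
For each layer l, the update itself is then one line per parameter. A minimal sketch with made-up accumulated gradients (DeltaW and Deltab stand for the sums of the per-example gradients):

    % Illustrative layer, learning rate, and accumulated gradients
    m = 100; alpha = 0.1; lambda = 1e-4;
    W = 0.1*randn(25, 64); b = zeros(25, 1);
    DeltaW = randn(size(W)); Deltab = randn(size(b));   % sums of per-example gradients (placeholders)

    W = W - alpha * ((1/m) * DeltaW + lambda * W);      % weight update, including weight decay
    b = b - alpha * ((1/m) * Deltab);                   % bias update (no weight decay)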

3. Visualizing a Trained Autoencoder

We can visualize what the trained autoencoder has learned by plotting pixel intensity values.

Hidden unit i computes
\begin{align}
a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j  + b^{(1)}_i \right).
\end{align}

Under the norm constraint \textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1, the input that maximally activates hidden unit \textstyle i is obtained by setting each pixel \textstyle x_j (for all 100 pixels, \textstyle j=1,\ldots, 100) to

\begin{align}
x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.
\end{align}
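
A minimal MATLAB sketch of this normalization (names assumed): each row i of W1 holds the weights into hidden unit i, so dividing the row by its Euclidean norm and reshaping it to 10x10 gives the image that maximally activates that unit.

    % Illustrative weight matrix: 25 hidden units, 10x10 = 100 input pixels
    W1 = randn(25, 100);

    % x_j = W1(i,j) / sqrt(sum_j W1(i,j)^2), i.e. normalize each row to unit norm
    Xmax = W1 ./ repmat(sqrt(sum(W1.^2, 2)), 1, size(W1, 2));

    % The input image that maximally activates hidden unit 1, as a 10x10 patch
    img1 = reshape(Xmax(1, :), 10, 10);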
