Theano 3.6 - Exercise: Denoising Autoencoders
Source: http://deeplearning.net/tutorial/dA.html#daa
Denoising Autoencoders (dA)
Note: this section assumes the reader has already gone through (Theano 3.3 - Exercise: Logistic Regression) and (Theano 3.4 - Exercise: Multilayer Perceptron). It also assumes familiarity with the following Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you want to run the code on a GPU, also read GPU.
Note: the code for this section can be downloaded here.
The Denoising Autoencoder (dA) is an extension of the classical autoencoder, and it was introduced as a building block for deep networks in [Vincent08]. We will start with a short introduction to autoencoders (AE).
1. Autoencoders
See section 4.6 of [Bengio09] for an overview of autoencoders. An autoencoder takes an input $x \in [0,1]^d$ and first maps it (with an encoder) to a hidden representation $y \in [0,1]^{d'}$ through a deterministic mapping:

$y = s(Wx + b)$

Here $s$ is a non-linearity such as the sigmoid. The latent representation $y$ (or code) is then mapped back (with a decoder) into a reconstruction $z$ of the same shape as $x$. This mapping is realized through a similar transformation:

$z = s(W'y + b')$

(The prime symbol here does not indicate matrix transposition.) $z$ should be seen as a prediction of $x$, given the code $y$. Optionally, the weight matrix $W'$ of the reverse mapping may be constrained to be the transpose of the forward mapping, $W' = W^T$; this is referred to as tied weights. The parameters of this model ($W$, $b$, $b'$ and, if one does not use tied weights, also $W'$) are optimized such that the average reconstruction error is minimized.

The reconstruction error can be measured in many ways, depending on the appropriate distributional assumptions on the input given the code. The traditional squared error $L(x,z) = \lVert x - z \rVert^2$ can be used. If the input is interpreted as either bit vectors or vectors of bit probabilities, the cross-entropy of the reconstruction can be used:

$L_H(x,z) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right]$
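As a concrete illustration (this snippet is not part of the tutorial code; the toy values are made up), the following plain-numpy sketch evaluates both reconstruction losses on a small binary input and a candidate reconstruction:

import numpy

x = numpy.array([1.0, 0.0, 1.0, 1.0])   # binary input
z = numpy.array([0.9, 0.2, 0.8, 0.6])   # reconstruction produced by a decoder

squared_error = numpy.sum((x - z) ** 2)
cross_entropy = -numpy.sum(x * numpy.log(z) + (1 - x) * numpy.log(1 - z))

print(squared_error)   # 0.25
print(cross_entropy)   # ~1.06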
We hope that the code $y$ is a distributed representation that captures the coordinates along the main factors of variation in the data, much as the projection onto principal components captures the main factors of variation in the data. Indeed, if there is one linear hidden layer (the code) and the mean squared error criterion is used to train the network, then the $k$ hidden units learn to project the input onto the span of the first $k$ principal components of the data. If the hidden layer is non-linear, the autoencoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution. The departure from PCA becomes even more important when we consider stacking multiple encoders (and their corresponding decoders) to build a deep autoencoder [Hinton06] (a stack of linear layers is still just linear).
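As an aside (not from the tutorial), the following numpy sketch makes the PCA connection concrete: it reconstructs centred data from its projection onto the first k principal components. By the equivalence just described, this is also the reconstruction that an optimally trained k-unit linear autoencoder with squared-error loss would produce.

import numpy

rng = numpy.random.RandomState(0)
data = rng.randn(200, 10)                 # 200 examples, 10 features
data -= data.mean(axis=0)                 # centre the data

k = 3
U, S, Vt = numpy.linalg.svd(data, full_matrices=False)
components = Vt[:k]                       # first k principal directions

codes = data.dot(components.T)            # "encode": project onto k components
reconstruction = codes.dot(components)    # "decode": map back to input space

mse = numpy.mean(numpy.sum((data - reconstruction) ** 2, axis=1))
print(mse)  # lower bound on the MSE of a k-unit linear autoencoder on this data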
Because $y$ is viewed as a lossy compression of $x$, it cannot be a good (small-loss) compression for all $x$. Optimization makes it a good compression for the training examples, and hopefully for other inputs as well, but not for arbitrary inputs. That is the sense in which an autoencoder generalizes: it gives low reconstruction error on test examples drawn from the same distribution as the training examples, but generally high reconstruction error on samples chosen at random from the input space.

We want to implement an autoencoder with Theano, in the form of a class, so that it can later be reused when constructing a stacked autoencoder. The first step is to create shared variables for the autoencoder parameters $W$, $b$ and $b'$. (Since we use tied weights, $W^T$ is used in place of $W'$):
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,   # bias of the hidden layer
        bvis=None    # bias of the visible (reconstruction) layer
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input), the number of hidden units (the dimension
        d' of the latent or hidden space) and the corruption level. The
        constructor also receives symbolic variables for the input, weights
        and bias. Such symbolic variables are useful when, for example, the
        input is the result of some computations, or when weights are shared
        between the dA and an MLP layer. When dealing with SdAs this always
        happens: the dA on layer 2 gets as input the output of the dA on
        layer 1, and the weights of the dA are used in the second stage of
        training to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: number random generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        """
        self.n_visible = n_visible  # number of visible units
        self.n_hidden = n_hidden    # number of hidden units

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W` which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so that
            # the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            # if no bias for the visible (reconstruction) layer is given,
            # initialize it to zeros
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden layer
        self.b = bhid
        # b_prime corresponds to the bias of the visible layer
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]  # model parameters W, b, b'
Note that we pass the symbolic input to the autoencoder as a parameter. This is so that we can concatenate layers of autoencoders to form a deep network: the symbolic output (the $y$ above) of layer $k$ will be the symbolic input of layer $k+1$.

We can now express the computation of the latent representation and of the reconstructed signal:
    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)
    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer
        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
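As a minimal sketch of the layer-chaining idea mentioned above (illustrative only, not part of the tutorial file; it assumes the dA class defined here plus the usual imports: numpy, theano.tensor as T, and RandomStreams), the symbolic code of one dA can be fed directly as the symbolic input of the next:

rng = numpy.random.RandomState(123)
theano_rng = RandomStreams(rng.randint(2 ** 30))
x = T.matrix('x')

layer1 = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
            n_visible=28 * 28, n_hidden=500)

# the symbolic output (the code y) of layer 1 becomes the symbolic
# input of layer 2
layer2 = dA(numpy_rng=rng, theano_rng=theano_rng,
            input=layer1.get_hidden_values(x),
            n_visible=500, n_hidden=250)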
Using these functions, we can compute the cost and the updates of one stochastic gradient descent step:
    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
We can now define a function that, applied iteratively, updates the parameters W, b and b_prime so that the reconstruction cost is approximately minimized.
da = dA(
    numpy_rng=rng,
    theano_rng=theano_rng,
    input=x,
    n_visible=28 * 28,
    n_hidden=500
)

cost, updates = da.get_cost_updates(
    corruption_level=0.,
    learning_rate=learning_rate
)

train_da = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size]
    }
)
If there is no constraint besides minimizing the reconstruction error, one might expect an autoencoder with $n$ inputs and an encoding of dimension $n$ (or greater) to simply learn the identity function, merely copying the input without any useful transformation. Such an autoencoder would not distinguish test examples (drawn from the training distribution) from other input configurations.
Somewhat surprisingly, experiments reported in [Bengio07] suggest that, in practice, when trained with stochastic gradient descent, non-linear autoencoders with more hidden units than inputs (called overcomplete) still yield useful representations. (Here, "useful" means that a network taking the encoding as input achieves low classification error.)
A simple explanation is that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden-layer autoencoder with non-linear hidden units needs very small weights in the first (encoding) layer, to bring the non-linearity of the hidden units into their linear regime, and very large weights in the second (decoding) layer. With binary inputs, very large weights are likewise needed to completely minimize the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings that only work well for examples similar to those in the training set, which is what we want. It means that the representation exploits statistical regularities present in the training set, rather than merely learning to replicate the input.
There are other ways by which an autoencoder with more hidden units than inputs can be prevented from learning the identity function and made to capture something useful about the input in its hidden representation. One is to add sparsity (forcing many of the hidden units to be zero or near-zero); sparsity has been exploited very successfully in the literature [Ranzato07] [Lee08]. Another is to add randomness in the transformation from input to reconstruction. This technique is used in Restricted Boltzmann Machines (discussed later in the Restricted Boltzmann Machines (RBM) section), as well as in the Denoising Autoencoder (DAE) introduced below.
2. Denoising Autoencoders
The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity (copying its input), we train the autoencoder to reconstruct the input from a corrupted version of it, obtained by adding noise.

The denoising autoencoder is a stochastic version of the autoencoder. Intuitively, a denoising autoencoder does two things: it encodes the input (preserving the information about the input), and it tries to undo the effect of a corruption process stochastically applied to that input. The latter can only be done by capturing the statistical dependencies between the inputs. The denoising autoencoder can be understood from different perspectives (the manifold learning perspective, the stochastic operator perspective, the bottom-up information-theoretic perspective, the top-down generative model perspective), all of which are explained in [Vincent08]. See also section 7.2 of [Bengio09] for an overview of autoencoders.
In [Vincent08], the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero (that is, the corresponding visible units simply receive zero instead of the pixel value). Hence, for randomly selected subsets of missing patterns, the denoising autoencoder is trying to predict the corrupted (missing) values from the uncorrupted (non-missing) ones. Note that being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works).
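Before looking at the symbolic Theano version, here is a plain-numpy illustration of the corruption step (not part of the dA class; the toy vector is made up): each input component is kept with probability 1 - corruption_level and set to zero otherwise.

import numpy

rng = numpy.random.RandomState(123)
x = rng.rand(8)                      # a toy input vector
corruption_level = 0.3

mask = rng.binomial(n=1, p=1 - corruption_level, size=x.shape)
tilde_x = mask * x                   # roughly 30% of the entries become 0

print(x)
print(tilde_x)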
To convert the autoencoder class into a denoising autoencoder class, all we need to do is add a stochastic corruption step operating on the input. The input can be corrupted in many ways, but in this tutorial we stick to the original corruption mechanism: randomly setting some of the input entries to zero. The code is below:
    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs
        the same and zeroes out a randomly selected subset of size
        ``corruption_level``
        Note : the first argument of theano_rng.binomial is the shape (size)
               of the random numbers that it should produce,
               the second argument is the number of trials,
               the third argument is the probability of success of any trial.

               This will produce an array of 0s and 1s where 1 has a
               probability of 1 - ``corruption_level`` and 0 has a
               probability of ``corruption_level``.

               The binomial function returns int64 data type by default.
               int64 multiplied by the input type (floatX) always returns
               float64. To keep all data in floatX when floatX is float32,
               we set the dtype of the binomial to floatX. As in our case
               the value of the binomial is always 0 or 1, this doesn't
               change the result. This is needed to allow the GPU to work
               correctly, as it only supports float32 for now.

        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input
In the stacked autoencoder class (Stacked Autoencoders), the weights of the dA class have to be shared with those of a corresponding sigmoid layer. For this reason, the constructor of the dA also receives Theano variables pointing to the shared parameters. If those parameters are left as None, new ones are constructed automatically.

The final denoising autoencoder class becomes:
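A hedged sketch of that parameter sharing (it assumes the HiddenLayer class from the multilayer perceptron tutorial is available, and mirrors what the Stacked Autoencoders tutorial does; it is shown here only to illustrate the constructor arguments):

sigmoid_layer = HiddenLayer(rng=rng, input=x,
                            n_in=28 * 28, n_out=500,
                            activation=T.nnet.sigmoid)

# the dA reuses the sigmoid layer's W and b instead of allocating its own,
# so pre-training the dA also updates the MLP layer's parameters
dA_layer = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
              n_visible=28 * 28, n_hidden=500,
              W=sigmoid_layer.W, bhid=sigmoid_layer.b)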
class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al.,
    2008 for more details. If x is the input then equation (1) computes a
    partially destroyed version of x by means of a stochastic mapping q_D.
    Equation (2) computes the projection of the input into the latent space.
    Equation (3) computes the reconstruction of the input, while equation (4)
    computes the reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)

        y = s(W \tilde{x} + b)                                           (2)

        x = s(W' y  + b')                                                (3)

        L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)]      (4)

    """

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input), the number of hidden units (the dimension
        d' of the latent or hidden space) and the corruption level. The
        constructor also receives symbolic variables for the input, weights
        and bias. Such symbolic variables are useful when, for example, the
        input is the result of some computations, or when weights are shared
        between the dA and an MLP layer. When dealing with SdAs this always
        happens: the dA on layer 2 gets as input the output of the dA on
        layer 1, and the weights of the dA are used in the second stage of
        training to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: number random generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W` which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so that
            # the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden
        self.b = bhid
        # b_prime corresponds to the bias of the visible
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]

    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs
        the same and zeroes out a randomly selected subset of size
        ``corruption_level``
        Note : the first argument of theano_rng.binomial is the shape (size)
               of the random numbers that it should produce,
               the second argument is the number of trials,
               the third argument is the probability of success of any trial.

               This will produce an array of 0s and 1s where 1 has a
               probability of 1 - ``corruption_level`` and 0 has a
               probability of ``corruption_level``.

               The binomial function returns int64 data type by default.
               int64 multiplied by the input type (floatX) always returns
               float64. To keep all data in floatX when floatX is float32,
               we set the dtype of the binomial to floatX. As in our case
               the value of the binomial is always 0 or 1, this doesn't
               change the result. This is needed to allow the GPU to work
               correctly, as it only supports float32 for now.

        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input

    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer
        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
3. Putting it All Together
It is now easy to construct an instance of the dA class and train it:
# allocate symbolic variables for the data
index = T.lscalar()    # index to a [mini]batch
x = T.matrix('x')      # the data is presented as rasterized images
#####################################
# BUILDING THE MODEL CORRUPTION 30% #
#####################################

rng = numpy.random.RandomState(123)  # seed for the numpy random generator
theano_rng = RandomStreams(rng.randint(2 ** 30))

da = dA(
    numpy_rng=rng,
    theano_rng=theano_rng,
    input=x,                 # input data
    n_visible=28 * 28,       # number of visible units
    n_hidden=500             # number of hidden units
)

cost, updates = da.get_cost_updates(
    corruption_level=0.3,    # fraction of the input that is corrupted
    learning_rate=learning_rate
)

train_da = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size]
    }
)

start_time = time.clock()  # record the starting time

############
# TRAINING #
############

# go through training epochs
for epoch in xrange(training_epochs):
    # go through training set
    c = []
    for batch_index in xrange(n_train_batches):
        c.append(train_da(batch_index))

    print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

end_time = time.clock()  # record the finishing time

training_time = (end_time - start_time)  # total training time

print >> sys.stderr, ('The 30% corruption code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % (training_time / 60.))
In order to get a feeling for what the network has learned, we plot the filters (defined by the weight matrix). Bear in mind, however, that this does not tell the whole story, since we neglect the biases and plot the weights only up to a multiplicative constant (the weights are rescaled to values between 0 and 1).

To plot the filters we will need the help of tile_raster_images (see Plotting Samples and Filters), so we encourage the reader to study it. With the additional help of PIL (the Python Imaging Library), the following lines of code save the filters as an image:
image = Image.fromarray(tile_raster_images(
    X=da.W.get_value(borrow=True).T,
    img_shape=(28, 28), tile_shape=(10, 10),
    tile_spacing=(1, 1)))
image.save('filters_corruption_30.png')
4. Running the Code
python dA.py
The resulting filters when no noise is applied:
The resulting filters when 30% noise is applied: