Reproducing Exploding and Vanishing Gradients

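When training an RNN model, you may well reach a point where the loss stops changing no matter how long you keep training, as if time had frozen. At that point most of the parameters have probably stopped changing, which means you have run into the vanishing gradient problem.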

How it works

\[\begin{equation*} \sigma(x)=\frac{1}{1+e^{-x}} \end{equation*} \]

Let \(\sigma\) be the sigmoid function defined above, and consider the following system of equations:

\[\begin{equation*} \begin{cases} z_0=2w_0+b_0 \\ z_1=\sigma(z_0)w_1+b_1 \\ z_2=\sigma(z_1)w_2+b_2 \\ z_3=\sigma(z_2)w_3+b_3 \\ z_4=\sigma(z_3)w_4+b_4 \\ z_5=\sigma(z_4)w_5+b_5 \\ z=\sigma(z_5) \\ \end{cases} \end{equation*} \]

Suppose \(z\) is the loss function of some model, and take the partial derivative of \(z\) with respect to \(w_0\):

\[\frac{\partial z}{\partial w_0}=2\sigma'(z_5)w_5\sigma'(z_4)w_4\sigma'(z_3)w_3\sigma'(z_2)w_2\sigma'(z_1)w_1\sigma'(z_0) \]
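To make the origin of each factor explicit, the product can be read off from the chain rule applied link by link. This is only a restatement of the formula above under the same definitions, with no new symbols introduced:

\[\frac{\partial z}{\partial w_0}=\frac{\partial z}{\partial z_5}\cdot\frac{\partial z_5}{\partial z_4}\cdot\frac{\partial z_4}{\partial z_3}\cdot\frac{\partial z_3}{\partial z_2}\cdot\frac{\partial z_2}{\partial z_1}\cdot\frac{\partial z_1}{\partial z_0}\cdot\frac{\partial z_0}{\partial w_0} \]

\[\frac{\partial z}{\partial z_5}=\sigma'(z_5),\qquad \frac{\partial z_i}{\partial z_{i-1}}=w_i\sigma'(z_{i-1})\ (i=1,\dots,5),\qquad \frac{\partial z_0}{\partial w_0}=2 \]

Each additional layer contributes one more factor of the form \(w_i\sigma'(z_{i-1})\), so the size of these factors decides whether the product shrinks or grows with depth.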

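\(\sigma'(z_i)=\sigma(z_i)(1-\sigma(z_i))\) lies in \((0, 0.25]\), so it is always well below 1. Multiplying six such factors drives the result very close to 0. This makes the gradient of \(w_0\) tiny, so \(w_0\) changes extremely slowly during training; that is the cause of vanishing gradients. Conversely, when the \(w_i\) are large enough that the factors \(w_i\sigma'(z_{i-1})\) exceed 1 in magnitude, multiplying five such large numbers makes the gradient of \(w_0\) huge, and a single update may overshoot the extremum; that is the cause of exploding gradients.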
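As a rough numerical illustration of the product above, here is a minimal pure-Python sketch of the six-layer scalar chain. The helper names (`forward`, `grad_w0`, `numeric_grad_w0`) and the choice of 0.5 for every weight and bias are assumptions made for this example, not part of the original post. It evaluates the analytic product and compares it with a central finite difference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), at most 0.25 (reached at x = 0)
    s = sigmoid(x)
    return s * (1.0 - s)


def forward(ws, bs, x=2.0):
    """Return the pre-activations z_0 .. z_n of the scalar chain."""
    zs = []
    a = x
    for w, b in zip(ws, bs):
        z = a * w + b
        zs.append(z)
        a = sigmoid(z)
    return zs


def loss(ws, bs, x=2.0):
    # z = sigma(z_n)
    return sigmoid(forward(ws, bs, x)[-1])


def grad_w0(ws, bs, x=2.0):
    """Analytic dz/dw_0 = x * sigma'(z_0) * prod_{i>=1} w_i * sigma'(z_i)."""
    zs = forward(ws, bs, x)
    g = x * sigmoid_prime(zs[0])
    for i in range(1, len(ws)):
        g *= ws[i] * sigmoid_prime(zs[i])
    return g


def numeric_grad_w0(ws, bs, eps=1e-4):
    """Central finite difference with respect to w_0, as a sanity check."""
    up = [ws[0] + eps] + list(ws[1:])
    down = [ws[0] - eps] + list(ws[1:])
    return (loss(up, bs) - loss(down, bs)) / (2 * eps)


# Assumed initial values: every weight and bias set to 0.5 (six layers).
ws = [0.5] * 6
bs = [0.5] * 6
g_analytic = grad_w0(ws, bs)
g_numeric = numeric_grad_w0(ws, bs)
print("analytic dz/dw0 = {:.3e}".format(g_analytic))
print("numeric  dz/dw0 = {:.3e}".format(g_numeric))
```

With these values every factor \(w_i\sigma'(z_i)\) is well below 1, so the gradient that reaches \(w_0\) is several orders of magnitude smaller than any single factor.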

Computing partial derivatives with TensorFlow

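To reproduce vanishing gradients we need to print the gradient values, so we first look at how to compute partial derivatives with TensorFlow. Suppose a model's loss function is: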

\[\begin{equation*} z=x^2+y^2 \end{equation*} \]

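Starting from (1,1), we use gradient descent to make \(z\) decrease step by step. Because this function is so simple, we can first write the partial derivatives by hand, \(\partial z/\partial x=2x\) and \(\partial z/\partial y=2y\). The code is as follows: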

# coding:utf-8
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
learning_rate = 0.1
def z(x, y):
    return x * x + y * y
x = 1
y = 1
for i in range(10):
    # gradients at the current point, computed before the update
    dx = 2 * x
    dy = 2 * y
    # take one step against the gradient
    x -= dx * learning_rate
    y -= dy * learning_rate
    print("step={},dx={:.4f},x={:.4f},z={:.4f}".format(i, dx, x, z(x, y)))
    print("step={},dy={:.4f},y={:.4f},z={:.4f}".format(i, dy, y, z(x, y)))

Now the same thing with TensorFlow (the code uses the TensorFlow 1.x graph API):

# coding:utf-8
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
import tensorflow as tf
x = tf.Variable(1, dtype=tf.float32, name="x")
y = tf.Variable(1, dtype=tf.float32, name="y")
z = x * x + y * y
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# list of (gradient, variable) pairs, one pair per variable in [x, y]
gra_and_var = optimizer.compute_gradients(z, [x, y])
# op that applies one gradient descent update to x and y
train_step = optimizer.apply_gradients(gra_and_var)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(10):
        sess.run(train_step)
        # evaluated after the update, so the gradients printed below
        # are taken at the new values of x and y
        r_z, r_gv = sess.run([z, gra_and_var])
        print("step={},dx={:.4f},x={:.4f},z={:.4f}".format(i, r_gv[0][0], r_gv[0][1], r_z))
        print("step={},dy={:.4f},y={:.4f},z={:.4f}".format(i, r_gv[1][0], r_gv[1][1], r_z))

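The two programs produce the same values of x, y and z at every step. The printed gradients differ slightly, though: the hand-written loop prints the gradient computed before the update, while the TensorFlow version evaluates gra_and_var after train_step has run, so its dx and dy are the gradients at the already updated x and y. optimizer.compute_gradients(z, [x, y]) takes the partial derivatives of \(z\) with respect to \(x\) and \(y\) and returns a list in which each element is a 2-tuple: the first item is the variable's gradient and the second is the variable itself (its current value once evaluated with sess.run). This makes it possible to watch how the gradients change during training.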
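One way to sanity-check the x and z columns of either program is to compare them with a closed form. Since each update is \(x \leftarrow x-0.1\cdot 2x=0.8x\), after \(k\) updates \(x=y=0.8^k\) and \(z=2\cdot 0.8^{2k}\). The sketch below is a plain-Python restatement of the hand-written loop; the `descend` helper is a name introduced only for this example.

```python
learning_rate = 0.1


def descend(steps, x=1.0, y=1.0):
    """Run plain gradient descent on z = x^2 + y^2 and record each step."""
    history = []
    for _ in range(steps):
        dx, dy = 2 * x, 2 * y          # gradients before the update
        x -= learning_rate * dx
        y -= learning_rate * dy
        history.append((dx, x, x * x + y * y))
    return history


history = descend(10)
for k, (dx, x, z) in enumerate(history):
    # closed form: k + 1 updates have been applied when row k is recorded
    expected_x = 0.8 ** (k + 1)
    expected_z = 2 * 0.8 ** (2 * (k + 1))
    print("step={},dx={:.4f},x={:.4f},z={:.4f},closed_form_x={:.4f},closed_form_z={:.4f}".format(
        k, dx, x, z, expected_x, expected_z))
```

The recorded `dx` here is the gradient before the update, matching the hand-written loop.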

Reproducing vanishing gradients

Define a loss function that is neither too simple nor too complex:

\[\begin{equation*} \begin{cases} z_0=2w_0+b_0 \\ z_1=\sigma(z_0)w_1+b_1 \\ z_2=\sigma(z_1)w_2+b_2 \\ z=\sigma(z_2) \\ \end{cases} \end{equation*} \]
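For reference when reading the printed gradients, the six partial derivatives of this smaller chain can be written out by the chain rule. This is a worked restatement under the definitions above, not something given in the original post:

\[\begin{aligned} \frac{\partial z}{\partial b_2}&=\sigma'(z_2), & \frac{\partial z}{\partial w_2}&=\sigma'(z_2)\sigma(z_1), \\ \frac{\partial z}{\partial b_1}&=\sigma'(z_2)w_2\sigma'(z_1), & \frac{\partial z}{\partial w_1}&=\sigma'(z_2)w_2\sigma'(z_1)\sigma(z_0), \\ \frac{\partial z}{\partial b_0}&=\sigma'(z_2)w_2\sigma'(z_1)w_1\sigma'(z_0), & \frac{\partial z}{\partial w_0}&=2\sigma'(z_2)w_2\sigma'(z_1)w_1\sigma'(z_0) \end{aligned} \]

Moving one layer further from the output multiplies the gradient by one more factor \(w_i\sigma'(z_{i-1})\).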

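Adjust the variables by gradient descent, with the gradients obtained by backpropagation, so that \(z\) keeps decreasing. The code is as follows: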

# coding:utf-8
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division

import tensorflow as tf


w0 = tf.Variable(0.5, dtype=tf.float32, name="w0")
b0 = tf.Variable(0.5, dtype=tf.float32, name="b0")

w1 = tf.Variable(0.5, dtype=tf.float32, name="w1")
b1 = tf.Variable(0.5, dtype=tf.float32, name="b1")

w2 = tf.Variable(0.5, dtype=tf.float32, name="w2")
b2 = tf.Variable(0.5, dtype=tf.float32, name="b2")


# three nested sigmoid layers on the fixed input 2
z = tf.sigmoid(tf.sigmoid(tf.sigmoid(2 * w0 + b0) * w1 + b1) * w2 + b2)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# (gradient, variable) pairs in the order w0, b0, w1, b1, w2, b2
gra_and_var = optimizer.compute_gradients(z, [w0, b0, w1, b1, w2, b2])
train_step = optimizer.apply_gradients(gra_and_var)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train_step)
        r_z, r_gv = sess.run([z, gra_and_var])
        print("step={},dw0={:.4f},w0={:.4f},z={:.4f}".format(i, r_gv[0][0], r_gv[0][1], r_z))
        print("step={},db0={:.4f},b0={:.4f},z={:.4f}".format(i, r_gv[1][0], r_gv[1][1], r_z))
        print("step={},dw1={:.4f},w1={:.4f},z={:.4f}".format(i, r_gv[2][0], r_gv[2][1], r_z))
        print("step={},db1={:.4f},b1={:.4f},z={:.4f}".format(i, r_gv[3][0], r_gv[3][1], r_z))
        print("step={},dw2={:.4f},w2={:.4f},z={:.4f}".format(i, r_gv[4][0], r_gv[4][1], r_z))
        print("step={},db2={:.4f},b2={:.4f},z={:.4f}".format(i, r_gv[5][0], r_gv[5][1], r_z))

The output of the last few steps is:

step=998,dw0=-0.0004,w0=0.5947,z=0.0061
step=998,db0=-0.0002,b0=0.5474,z=0.0061
step=998,dw1=-0.0014,w1=0.9078,z=0.0061
step=998,db1=-0.0017,b1=0.9895,z=0.0061
step=998,dw2=0.0052,w2=-2.2514,z=0.0061
step=998,db2=0.0061,b2=-3.1738,z=0.0061
step=999,dw0=-0.0004,w0=0.5948,z=0.0061
step=999,db0=-0.0002,b0=0.5474,z=0.0061
step=999,dw1=-0.0014,w1=0.9080,z=0.0061
step=999,db1=-0.0017,b1=0.9897,z=0.0061
step=999,dw2=0.0052,w2=-2.2519,z=0.0061
step=999,db2=0.0060,b2=-3.1744,z=0.0061

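At step=999, dw2 is 0.0052, dw1 is -0.0014, and dw0 is -0.0004. dw0 is already very close to 0, which means \(w_0\) changes very slowly during training. And this is only a 3-layer network: with 100 layers the gradient of \(w_0\) would be essentially 0, and \(w_0\) would barely change no matter how long you train. To reproduce exploding gradients, run the same code with larger initial values for the w variables, for example above 100, and observe the effect. Keep in mind that with the biases left at 0.5, weights that large push the sigmoids into saturation, where \(\sigma'\) is almost 0, so the biases may also have to be chosen so that each \(z_i\) stays near 0.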
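The effect of the initial values on \(\partial z/\partial w_0\) can also be checked without TensorFlow. The sketch below evaluates the chain-rule product for the 3-layer chain in plain Python under three settings. The function names and the specific "exploding" parameters (w0=0, b0=0, w1=w2=100, b1=b2=-50, chosen so that every \(z_i\) is exactly 0) are assumptions made for this illustration, not values from the original post.

```python
import math


def sigmoid(x):
    # branch on the sign to avoid overflow in exp for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def grad_w0(ws, bs, x=2.0):
    """dz/dw_0 for z_0 = x*w_0 + b_0, z_i = sigma(z_{i-1})*w_i + b_i, z = sigma(z_n)."""
    z = x * ws[0] + bs[0]
    g = x * sigmoid_prime(z)            # d sigma(z_0) / d w_0
    for w, b in zip(ws[1:], bs[1:]):
        z = sigmoid(z) * w + b
        g *= w * sigmoid_prime(z)       # one more factor per layer
    return g


# Same initial values as the TensorFlow code: everything at 0.5.
g_small = grad_w0([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# Large weights with the biases left at 0.5: the sigmoids saturate.
g_saturated = grad_w0([100.0, 100.0, 100.0], [0.5, 0.5, 0.5])

# Large weights with biases chosen (assumption) so that every z_i is 0,
# where sigma' takes its maximum value 0.25.
g_exploding = grad_w0([0.0, 100.0, 100.0], [0.0, -50.0, -50.0])

print("all 0.5:          dz/dw0 = {:.6f}".format(g_small))
print("w=100, b=0.5:     dz/dw0 = {:.6f}".format(g_saturated))
print("w=100, z_i at 0:  dz/dw0 = {:.6f}".format(g_exploding))
```

In the third setting each factor \(w_i\sigma'(z_i)\) equals \(100\times 0.25=25\), so two such layers already amplify the gradient by a factor of 625, whereas the second setting shows that large weights alone can make the gradient vanish through saturation.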

Posted on 2018-01-18 17:57 by 荷楠仁
