基于浅层神经网络(全连接网络)的强化学习算法(Reinforce) 在训练过程中出现梯度衰退(degenerate)的现象

首先给出一个代码地址:

https://gitee.com/devilmaycry812839668/CartPole-PolicyNetwork

 

强化学习中的策略网络算法。《TensorFlow实战》一书中强化学习部分的策略网络算法,仿真环境为gym的CartPole,本项目是对原书代码进行了部分重构,并加入了些中文注释,同时给出了30次试验的运行结果。

 

 

=======================================

 

 

可以看到,上面的代码是比较简单的Reinforce算法,其中策略函数使用浅层的三层神经网络(全连接),激活函数使用Relu,进行了30次试验,每次试验进行了10000 个episodes的训练,但是神奇的发现这30次试验中居然第5次试验,第21次试验出现了严重的梯度衰退的想象。

 

给出梯度衰退时部分训练结果:

Average reward for episode 1375 : 200.000000.
Average reward for episode 1400 : 200.000000.
Average reward for episode 1425 : 200.000000.
Average reward for episode 1450 : 200.000000.
Average reward for episode 1475 : 200.000000.
Average reward for episode 1500 : 200.000000.
Average reward for episode 1525 : 200.000000.
Average reward for episode 1550 : 192.480000.
Average reward for episode 1575 : 140.440000.
Average reward for episode 1600 : 104.240000.
Average reward for episode 1625 : 20.080000.
Average reward for episode 1650 : 12.560000.
Average reward for episode 1675 : 10.720000.
Average reward for episode 1700 : 11.080000.
Average reward for episode 1725 : 12.000000.
Average reward for episode 1750 : 10.560000.
Average reward for episode 1775 : 11.040000.
Average reward for episode 1800 : 10.360000.
Average reward for episode 1825 : 10.080000.
Average reward for episode 1850 : 10.640000.
Average reward for episode 1875 : 10.360000.
Average reward for episode 1900 : 10.360000.
Average reward for episode 1925 : 10.480000.
Average reward for episode 1950 : 10.360000.
Average reward for episode 1975 : 9.680000.
Average reward for episode 2000 : 10.000000.
Average reward for episode 2025 : 10.720000.
Average reward for episode 2050 : 10.000000.
Average reward for episode 2075 : 10.000000.
Average reward for episode 2100 : 10.520000.
Average reward for episode 2125 : 10.640000.
Average reward for episode 2150 : 9.760000.
Average reward for episode 2175 : 11.040000.
View Code

 

可以看到在第5次和第21次试验中当训练到一定episodes后训练结果下降到极坏的水平(远低于随机策略的结果,随机策略结果应该在26左右),因此我们可以发现这时的训练已经发生了梯度衰退问题,degenerate问题。以前一直以为衰退问题只会出现在深层网络中,没有想到在浅层网络中也发现了衰退现象。

 

 

 

 

查阅相关论文《Skip connections eliminate signulairites》 发现浅层网络也是会出现衰退现象的,解答了自己的疑问,原来浅层神经网络也是可能会出现衰退问题的。

 

posted on 2020-12-07 15:22  Angry_Panda  阅读(470)  评论(1编辑  收藏  举报

导航