Reinforcement Learning
Reinforcement Learning Notes (1)
1 Overview of Reinforcement Learning
With the success of AlphaGo, reinforcement learning (RL) has become one of the hottest research areas in machine learning. Unlike the familiar supervised and unsupervised paradigms, reinforcement learning emphasizes the interaction between an agent and its environment: based on its current state, the agent chooses an action to take; after executing the action, it transitions to the next state and receives from the environment a reward for that transition.
The goal of reinforcement learning is to extract information from this interaction and learn a mapping from states to actions that guides the agent to make the best decision in every state, maximizing the reward it accumulates.
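This interaction loop can be sketched in a few lines of Python. The `WalkEnv` environment below is a hypothetical toy (a one-dimensional walk toward a goal cell), not a standard benchmark; it only illustrates the state → action → next state → reward cycle:

```python
import random

class WalkEnv:
    """Hypothetical toy environment: positions 0..4; reaching 4 ends the episode."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0  # reward only for reaching the goal
        done = self.state == 4
        return self.state, reward, done

env = WalkEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])         # the agent picks an action in its current state
    state, reward, done = env.step(action)  # the environment returns next state and reward
    total_reward += reward
print(total_reward)  # → 1.0 (the goal reward, received exactly once)
```

The `step()` signature here mirrors the convention popularized by OpenAI Gym, where each call returns the next state, the reward, and a done flag.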
2 Elements of Reinforcement Learning
Reinforcement learning is usually formalized as a Markov Decision Process (MDP). Mathematically, an MDP is a five-tuple consisting of a state set, an action set, a state-transition function, a reward function, and a discount factor.
In recent years, research has applied reinforcement learning to more complex MDP variants, such as the Partially Observable Markov Decision Process (POMDP), the Parameterized Action Markov Decision Process (PAMDP), and the Stochastic Game (SG).
State (S): a task can have many states, and we assume the states are equally spaced in time;
Action (A): in every state, at least one action must be available;
Reward (R): after each state transition, the environment immediately returns a numeric reward; the higher the value, the more desirable the resulting state;
Policy (π): given a state s, the policy π always produces a single action a, i.e., a = π(s); π can be a lookup table or a function;
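The elements above can be made concrete with a minimal sketch of the five-tuple and a lookup-table policy. The two-state MDP below is entirely made up (and, for brevity, its transitions are deterministic, whereas a general MDP allows stochastic transitions):

```python
# A hypothetical two-state MDP written out as the five-tuple (S, A, P, R, gamma).
S = ["s0", "s1"]                      # state set
A = ["stay", "go"]                    # action set
P = {                                 # transition function: (state, action) -> next state
    ("s0", "stay"): "s0", ("s0", "go"): "s1",
    ("s1", "stay"): "s1", ("s1", "go"): "s0",
}
R = {                                 # reward function: (state, action) -> reward
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 0.5, ("s1", "go"): 0.0,
}
gamma = 0.9                           # discount factor

# A policy as a lookup table: exactly one action per state, a = pi(s).
pi = {"s0": "go", "s1": "stay"}

# Discounted return of the first three steps under pi, starting from s0.
s, G = "s0", 0.0
for t in range(3):
    a = pi[s]
    G += (gamma ** t) * R[(s, a)]
    s = P[(s, a)]
# discounted return: 1.0 + 0.9*0.5 + 0.81*0.5 = 1.855
print(G)
```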
3 Classification of Reinforcement Learning Algorithms
Reinforcement learning algorithms can be classified in many ways; common examples include dynamic-programming methods for solving MDPs and Q-learning. AlphaGo's victories in its matches against human players were built in part on reinforcement learning algorithms.
Reinforcement learning has also sparked discussion in game theory: reinforcement learning algorithms can be used to solve games, and game theory in turn informs the design of reinforcement learning algorithms. The two complement each other, and game-theoretic ideas can be found throughout these reinforcement learning algorithms.
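As a concrete example of Q-learning, the sketch below learns a tabular Q-function with the standard update Q(s,a) ← Q(s,a) + α·(r + γ·max Q(s′,·) − Q(s,a)) on a hypothetical one-dimensional walk task (positions 0 to 4, reward 1 for reaching the goal). The environment and hyperparameters are illustrative, not taken from any of the papers listed later:

```python
import random

random.seed(0)
N, GOAL = 5, 4                      # positions 0..4; reaching 4 gives reward 1 and ends the episode
actions = [-1, +1]
Q = {(s, a): 0.0 for s in range(N) for a in actions}  # tabular Q-function
alpha, gamma, eps = 0.5, 0.9, 0.1   # learning rate, discount factor, exploration rate

for episode in range(200):
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a: Q[(s, a)])
        s2 = max(0, min(GOAL, s + a))
        r = 1.0 if s2 == GOAL else 0.0
        # no bootstrapping from the terminal state
        target = r if s2 == GOAL else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # Q-learning update
        s = s2

# The learned greedy policy should move right (+1) in every non-terminal state.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)  # expected: {0: 1, 1: 1, 2: 1, 3: 1}
```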
4 Applications of Reinforcement Learning
Classic applications of reinforcement learning include: the nonlinear double-pendulum system (a nonlinear control problem), board games, robots learning to stand and walk, autonomous driving, machine translation, dialogue systems, and game theory. In general, the problems reinforcement learning solves are sequential decision-making problems: problems in which a continuous series of decisions must be made to reach a final goal. Compared with other machine learning methods, reinforcement learning focuses on goal-directed learning from interaction.
Figure: Reinforcement learning for autonomous driving (illustration)
5 Related Papers on Reinforcement Learning
I. The Pioneering Work: DQN
1. Playing Atari with Deep Reinforcement Learning,V. Mnih et al., NIPS Workshop, 2013.
2. Human-level control through deep reinforcement learning, V. Mnih et al., Nature, 2015.
II. DQN Variants (Algorithmic Improvements)
1. Dueling Network Architectures for Deep Reinforcement Learning. Z. Wang et al., arXiv, 2015.
2. Prioritized Experience Replay, T. Schaul et al., ICLR, 2016.
3. Deep Reinforcement Learning with Double Q-learning, H. van Hasselt et al., arXiv, 2015.
4. Dynamic Frame skip Deep Q Network, A. S. Lakshminarayanan et al., IJCAI Deep RL Workshop, 2016.
5. Deep Exploration via Bootstrapped DQN, I. Osband et al., arXiv, 2016.
6. Learning values across many orders of magnitude, H. van Hasselt et al., NIPS, 2016.
7. Massively Parallel Methods for Deep Reinforcement Learning, A. Nair et al., ICML Workshop, 2015.
8. State of the Art Control of Atari Games Using Shallow Reinforcement Learning
9. Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening (updated 11.13)
10. Deep Reinforcement Learning with Averaged Target DQN (updated 11.14)
11. Safe and Efficient Off-Policy Reinforcement Learning (updated 12.20)
12. The Predictron: End-To-End Learning and Planning (updated 1.3)
III. DQN Variants (Model Improvements)
1. Deep Recurrent Q-Learning for Partially Observable MDPs, M. Hausknecht and P. Stone, arXiv, 2015.
2. Deep Attention Recurrent Q-Network
3. Control of Memory, Active Perception, and Action in Minecraft, J. Oh et al., ICML, 2016.
4. Progressive Neural Networks
5. Language Understanding for Text-based Games Using Deep Reinforcement Learning
6. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks
7. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
8. Recurrent Reinforcement Learning: A Hybrid Approach
9. Value Iteration Networks, NIPS, 2016 (updated 12.20)
10. MazeBase: A Sandbox for Learning from Games (updated 12.20)
11. Strategic Attentive Writer for Learning Macro-Actions (updated 12.20)
IV. Policy-Gradient-Based Deep Reinforcement Learning
Deep policy gradients:
1. End-to-End Training of Deep Visuomotor Policies
2. Learning Deep Control Policies for Autonomous Aerial Vehicles with MPC-Guided Policy Search
3. Trust Region Policy Optimization
Deep actor-critic algorithms:
1. Deterministic Policy Gradient Algorithms
2. Continuous control with deep reinforcement learning
3. High-Dimensional Continuous Control Using Generalized Advantage Estimation
4. Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies
5. Deep Reinforcement Learning in Parameterized Action Space
6. Memory-based control with recurrent neural networks
7. Terrain-adaptive locomotion skills using deep reinforcement learning
8. Sample Efficient Actor-Critic with Experience Replay (updated 11.13)
Search and supervision:
1. End-to-End Training of Deep Visuomotor Policies
2. Interactive Control of Diverse Complex Characters with Neural Networks
Improved exploration in continuous action spaces:
1. Curiosity-driven Exploration in Deep Reinforcement Learning via Bayesian Neural Networks
Combining policy gradients and Q-learning:
1. Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic (updated 11.13)
2. PGQ: Combining Policy Gradient and Q-Learning (updated 11.13)
Other policy-gradient papers:
1. Gradient Estimation Using Stochastic Computation Graphs
2. Continuous Deep Q-Learning with Model-based Acceleration
3. Benchmarking Deep Reinforcement Learning for Continuous Control
4. Learning Continuous Control Policies by Stochastic Value Gradients
5. Generalizing Skills with Semi-Supervised Reinforcement Learning (updated 12.20)
V. Hierarchical DRL
1. Deep Successor Reinforcement Learning
2. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
3. Hierarchical Reinforcement Learning using Spatio-Temporal Abstractions and Deep Neural Networks
VI. Multi-Task and Transfer Learning in DRL
1. ADAAPT: A Deep Architecture for Adaptive Policy Transfer from Multiple Sources
2. A Deep Hierarchical Approach to Lifelong Learning in Minecraft
3. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning
4. Progressive Neural Networks
5. Universal Value Function Approximators
6. Multi-task Learning with Deep Model Based Reinforcement Learning (updated 11.14)
7. Modular Multitask Reinforcement Learning with Policy Sketches (updated 11.14)
VII. DRL Models with External Memory
1. Control of Memory, Active Perception, and Action in Minecraft
2. Model-Free Episodic Control
VIII. Exploration and Exploitation in DRL
1. Action-Conditional Video Prediction using Deep Networks in Atari Games
2. Curiosity-driven Exploration in Deep Reinforcement Learning via Bayesian Neural Networks
3. Deep Exploration via Bootstrapped DQN
4. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
5. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
6. Unifying Count-Based Exploration and Intrinsic Motivation
7. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning (updated 11.14)
8. Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning (updated 11.14)
9. VIME: Variational Information Maximizing Exploration (updated 12.20)
IX. Multi-Agent DRL
1. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks
2. Multiagent Cooperation and Competition with Deep Reinforcement Learning
X. Inverse DRL
1. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
2. Maximum Entropy Deep Inverse Reinforcement Learning
3. Generalizing Skills with Semi-Supervised Reinforcement Learning (updated 11.14)
XI. Exploration + Supervised Learning
1. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning
2. Better Computer Go Player with Neural Network and Long-term Prediction
3. Mastering the game of Go with deep neural networks and tree search, D. Silver et al., Nature, 2016.
XII. Asynchronous DRL
1. Asynchronous Methods for Deep Reinforcement Learning
2. Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU (updated 11.14)
XIII. Harder Game Scenarios
1. Strategic Attentive Writer for Learning Macro-Actions
2. Unifying Count-Based Exploration and Intrinsic Motivation
XIV. One Network, Multiple Games
1. Universal Value Function Approximators
2. Learning values across many orders of magnitude
XV. Texas Hold'em Poker
1. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games
2. Fictitious Self-Play in Extensive-Form Games
3. Smooth UCT search in computer poker
XVI. Doom
1. ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning
2. Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning
3. Playing FPS Games with Deep Reinforcement Learning
4. Learning to Act by Predicting the Future (updated 11.13)
5. Deep Reinforcement Learning From Raw Pixels in Doom (updated 11.14)
XVII. Large Action Spaces
1. Deep Reinforcement Learning in Large Discrete Action Spaces
XVIII. Parameterized Continuous Action Spaces
1. Deep Reinforcement Learning in Parameterized Action Space
XIX. Deep Models
1. Learning Visual Predictive Models of Physics for Playing Billiards
2. Learning Continuous Control Policies by Stochastic Value Gradients
3. Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models
4. Action-Conditional Video Prediction using Deep Networks in Atari Games
5. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
XX. DRL Applications
Robotics:
1. Trust Region Policy Optimization
2. Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control
3. Path Integral Guided Policy Search
4. Memory-based control with recurrent neural networks
5. Learning Deep Neural Network Policies with Continuous Memory States
6. High-Dimensional Continuous Control Using Generalized Advantage Estimation
7. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
8. End-to-End Training of Deep Visuomotor Policies
9. DeepMPC: Learning Deep Latent Features for Model Predictive Control
10. Deep Visual Foresight for Planning Robot Motion
11. Deep Reinforcement Learning for Robotic Manipulation
12. Continuous Deep Q-Learning with Model-based Acceleration
13. Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search
14. Asynchronous Methods for Deep Reinforcement Learning
15. Learning Continuous Control Policies by Stochastic Value Gradients
Machine translation:
1. Simultaneous Machine Translation using Deep Reinforcement Learning
Object localization:
1. Active Object Localization with Deep Reinforcement Learning
Target-driven visual navigation:
1. Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning
Automatic hyperparameter tuning:
1. Using Deep Q-Learning to Control Optimization Hyperparameters
Dialogue systems:
1. Deep Reinforcement Learning for Dialogue Generation
2. SimpleDS: A Simple Deep Reinforcement Learning Dialogue System
3. Strategic Dialogue Management via Deep Reinforcement Learning
Video prediction:
1. Action-Conditional Video Prediction using Deep Networks in Atari Games
Text-to-speech:
1. WaveNet: A Generative Model for Raw Audio
Text generation:
1. Generating Text with Deep Reinforcement Learning
Text-based games:
1. Language Understanding for Text-based Games Using Deep Reinforcement Learning
Radio control and signal detection:
1. Deep Reinforcement Learning Radio Control and Signal Detection with KeRLym, a Gym RL Agent
Learning to perform physics experiments with DRL:
1. Learning to Perform Physics Experiments via Deep Reinforcement Learning (updated 11.13)
Accelerating convergence with DRL:
1. Deep Reinforcement Learning for Accelerating the Convergence Rate (updated 11.14)
Designing neural networks with DRL:
1. Designing Neural Network Architectures using Reinforcement Learning (updated 11.14)
2. Tuning Recurrent Neural Networks with Reinforcement Learning (updated 11.14)
3. Neural Architecture Search with Reinforcement Learning (updated 11.14)
Traffic signal control:
1. Using a Deep Reinforcement Learning Agent for Traffic Signal Control (updated 11.14)
Autonomous driving:
1. CARMA: A Deep Reinforcement Learning Approach to Autonomous Driving (updated 12.20)
2. Deep Reinforcement Learning for Simulated Autonomous Vehicle Control (updated 12.20)
3. Deep Reinforcement Learning framework for Autonomous Driving (updated 12.20)
XXI. Other Directions
Avoiding dangerous states:
1. Combating Deep Reinforcement Learning's Sisyphean Curse with Intrinsic Fear (updated 11.14)
On-policy vs. off-policy updates in DRL:
1. On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning (updated 11.14)