(元)强化学习开源代码调研

(元)强化学习相关开源代码调研

本地代码：https://github.com/lucifer2859/meta-RL

元强化学习简介：https://www.cnblogs.com/lucifer1997/p/13603979.html

一、RL-Adventure

1、Deep Q-Learning：

参见先前的Blog

https://www.cnblogs.com/lucifer1997/p/13458563.html；
https://github.com/lucifer2859/DQN；

https://github.com/Kaixhin/Rainbow
- 环境：PyTorch，GPU;
- 任务：Atari;
- 模型：Rainbow；
- 实验：成功运行；
https://github.com/TianhongDai/hindsight-experience-replay
- 环境：PyTorch，GPU(Not Recommended, Better Use CPU)；
- 任务：MuJoCo；
- 模型：HER；
- 实验：未运行；

2、Policy Gradients：

https://github.com/higgsfield/RL-Adventure-2

环境：PyTorch，GPU；
任务：Gym；
模型：A2C，GAE，PPO，ACER，DDPG，TD3，SAC，GAIL，HER；
实验：成功运行；本地代码基于bug、issue以及性能对其进行修改，参见https://github.com/lucifer2859/Policy-Gradients；在本地代码中，所有模型(HER除外)均可以收敛且获得较好性能；HER的问题参见https://github.com/higgsfield/RL-Adventure-2/issues/14；SAC实现似乎与原文不符(参见https://github.com/higgsfield/RL-Adventure-2/issues/11)；A2C实验仅在CartPole-v0上能够收敛；

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail；

环境：PyTorch/TensorFlow GPU;
任务：Atari，MuJoCo，PyBullet (including Racecar, Minitaur and Kuka)，DeepMind Control Suite；
模型：A2C，PPO，ACKTR，GAIL
实验：未运行；

https://github.com/ikostrikov/pytorch-a3c

环境：PyTorch，CPU；
任务：Atari；
模型：A3C；
实验：初始运行失败，NotImplementedError；参考https://github.com/ikostrikov/pytorch-a3c/issues/66#issuecomment-559785590修改envs.py即可解决；最终成功运行；

https://github.com/haarnoja/sac

环境：TensorFlow，GPU；
任务：Continuous Control Tasks (MuJoCo)；
模型：Soft Actor-Critic（SAC，第一版，模型带有状态价值函数V）；
实验：未运行；

https://github.com/denisyarats/pytorch_sac

环境：PyTorch，GPU；
任务：Continuous Control Tasks (MuJoCo)；
模型：Soft Actor-Critic（SAC，第一版，模型带有状态价值函数V）；
实验：未运行；

http://github.com/rail-berkeley/softlearning/

环境：TensorFlow，GPU；
任务：Continuous Control Tasks (MuJoCo)；
模型：Soft Actor-Critic（SAC，第二版，模型去掉了状态价值函数V）；
实验：未运行；

https://github.com/ku2482/sac-discrete.pytorch
- 环境：PyTorch，GPU；
- 任务：Atari；
- 模型：SAC-Discrete(基于新版连续控制任务下的SAC改进的离散版本)；
- 实验：成功运行；本地代码对其略有修改，参见https://github.com/lucifer2859/sac-discrete-pytorch；训练收敛，但性能与论文描述存在差异；

3、两者兼有：

https://github.com/ShangtongZhang/DeepRL

环境：PyTorch，GPU；
任务：Atari，MuJoCo；
模型：(Double/Dueling/Prioritized) DQN，C51，QR-DQN，(Continuous/Discrete) Synchronous Advantage A2C，N-Step DQN，DDPG，PPO，OC，TD3，COF-PAC，GradientDICE，Bi-Res-DDPG，DAC，Geoff-PAC，QUOTA，ACE；
实验：成功运行；

https://github.com/astooke/rlpyt

环境：PyTorch，GPU；
任务：Atari；
模型：Modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep-q learning, and q-function policy gradient.实验：
- Policy Gradient：A2C, PPO.
- Replay Buffers：(supporting both DQN + QPG) non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory, reconstructs multi-frame observations).
- Deep Q-Learning DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
- Q-Function Policy Gradient DDPG, TD3, SAC.

- 成功运行，无bug；
https://github.com/vitchyr/rlkit

环境：PyTorch，GPU；
任务：gym[all]
模型：Skew-Fit，RIG，TDM，HER，DQN，SAC（新版），TD3，AWAC；
实验：未运行；

p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch: PyTorch implementations of deep reinforcement learning algorithms and environments (github.com)

环境：PyTorch；
任务：CartPole，MountainCar，Bit Flipping，Four Rooms，Long Corridor，Ant-[Maze, Push, Fall]；
模型：DQN，DQN with Fixed Q Target，DDQN，DDQN with Prioritised Experience Replay，Dueling DDQN，REINFORCE，DDPG，TD3，SAC，SAC-Discrete，A3C，A2C，PPO，DQN-HER，DDPG-HER，h-DQN，Stochastic NN-HRL，DIAYN；
实验：部分模型在部分任务上成功运行(例如SAC-Discrete无法在Atari上成功运行)；

https://github.com/hill-a/stable-baselines

环境：TensorFlow；

https://github.com/openai/baselines
- 环境：Tensorflow；
- 介绍：OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
- A2C，ACER，ACKTR，DDPG，DQN，GAIL，HER，PPO1 (obsolete version, left here temporarily)，PPO2，TRPO

GitHub - openai/spinningup: An educational resource to help anyone learn deep reinforcement learning.

环境：TensorFlow/PyTorch
介绍：This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:

a short introduction to RL terminology, kinds of algorithms, and basic theory,
an essay about how to grow into an RL research role,
a curated list of important papers organized by topic,
a well-documented code repo of short, standalone implementations of key algorithms,
and a few exercises to serve as warm-ups.

实验：TD3在MuJoCo任务上运行成功；

quantumiracle/Popular-RL-Algorithms: PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet.. (github.com)

环境：PyTorch/Tensorflow 2.0 + TensorLayer 2.0
介绍：PyTorch和Tensorflow 2.0在OpenAI gym环境和自行实现的Reacher环境中实现了最先进的无模型强化学习算法。算法包括SAC，DDPG，TD3，AC/A2C，PPO，QT-Opt(包括交叉熵方法)，PointNet，Transporter，Recurrent Policy Gradient，Soft Decision Tree，Probabilistic Mixture-of-Experts等。请注意，此repo更多的是我在研究和学习期间实现和测试的算法的个人集合，而不是供使用的官方开源库/包。然而，我认为与其他人分享可能会有所帮助，我期待着对我的实现进行有益的讨论。但我没有花太多时间清理或构建代码。正如您可能注意到的，每个算法可能有几个版本的实现，我特意在这里展示它们，供您参考和比较。此外，该repo仅包含PyTorch实施。对于RL算法的官方库，我提供了以下两个使用TensorFlow 2.0 + TensorLayer 2.0的方案：
- RL Tutorial (Status: Released) contains RL algorithms implementation as tutorials with simple structures.
- RLzoo (Status: Released) is a baseline implementation with high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.
由于Tensorflow 2.0已经包含了动态图形构造而不是静态图形构造，因此在Tensorflow和PyTorch之间传输RL代码就变得非常简单。
实验：PPO在Atari任务上运行性能无法收敛；

GitHub - michaelnny/deep_rl_zoo: A collection of Deep RL algorithms implemented with PyTorch to solve Atari games and classic control tasks like CartPole, LunarLander, and MountainCar.
- 环境：PyTorch

二、Meta Learning (Learn to Learn)

1、Platform：

https://github.com/learnables/learn2learn

三、Meta-RL

1、Learning to Reinforcement Learn：CogSci 2017

https://github.com/awjuliani/Meta-RL

环境：TensorFlow，CPU；
任务：Dependent(Easy, Medium, Hard, Uniform)/Independent/Restless Bandit，Contextual Bandit，GridWorld
- A3C-Meta-Bandit - Set of bandit tasks described in paper. Including: Independent, Dependent, and Restless bandits.
- A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate reward-giving arm in each episode.
- A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomzied each episode and must be learned "on the fly."
模型：one-layer LSTM A3C [Figure 1(a)，无Enc层]；
实验：成功运行，无bug；训练收敛；结果大致相符；性能未达到论文效果(当前超参数)；本地代码对其略有修改，参见https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL；

https://github.com/achao2013/Learning-To-Reinforcement-Learn

环境：MXNet，CPU；
任务：Dependent(Easy, Medium, Hard, Uniform)/Independent/Restless Bandit；
模型：multi-layer LSTM A3C[无Enc层]；
实验：未运行；

https://github.com/lucifer2859/meta-RL/tree/master/L2RL-pytorch
- 环境：PyTorch，CPU；
- 任务：Dependent(Easy, Medium, Hard, Uniform)/Independent/Restless Bandit；
- 模型：one-layer LSTM A3C [Figure 1(a)，with GAE，无Enc层]；
- 实验：成功运行，无bug；训练收敛；结果大致相符；性能未达到论文效果(当前超参数)；

2、RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL²)：ICLR 2017

https://github.com/mwufi/meta-rl-bandits

环境：PyTorch，CPU；
任务：Independent Bandit；
模型：two-layer LSTM REINFORCE；
实验：成功运行，无bug；模型与论文不符，原文RNN模型为GRU；训练不收敛(当前超参数)；

https://github.com/VashishtMadhavan/rl2
- 环境：TensorFlow，CPU；
- 任务：Dependent Bandit；
- 模型：one-layer LSTM A3C [无Enc层]；
- 实验：运行失败，gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0；

3、Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)：ICML 2017

https://github.com/tristandeleu/pytorch-maml-rl
- 环境：PyTorch，GPU；
- 任务：Multi-armed Bandit，Tabular MDP，Continuous Control with MuJoCo，2D Navigation Task；
- 模型：MAML TRPO；
- 实验：初始运行失败，terminate called after throwing an instance of 'c10::Error'；参见https://github.com/tristandeleu/pytorch-maml-rl/issues/40#issuecomment-632598191即可解决；但是出现新问题(AttributeError: Can't pickle local object 'make_env.<locals>._make_env')；参见https://github.com/tristandeleu/pytorch-maml-rl/issues/51即可解决；最终成功运行train.py，但test.py运行失败；bandit-k5-n10不收敛(当前超参数)；
https://github.com/cbfinn/maml_rl
- 环境：the TensorFlow rllab version，CPU；
- 任务：MuJoCo；
- 模型：MAML TRPO；
- 实验：未运行；

4、Evolved Policy Gradients (EPG)：NeurIPS, 2018

https://github.com/openai/EPG
- 环境：Chainer，CPU；
- 任务：MuJoCo；
- 模型：EPG PPO；
- 实验：未运行；

5、A Simple Neural Attentive Meta-Learner：ICLR 2018

https://github.com/chanb/metalearning_RL
- 环境：PyTorch，GPU；
- 任务：Multi-armed Bandit，Tabular MDP；
- 模型：SNAIL，RL²（GRU）+ PPO；
- 实验：成功运行，无bug；

6、Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL)：arXiv: Learning, 2019

https://github.com/katerakelly/oyster
- 环境：PyTorch，GPU；
- 任务：MuJoCo；
- 模型：PEARL (SAC-based)；
- 实验：Docker配置过程中运行docker build . -t pearl失败；放弃Docker配置在本地对相关包进行安装后，可以成功运行；使用本地包需要提前加一句：conda config --set restore_free_channel true，不然找不到大部分特定版本的包，就会导致创建环境失败；相关问题可以咨询Chains朱朱的主页 - 博客园 (cnblogs.com)；

7、Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL)： ICLR 2020

http://louiskirsch.com/code/metagenrl

环境：TensorFlow，GPU；
任务：MuJoCo；
模型：MetaGenRL；
实验：在tensorflow-gpu==1.14.0与tensorflow==1.13.2环境上运行python ray_experiments.py train时都会出现bug；

posted on 2020-09-19 23:50 穷酸秀才大草包阅读(4841) 评论(6) 编辑收藏举报

刷新页面返回顶部

穷酸秀才大艹包

(元)强化学习开源代码调研

(元)强化学习相关开源代码调研

一、RL-Adventure

二、Meta Learning (Learn to Learn)

三、Meta-RL

导航

公告