Survey of Open-Source Code for (Meta-)Reinforcement Learning


Local code: https://github.com/lucifer2859/meta-RL

Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html

 

I. RL-Adventure

1. Deep Q-Learning:

2. Policy Gradients:

  • https://github.com/haarnoja/sac
    • Environment: TensorFlow, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
    • Experiment: not run;
  • https://github.com/denisyarats/pytorch_sac
    • Environment: PyTorch, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
    • Experiment: not run;
  • http://github.com/rail-berkeley/softlearning/
    • Environment: TensorFlow, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, second version, with the state-value function V removed; see the critic-target sketch after this list);
    • Experiment: not run;
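The difference between the "first version" and "second version" of SAC noted above lies mainly in the critic: the first paper trains a separate state-value network V(s), while the later version drops it and bootstraps from twin target Q networks plus the entropy term. The sketch below illustrates only that difference; the network objects (target Q/V modules, a `policy.sample` returning an action and its log-probability) are hypothetical placeholders, not code from the repositories listed above.

```python
# Hedged sketch of how the SAC critic target differs between the two paper versions.
# All networks here are hypothetical placeholders (PyTorch modules assumed).
import torch

def sac_v1_q_target(reward, done, next_obs, v_target_net, gamma=0.99):
    # First version: bootstrap from a separate target state-value network V(s'),
    # which is itself regressed toward E_a[min Q(s,a) - alpha * log pi(a|s)].
    with torch.no_grad():
        return reward + gamma * (1.0 - done) * v_target_net(next_obs)

def sac_v2_q_target(reward, done, next_obs, policy, q1_target, q2_target,
                    alpha=0.2, gamma=0.99):
    # Second version: the V network is removed; the target uses the minimum of two
    # target Q networks at a fresh policy action, plus the entropy bonus directly.
    with torch.no_grad():
        next_action, next_log_prob = policy.sample(next_obs)  # assumed interface
        q_min = torch.min(q1_target(next_obs, next_action),
                          q2_target(next_obs, next_action))
        return reward + gamma * (1.0 - done) * (q_min - alpha * next_log_prob)
```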

3. Both:

  • https://github.com/ShangtongZhang/DeepRL
    • Environment: PyTorch, GPU;
    • Task: Atari, MuJoCo;
    • Model: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
    • Experiment: runs successfully;
  • https://github.com/astooke/rlpyt
    • Environment: PyTorch, GPU;
    • Task: Atari;
    • Model: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
      • Policy Gradient: A2C, PPO.
      • Replay Buffers (supporting both DQN + QPG): non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory and reconstructs multi-frame observations; a toy sketch of this frame-based idea follows after this list).
      • Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
      • Q-Function Policy Gradient: DDPG, TD3, SAC.
    • Experiment: runs successfully, no bugs;
  • https://github.com/vitchyr/rlkit
    • Environment: PyTorch, GPU;
    • Task: gym[all];
    • Model: Skew-Fit, RIG, TDM, HER, DQN, SAC (new version), TD3, AWAC;
    • Experiment: not run;
  • https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
    • Environment: PyTorch;
    • Task: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
    • Model: DQN, DQN with Fixed Q Target, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
    • Experiment: some models run successfully on some tasks (e.g., SAC-Discrete fails to run on Atari);
  • https://github.com/hill-a/stable-baselines
    • Environment: TensorFlow;
  • https://github.com/openai/baselines
    • Environment: TensorFlow;
    • Description: OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
    • Model: A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO1 (obsolete version, left here temporarily), PPO2, TRPO;
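The rlpyt entry above mentions a frame-based replay buffer for Atari that stores each unique frame once and reconstructs the stacked multi-frame observation at sample time. Below is a minimal, self-contained toy version of that idea (uniform sampling, episode boundaries ignored); it illustrates the memory-saving trick, not rlpyt's actual implementation.

```python
# Toy frame-based replay buffer: store single 84x84 frames once, rebuild the
# stacked n-frame observation on demand. Illustrative only, not rlpyt code.
import numpy as np

class FrameReplayBuffer:
    def __init__(self, capacity=100_000, frame_shape=(84, 84), n_stack=4):
        self.frames = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.n_stack = capacity, n_stack
        self.idx, self.full = 0, False

    def append(self, frame, action, reward):
        self.frames[self.idx] = frame          # each frame stored once, not 4x per transition
        self.actions[self.idx] = action
        self.rewards[self.idx] = reward
        self.idx = (self.idx + 1) % self.capacity
        self.full = self.full or self.idx == 0

    def _stacked_obs(self, t):
        # Reconstruct the n_stack most recent frames ending at index t.
        ids = [(t - k) % self.capacity for k in reversed(range(self.n_stack))]
        return self.frames[ids]                # shape: (n_stack, 84, 84)

    def sample(self, batch_size):
        # Assumes the buffer already holds more than n_stack + 1 frames; a real
        # implementation would also mask samples that cross episode boundaries.
        high = self.capacity if self.full else self.idx
        ts = np.random.randint(self.n_stack, high - 1, size=batch_size)
        obs = np.stack([self._stacked_obs(t) for t in ts])
        next_obs = np.stack([self._stacked_obs(t + 1) for t in ts])
        return obs, self.actions[ts], self.rewards[ts], next_obs
```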

 

II. Meta Learning (Learn to Learn)

1. Platform:

 

III. Meta-RL

1. Learning to Reinforcement Learn: CogSci 2017

  • https://github.com/awjuliani/Meta-RL
    • Environment: TensorFlow, CPU;
    • Task: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit, Contextual Bandit, GridWorld
      • A3C-Meta-Bandit - set of bandit tasks described in the paper, including independent, dependent, and restless bandits.
      • A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate the reward-giving arm in each episode.
      • A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
    • Model: one-layer LSTM A3C [Figure 1(a), without the Enc layer];
    • Experiment: runs successfully, no bugs; training converges; results roughly match the paper, but performance does not reach the paper's reported level with the current hyperparameters; the local code makes slight modifications, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL
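For reference, here is a toy restatement of the dependent two-armed bandit and the per-step agent input described in the paper (previous one-hot action, previous reward, and normalized timestep fed to the LSTM). The difficulty gaps and the interface are assumptions chosen for illustration; this is not the awjuliani/Meta-RL code.

```python
# Toy dependent bandit and recurrent-agent input for Learning to Reinforcement Learn.
# Illustrative only; difficulty gaps and interface are assumptions, not the repo's code.
import numpy as np

class DependentBandit:
    def __init__(self, difficulty="easy", episode_len=100):
        gaps = {"easy": 0.4, "medium": 0.25, "hard": 0.1}  # assumed values
        p = 0.5 + np.random.choice([-1, 1]) * gaps[difficulty]
        self.probs = np.array([p, 1.0 - p])   # dependent arms: probabilities sum to 1
        self.episode_len, self.t = episode_len, 0

    def pull(self, arm):
        self.t += 1
        reward = float(np.random.rand() < self.probs[arm])
        return reward, self.t >= self.episode_len

def agent_input(prev_action, prev_reward, t, n_arms=2, episode_len=100):
    # The LSTM policy conditions on its own past actions and rewards, which is what
    # lets it identify the good arm within an episode without any weight updates.
    one_hot = np.zeros(n_arms, dtype=np.float32)
    if prev_action is not None:
        one_hot[prev_action] = 1.0
    return np.concatenate([one_hot, [prev_reward], [t / episode_len]]).astype(np.float32)
```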

2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017

  • https://github.com/mwufi/meta-rl-bandits
    • Environment: PyTorch, CPU;
    • Task: Independent Bandit;
    • Model: two-layer LSTM REINFORCE;
    • Experiment: runs successfully, no bugs; the model does not match the paper, whose RNN is a GRU (a GRU-based sketch follows after this list); training does not converge with the current hyperparameters;
  • https://github.com/VashishtMadhavan/rl2
    • Environment: TensorFlow, CPU;
    • Task: Dependent Bandit;
    • Model: one-layer LSTM A3C [without the Enc layer];
    • Experiment: fails to run: gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0;
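As noted above, the mwufi implementation uses an LSTM, whereas the RL^2 paper uses a GRU whose per-step input is the observation together with the previous action, previous reward, and termination flag, and whose hidden state is carried across episodes of the same sampled MDP. Below is a minimal PyTorch sketch of such a policy head; the class and its interface are assumptions for illustration, not code from either repository.

```python
# Minimal sketch of an RL^2-style recurrent policy with a GRU core (the paper's choice),
# conditioned each step on [obs, previous one-hot action, previous reward, done flag].
# Hypothetical illustration, not code from the repositories listed above.
import torch
import torch.nn as nn

class RL2GRUPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=256):
        super().__init__()
        in_dim = obs_dim + n_actions + 2        # obs + one-hot action + reward + done
        self.gru = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs, prev_action, prev_reward, prev_done, hidden=None):
        # All inputs: (batch, time, feature). The hidden state is NOT reset between
        # episodes of the same MDP -- carrying it across episodes is what lets the
        # recurrent weights implement a "fast" RL algorithm inside the trial.
        x = torch.cat([obs, prev_action, prev_reward, prev_done], dim=-1)
        out, hidden = self.gru(x, hidden)
        return self.policy_head(out), self.value_head(out), hidden
```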

3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017

4. Evolved Policy Gradients (EPG): NeurIPS 2018

5. A Simple Neural Attentive Meta-Learner (SNAIL): ICLR 2018

6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019

  • https://github.com/katerakelly/oyster
    • Environment: PyTorch, GPU;
    • Task: MuJoCo;
    • Model: PEARL (SAC-based);
    • Experiment: running docker build . -t pearl during Docker setup fails; after abandoning Docker and installing the required packages locally, the code runs successfully; before creating the local environment, run conda config --set restore_free_channel true first, otherwise most of the pinned package versions cannot be found and environment creation fails; for related issues see the homepage of Chains朱朱 on 博客园 (cnblogs.com).
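The core of PEARL is a probabilistic context encoder: each context transition is mapped to a Gaussian factor over a latent task variable z, and the task posterior is the product of these factors, which has a closed form for diagonal Gaussians. The sketch below illustrates that computation only; the encoder architecture and dimensions are assumptions, not the oyster repository's network.

```python
# Hedged sketch of PEARL's probabilistic context encoder: per-transition Gaussian
# factors multiplied into a task posterior q(z|c). Illustrative placeholder network.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim=5, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # per-factor mean and log-variance
        )

    def forward(self, context):
        # context: (num_transitions, transition_dim), each row one (s, a, r, s') tuple.
        mu, log_var = self.net(context).chunk(2, dim=-1)
        var = torch.exp(log_var).clamp(min=1e-7)
        # Product of diagonal Gaussians: posterior precision is the sum of precisions,
        # posterior mean is the precision-weighted mean.
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        z = post_mu + post_var.sqrt() * torch.randn_like(post_mu)  # reparameterized sample
        return z, post_mu, post_var
```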

7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020

  • http://louiskirsch.com/code/metagenrl
    • Environment: TensorFlow, GPU;
    • Task: MuJoCo;
    • Model: MetaGenRL;
    • Experiment: running python ray_experiments.py train hits bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;

 
