rlpyt (Deep Reinforcement Learning in PyTorch)
rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch
GitHub: https://github.com/astooke/rlpyt
Introduction (CH): https://baijiahao.baidu.com/s?id=1646437256939374418&wfr=spider&for=pc
Introduction (EN): https://bair.berkeley.edu/blog/2019/09/24/rlpyt/
Documentation: https://rlpyt.readthedocs.io/en/latest/
arXiv: https://arxiv.org/abs/1909.01500
Blog: rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch - 穷酸秀才大艹包 - 博客园 (cnblogs.com)
Installation
- Clone this repository to the local machine.
- Install the anaconda environment appropriate for the machine.
conda env create -f linux_[cpu|cuda9|cuda10].yml
source activate rlpyt
- Either A) Edit the PYTHONPATH to include the rlpyt directory, or B) Install as editable python package
# A
export PYTHONPATH=path_to_rlpyt:$PYTHONPATH
# B
pip install -e .
- Install any packages / files pertaining to desired environments (e.g. gym, mujoco). Atari is included.
pip install gym
Hint: for easy access, add the following to your ~/.bashrc (might substitute conda for source).
alias rlpyt="source activate rlpyt; cd path_to_rlpyt"
rlpyt/examples/atari_dqn_async_cpu.py: set n_socket=1;
rlpyt/examples/atari_dqn_async_gpu.py: set n_socket=1;
rlpyt/examples/atari_dqn_async_serial.py: set n_socket=1;
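These async examples build their affinity inside the script; a minimal sketch of where n_socket=1 goes, assuming the make_affinity helper from rlpyt.utils.launching.affinity and illustrative core/GPU counts (not the scripts' actual values):

from rlpyt.utils.launching.affinity import make_affinity

# Hypothetical single-socket machine; the point is forcing n_socket=1 so the
# affinity encoding does not try to spread the run across multiple CPU sockets.
affinity = make_affinity(
    run_slot=0,
    n_cpu_core=8,        # illustrative
    n_gpu=1,             # illustrative
    async_sample=True,   # async examples need the async affinity structure
    n_socket=1,          # the setting these notes refer to
)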
Recurrent DQN (R2D1)
--------------------
rlpyt.utils.launching.affinity.encode_affinity(n_cpu_core=1, n_gpu=0, contexts_per_gpu=1, gpu_per_run=1, cpu_per_run=1, cpu_per_worker=1, cpu_reserved=0, hyperthread_offset=None, n_socket=None, run_slot=None, async_sample=False, sample_gpu_per_run=0, optim_sample_share_gpu=False, alternating=False, set_affinity=True)
Encodes the hardware configuration into a string (with meanings defined in this file), which can be passed as a command-line argument to invoke the training script. Use this in the overall experiment setup script to specify the computer and per-run resources into run_experiments(). We refer to an "experiment" as an individual learning run, i.e. one set of hyperparameters, which does not interact with other runs. (A usage sketch follows the parameter list below.)
Parameters:
- n_cpu_core (int) – Total number of physical cores to use on machine (not virtual)
- n_gpu (int) – Total number of GPUs to use on machine
- contexts_per_gpu (int) – How many experiments to share each GPU
- gpu_per_run (int) – How many GPUs to use per experiment (for multi-GPU optimization)
- cpu_per_run (int) – If not using GPU, specify how many cores per experiment
- cpu_per_worker (int) – CPU cores per sampler worker; 1 unless environment is multi-threaded
- cpu_reserved (int) – Number of CPUs to reserve per GPU, and not allow sampler to use them
- hyperthread_offset (int) – Typically the number of physical cores, since they are labeled 0-x, and hyperthreads as (x+1)-2x; use 0 to disable hyperthreads, None to auto-detect
- n_socket (int) – Number of CPU sockets in machine; tries to keep CPUs grouped on same socket, and match socket-to-GPU affinity
- run_slot (int) – Which hardware slot to use; leave None into run_experiments(), but specified for individual train script
- async_sample (bool) – True if asynchronous sampling/optimization mode; different affinity structure needed
- sample_gpu_per_run (int) – In asynchronous mode only, number of action-server GPUs per experiment
- optim_sample_share_gpu (bool) – In asynchronous mode only, whether to use same GPU(s) for both training and sampling
- alternating (bool) – True if using alternating sampler (will make more worker assignments)
- set_affinity (bool) – False to disable runner and sampler from setting cpu affinity via psutil, maybe inappropriate in cloud machines.
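For example, a sketch of encoding a hypothetical machine (the numbers are illustrative, not tied to any script in the repo):

from rlpyt.utils.launching.affinity import encode_affinity

# Hypothetical machine: 40 physical cores, 8 GPUs, 2 CPU sockets, hyperthreading on.
affinity_code = encode_affinity(
    n_cpu_core=40,
    n_gpu=8,
    gpu_per_run=2,           # multi-GPU optimization per experiment
    cpu_per_worker=1,
    hyperthread_offset=40,   # physical cores labeled 0-39, hyperthreads 40-79
    n_socket=2,
)
print(affinity_code)  # string passed on the command line to the train script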
rlpyt.utils.launching.exp_launcher.run_experiments(script, affinity_code, experiment_title, runs_per_setting, variants, log_dirs, common_args=None, runs_args=None, set_egl_device=False)
Calls the script to run a set of experiments locally on this machine. Uses the launch_experiment() function for each individual run, which is a call to the script file. The number of experiments run simultaneously is determined by affinity_code, which expresses the hardware resources of the machine and how much each run gets (e.g. a 4-GPU machine with 2 GPUs per run). Experiments are queued and run in sequence, with the intention of avoiding hardware overlap. The inputs variants and log_dirs should be lists of equal length, holding each experiment's configuration and the location to save its log files (they have the same name, so they cannot exist in the same folder).
[Hint] To monitor progress, view the num_launched.txt file and experiments_tree.txt file in the experiment root directory, and also check the length of each progress.csv file, e.g. wc -l experiment-directory/.../run_*/progress.csv.
runs_per_setting: the number of runs to perform for each experiment setting.
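A bare-bones sketch of such a call with plain-dict variants (the real launch scripts below build variants with make_variants / VariantLevel; paths and names here are hypothetical):

from rlpyt.utils.launching.affinity import encode_affinity
from rlpyt.utils.launching.exp_launcher import run_experiments

affinity_code = encode_affinity(n_cpu_core=4, n_gpu=1)  # illustrative machine
variants = [dict(env=dict(game="pong")), dict(env=dict(game="seaquest"))]
log_dirs = ["pong", "seaquest"]             # same length as variants
run_experiments(
    script="path/to/train_script.py",       # hypothetical training script
    affinity_code=affinity_code,
    experiment_title="my_experiment",
    runs_per_setting=2,                     # two runs (run_ID 0 and 1) per variant
    variants=variants,
    log_dirs=log_dirs,
)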
rlpyt.utils.launching.exp_launcher.launch_experiment(script, run_slot, affinity_code, log_dir, variant, run_ID, args, python_executable=None, set_egl_device=False)
Launches one learning run using subprocess.Popen() to call the python script. Calls the script as: python {script} {slot_affinity_code} {log_dir} {run_ID} {*args}. If affinity_code["all_cpus"] is provided, the call is prefixed with taskset -c .. and the listed CPUs (this is the most reliable way to keep the run limited to those CPU cores). Also saves the variant file. Returns the process handle, which can be monitored. Use set_egl_device=True to set the environment variable EGL_DEVICE_ID equal to the same value as the algorithm's cuda index. For example, a DMControl environment modification can look for this environment variable when selecting a GPU for headless rendering.
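To make the resulting call concrete, an illustrative reconstruction of the command pattern described above (not rlpyt's actual source; all values made up):

import subprocess

# taskset -c <cpus> python {script} {slot_affinity_code} {log_dir} {run_ID} {*args}
call = [
    "taskset", "-c", "0,1,2,3",            # optional CPU-pinning prefix
    "python", "path/to/train_script.py",   # {script} (hypothetical)
    "0slt_1gpu_code",                      # {slot_affinity_code} (made up)
    "logs/experiment/run_0",               # {log_dir}
    "0",                                   # {run_ID}
    "r2d1",                                # {*args}, e.g. a config key
]
p = subprocess.Popen(call)  # launch_experiment returns a handle like this for monitoring
p.wait()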
--rlpyt/rlpyt
--agents
--dqn
--atari/atari_r2d1_agent.py
from rlpyt.agents.dqn.r2d1_agent import R2d1Agent, R2d1AlternatingAgent
from rlpyt.models.dqn.atari_r2d1_model import AtariR2d1Model
from rlpyt.agents.dqn.atari.mixin import AtariMixin


class AtariR2d1Agent(AtariMixin, R2d1Agent):

    def __init__(self, ModelCls=AtariR2d1Model, **kwargs):
        super().__init__(ModelCls=ModelCls, **kwargs)


class AtariR2d1AlternatingAgent(AtariMixin, R2d1AlternatingAgent):

    def __init__(self, ModelCls=AtariR2d1Model, **kwargs):
        super().__init__(ModelCls=ModelCls, **kwargs)
--r2d1_agent.py
import torch

from rlpyt.agents.base import (AgentStep, RecurrentAgentMixin,
    AlternatingRecurrentAgentMixin)
from rlpyt.agents.dqn.dqn_agent import DqnAgent
from rlpyt.utils.buffer import buffer_to, buffer_func, buffer_method
from rlpyt.utils.collections import namedarraytuple

AgentInfo = namedarraytuple("AgentInfo", ["q", "prev_rnn_state"])


class R2d1AgentBase(DqnAgent):
    """Base agent for recurrent DQN (to add recurrent mixin)."""

    def __call__(self, observation, prev_action, prev_reward, init_rnn_state):
        # Assume init_rnn_state already shaped: [N,B,H]
        prev_action = self.distribution.to_onehot(prev_action)
        model_inputs = buffer_to((observation, prev_action, prev_reward,
            init_rnn_state), device=self.device)
        q, rnn_state = self.model(*model_inputs)
        return q.cpu(), rnn_state  # Leave rnn state on device.

    @torch.no_grad()
    def step(self, observation, prev_action, prev_reward):
        """Computes Q-values for states/observations and selects actions by
        epsilon-greedy (no grad).  Advances RNN state."""
        prev_action = self.distribution.to_onehot(prev_action)
        agent_inputs = buffer_to((observation, prev_action, prev_reward),
            device=self.device)
        q, rnn_state = self.model(*agent_inputs, self.prev_rnn_state)  # Model handles None.
        q = q.cpu()
        action = self.distribution.sample(q)
        prev_rnn_state = self.prev_rnn_state or buffer_func(rnn_state, torch.zeros_like)
        # Transpose the rnn_state from [N,B,H] --> [B,N,H] for storage.
        # (Special case, model should always leave B dimension in.)
        prev_rnn_state = buffer_method(prev_rnn_state, "transpose", 0, 1)
        prev_rnn_state = buffer_to(prev_rnn_state, device="cpu")
        agent_info = AgentInfo(q=q, prev_rnn_state=prev_rnn_state)
        self.advance_rnn_state(rnn_state)  # Keep on device.
        return AgentStep(action=action, agent_info=agent_info)

    def target(self, observation, prev_action, prev_reward, init_rnn_state):
        # Assume init_rnn_state already shaped: [N,B,H]
        prev_action = self.distribution.to_onehot(prev_action)
        model_inputs = buffer_to((observation, prev_action, prev_reward,
            init_rnn_state), device=self.device)
        target_q, rnn_state = self.target_model(*model_inputs)
        return target_q.cpu(), rnn_state  # Leave rnn state on device.


class R2d1Agent(RecurrentAgentMixin, R2d1AgentBase):
    """R2D1 agent."""
    pass


class R2d1AlternatingAgent(AlternatingRecurrentAgentMixin, R2d1AgentBase):
    pass
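Note how step() transposes the stored RNN state from [N,B,H] to [B,N,H] so the replay buffer can index it per batch entry; a tiny plain-tensor illustration of that round trip (not rlpyt code):

import torch

N, B, H = 1, 4, 512                    # LSTM layers, batch size, hidden size
rnn_state = torch.zeros(N, B, H)       # shape as returned by the model: [N,B,H]
stored = rnn_state.transpose(0, 1)     # stored per sample as [B,N,H]
restored = stored.transpose(0, 1).contiguous()  # back to [N,B,H] for cuDNN
print(stored.shape, restored.shape)    # torch.Size([4, 1, 512]) torch.Size([1, 4, 512])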
--base.py
--algos
--dqn/r2d1.py
import torch from collections import namedtuple from rlpyt.algos.dqn.dqn import DQN, SamplesToBuffer from rlpyt.agents.base import AgentInputs from rlpyt.utils.quick_args import save__init__args from rlpyt.utils.logging import logger from rlpyt.utils.collections import namedarraytuple from rlpyt.replays.sequence.frame import (UniformSequenceReplayFrameBuffer, PrioritizedSequenceReplayFrameBuffer, AsyncUniformSequenceReplayFrameBuffer, AsyncPrioritizedSequenceReplayFrameBuffer) from rlpyt.utils.tensor import select_at_indexes, valid_mean from rlpyt.algos.utils import valid_from_done, discount_return_n_step from rlpyt.utils.buffer import buffer_to, buffer_method, torchify_buffer OptInfo = namedtuple("OptInfo", ["loss", "gradNorm", "tdAbsErr", "priority"]) SamplesToBufferRnn = namedarraytuple("SamplesToBufferRnn", SamplesToBuffer._fields + ("prev_rnn_state",)) PrioritiesSamplesToBuffer = namedarraytuple("PrioritiesSamplesToBuffer", ["priorities", "samples"]) class R2D1(DQN): """Recurrent-replay DQN with options for: Double-DQN, Dueling Architecture, n-step returns, prioritized_replay.""" opt_info_fields = tuple(f for f in OptInfo._fields) # copy def __init__( self, discount=0.997, batch_T=80, batch_B=64, warmup_T=40, store_rnn_state_interval=40, # 0 for none, 1 for all. min_steps_learn=int(1e5), delta_clip=None, # Typically use squared-error loss (Steven). replay_size=int(1e6), replay_ratio=1, target_update_interval=2500, # (Steven says 2500 but maybe faster.) n_step_return=5, learning_rate=1e-4, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=80., # 80 (Steven). # eps_init=1, # NOW IN AGENT. # eps_final=0.1, # eps_final_min=0.0005, # eps_eval=0.001, eps_steps=int(1e6), # STILL IN ALGO; conver to itr, give to agent. double_dqn=True, prioritized_replay=True, pri_alpha=0.6, pri_beta_init=0.9, pri_beta_final=0.9, pri_beta_steps=int(50e6), pri_eta=0.9, default_priority=None, input_priorities=True, input_priority_shift=None, value_scale_eps=1e-3, # 1e-3 (Steven). ReplayBufferCls=None, # leave None to select by above options updates_per_sync=1, # For async mode only. ): """Saves input arguments. Args: store_rnn_state_interval (int): store RNN state only once this many steps, to reduce memory usage; replay sequences will only begin at the steps with stored recurrent state. Note: Typically ran with ``store_rnn_state_interval`` equal to the sampler's ``batch_T``, 40. Then every 40 steps can be the beginning of a replay sequence, and will be guaranteed to start with a valid RNN state. Only reset the RNN state (and env) at the end of the sampler batch, so that the beginnings of episodes are trained on. """ if optim_kwargs is None: optim_kwargs = dict(eps=1e-3) # Assumes Adam. if default_priority is None: default_priority = delta_clip or 1. 
if input_priority_shift is None: input_priority_shift = warmup_T // store_rnn_state_interval save__init__args(locals()) self._batch_size = (self.batch_T + self.warmup_T) * self.batch_B def initialize_replay_buffer(self, examples, batch_spec, async_=False): """Similar to DQN but uses replay buffers which return sequences, and stores the agent's recurrent state.""" example_to_buffer = SamplesToBuffer( observation=examples["observation"], action=examples["action"], reward=examples["reward"], done=examples["done"], ) if self.store_rnn_state_interval > 0: example_to_buffer = SamplesToBufferRnn(*example_to_buffer, prev_rnn_state=examples["agent_info"].prev_rnn_state, ) replay_kwargs = dict( example=example_to_buffer, size=self.replay_size, B=batch_spec.B, discount=self.discount, n_step_return=self.n_step_return, rnn_state_interval=self.store_rnn_state_interval, # batch_T fixed for prioritized, (relax if rnn_state_interval=1 or 0). batch_T=self.batch_T + self.warmup_T, ) if self.prioritized_replay: replay_kwargs.update(dict( alpha=self.pri_alpha, beta=self.pri_beta_init, default_priority=self.default_priority, input_priorities=self.input_priorities, # True/False. input_priority_shift=self.input_priority_shift, )) ReplayCls = (AsyncPrioritizedSequenceReplayFrameBuffer if async_ else PrioritizedSequenceReplayFrameBuffer) else: ReplayCls = (AsyncUniformSequenceReplayFrameBuffer if async_ else UniformSequenceReplayFrameBuffer) if self.ReplayBufferCls is not None: ReplayCls = self.ReplayBufferCls logger.log(f"WARNING: ignoring internal selection logic and using" f" input replay buffer class: {ReplayCls} -- compatibility not" " guaranteed.") self.replay_buffer = ReplayCls(**replay_kwargs) return self.replay_buffer def optimize_agent(self, itr, samples=None, sampler_itr=None): """ Similar to DQN, except allows to compute the priorities of new samples as they enter the replay buffer (input priorities), instead of only once they are used in training (important because the replay-ratio is quite low, about 1, so must avoid un-informative samples). """ # TODO: estimate priorities for samples entering the replay buffer. # Steven says: workers did this approximately by using the online # network only for td-errors (not the target network). # This could be tough since add samples before the priorities are ready # (next batch), and in async case workers must do it. 
itr = itr if sampler_itr is None else sampler_itr # Async uses sampler_itr if samples is not None: samples_to_buffer = self.samples_to_buffer(samples) self.replay_buffer.append_samples(samples_to_buffer) opt_info = OptInfo(*([] for _ in range(len(OptInfo._fields)))) if itr < self.min_itr_learn: return opt_info for _ in range(self.updates_per_optimize): samples_from_replay = self.replay_buffer.sample_batch(self.batch_B) self.optimizer.zero_grad() loss, td_abs_errors, priorities = self.loss(samples_from_replay) loss.backward() grad_norm = torch.nn.utils.clip_grad_norm_( self.agent.parameters(), self.clip_grad_norm) self.optimizer.step() if self.prioritized_replay: self.replay_buffer.update_batch_priorities(priorities) opt_info.loss.append(loss.item()) opt_info.gradNorm.append(torch.tensor(grad_norm).item()) # backwards compatible opt_info.tdAbsErr.extend(td_abs_errors[::8].numpy()) opt_info.priority.extend(priorities) self.update_counter += 1 if self.update_counter % self.target_update_interval == 0: self.agent.update_target() self.update_itr_hyperparams(itr) return opt_info def samples_to_buffer(self, samples): samples_to_buffer = super().samples_to_buffer(samples) if self.store_rnn_state_interval > 0: samples_to_buffer = SamplesToBufferRnn(*samples_to_buffer, prev_rnn_state=samples.agent.agent_info.prev_rnn_state) if self.input_priorities: priorities = self.compute_input_priorities(samples) samples_to_buffer = PrioritiesSamplesToBuffer( priorities=priorities, samples=samples_to_buffer) return samples_to_buffer def compute_input_priorities(self, samples): """Used when putting new samples into the replay buffer. Computes n-step TD-errors using recorded Q-values from online network and value scaling. Weights the max and the mean TD-error over each sequence to make a single priority value for that sequence. Note: Although the original R2D2 implementation used the entire 80-step sequence to compute the input priorities, we ran R2D1 with 40 time-step sample batches, and so computed the priority for each 80-step training sequence based on one of the two 40-step halves. Algorithm argument ``input_priority_shift`` determines which 40-step half is used as the priority for the 80-step sequence. (Since this method might get executed by alternating memory copiers in async mode, don't carry internal state here, do all computation with only the samples available in input. Could probably reduce to one memory copier and keep state there, if needed.) """ # """Just for first input into replay buffer. # Simple 1-step return TD-errors using recorded Q-values from online # network and value scaling, with the T dimension reduced away (same # priority applied to all samples in this batch; whereever the rnn state # is kept--hopefully the first step--this priority will apply there). # The samples duration T might be less than the training segment, so # this is an approximation of an approximation, but hopefully will # capture the right behavior. # UPDATE 20190826: Trying using n-step returns. For now using samples # with full n-step return available...later could also use partial # returns for samples at end of batch. 35/40 ain't bad tho. 
# Might not carry/use internal state here, because might get executed # by alternating memory copiers in async mode; do all with only the # samples avialable from input.""" samples = torchify_buffer(samples) q = samples.agent.agent_info.q action = samples.agent.action q_max = torch.max(q, dim=-1).values q_at_a = select_at_indexes(action, q) return_n, done_n = discount_return_n_step( reward=samples.env.reward, done=samples.env.done, n_step=self.n_step_return, discount=self.discount, do_truncated=False, # Only samples with full n-step return. ) # y = self.value_scale( # samples.env.reward[:-1] + # (self.discount * (1 - samples.env.done[:-1].float()) * # probably done.float() # self.inv_value_scale(q_max[1:])) # ) nm1 = max(1, self.n_step_return - 1) # At least 1 bc don't have next Q. y = self.value_scale(return_n + (1 - done_n.float()) * self.inv_value_scale(q_max[nm1:])) delta = abs(q_at_a[:-nm1] - y) # NOTE: by default, with R2D1, use squared-error loss, delta_clip=None. if self.delta_clip is not None: # Huber loss. delta = torch.clamp(delta, 0, self.delta_clip) valid = valid_from_done(samples.env.done[:-nm1]) max_d = torch.max(delta * valid, dim=0).values mean_d = valid_mean(delta, valid, dim=0) # Still high if less valid. priorities = self.pri_eta * max_d + (1 - self.pri_eta) * mean_d # [B] return priorities.numpy() def loss(self, samples): """Samples have leading Time and Batch dimentions [T,B,..]. Move all samples to device first, and then slice for sub-sequences. Use same init_rnn_state for agent and target; start both at same t. Warmup the RNN state first on the warmup subsequence, then train on the remaining subsequence. Returns loss (usually use MSE, not Huber), TD-error absolute values, and new sequence-wise priorities, based on weighted sum of max and mean TD-error over the sequence.""" all_observation, all_action, all_reward = buffer_to( (samples.all_observation, samples.all_action, samples.all_reward), device=self.agent.device) wT, bT, nsr = self.warmup_T, self.batch_T, self.n_step_return if wT > 0: warmup_slice = slice(None, wT) # Same for agent and target. warmup_inputs = AgentInputs( observation=all_observation[warmup_slice], prev_action=all_action[warmup_slice], prev_reward=all_reward[warmup_slice], ) agent_slice = slice(wT, wT + bT) agent_inputs = AgentInputs( observation=all_observation[agent_slice], prev_action=all_action[agent_slice], prev_reward=all_reward[agent_slice], ) target_slice = slice(wT, None) # Same start t as agent. (wT + bT + nsr) target_inputs = AgentInputs( observation=all_observation[target_slice], prev_action=all_action[target_slice], prev_reward=all_reward[target_slice], ) action = samples.all_action[wT + 1:wT + 1 + bT] # CPU. return_ = samples.return_[wT:wT + bT] done_n = samples.done_n[wT:wT + bT] if self.store_rnn_state_interval == 0: init_rnn_state = None else: # [B,N,H]-->[N,B,H] cudnn. init_rnn_state = buffer_method(samples.init_rnn_state, "transpose", 0, 1) init_rnn_state = buffer_method(init_rnn_state, "contiguous") if wT > 0: # Do warmup. with torch.no_grad(): _, target_rnn_state = self.agent.target(*warmup_inputs, init_rnn_state) _, init_rnn_state = self.agent(*warmup_inputs, init_rnn_state) # Recommend aligning sampling batch_T and store_rnn_interval with # warmup_T (and no mid_batch_reset), so that end of trajectory # during warmup leads to new trajectory beginning at start of # training segment of replay. 
warmup_invalid_mask = valid_from_done(samples.done[:wT])[-1] == 0 # [B] init_rnn_state[:, warmup_invalid_mask] = 0 # [N,B,H] (cudnn) target_rnn_state[:, warmup_invalid_mask] = 0 else: target_rnn_state = init_rnn_state qs, _ = self.agent(*agent_inputs, init_rnn_state) # [T,B,A] q = select_at_indexes(action, qs) with torch.no_grad(): target_qs, _ = self.agent.target(*target_inputs, target_rnn_state) if self.double_dqn: next_qs, _ = self.agent(*target_inputs, init_rnn_state) next_a = torch.argmax(next_qs, dim=-1) target_q = select_at_indexes(next_a, target_qs) else: target_q = torch.max(target_qs, dim=-1).values target_q = target_q[-bT:] # Same length as q. disc = self.discount ** self.n_step_return y = self.value_scale(return_ + (1 - done_n.float()) * disc * self.inv_value_scale(target_q)) # [T,B] delta = y - q losses = 0.5 * delta ** 2 abs_delta = abs(delta) # NOTE: by default, with R2D1, use squared-error loss, delta_clip=None. if self.delta_clip is not None: # Huber loss. b = self.delta_clip * (abs_delta - self.delta_clip / 2) losses = torch.where(abs_delta <= self.delta_clip, losses, b) if self.prioritized_replay: losses *= samples.is_weights.unsqueeze(0) # weights: [B] --> [1,B] valid = valid_from_done(samples.done[wT:]) # 0 after first done. loss = valid_mean(losses, valid) td_abs_errors = abs_delta.detach() if self.delta_clip is not None: td_abs_errors = torch.clamp(td_abs_errors, 0, self.delta_clip) # [T,B] valid_td_abs_errors = td_abs_errors * valid max_d = torch.max(valid_td_abs_errors, dim=0).values mean_d = valid_mean(td_abs_errors, valid, dim=0) # Still high if less valid. priorities = self.pri_eta * max_d + (1 - self.pri_eta) * mean_d # [B] return loss, valid_td_abs_errors, priorities def value_scale(self, x): """Value scaling function to handle raw rewards across games (not clipped).""" return (torch.sign(x) * (torch.sqrt(abs(x) + 1) - 1) + self.value_scale_eps * x) def inv_value_scale(self, z): """Invert the value scaling.""" return torch.sign(z) * (((torch.sqrt(1 + 4 * self.value_scale_eps * (abs(z) + 1 + self.value_scale_eps)) - 1) / (2 * self.value_scale_eps)) ** 2 - 1)
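The value_scale / inv_value_scale pair at the end of this file implements the invertible rescaling h(x) = sign(x)·(sqrt(|x|+1) − 1) + eps·x used for raw (unclipped) rewards; a standalone numerical check that the two are inverses, mirroring the methods above with the default value_scale_eps:

import torch

eps = 1e-3  # value_scale_eps default above

def value_scale(x):
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x

def inv_value_scale(z):
    return torch.sign(z) * (((torch.sqrt(
        1 + 4 * eps * (torch.abs(z) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)

x = torch.tensor([-100., -1., 0., 1., 100.])
print(inv_value_scale(value_scale(x)))  # ~[-100, -1, 0, 1, 100]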
--experiments
--configs/atari/dqn/atari_r2d1.py
--scripts/atari/dqn
--launch
--dgx
--launch_atari_r2d1_async_alt.py
from rlpyt.utils.launching.affinity import encode_affinity
from rlpyt.utils.launching.exp_launcher import run_experiments
from rlpyt.utils.launching.variant import make_variants, VariantLevel

script = "rlpyt/experiments/scripts/atari/dqn/train/atari_r2d1_async_alt.py"
affinity_code = encode_affinity(
    n_cpu_core=40,
    n_gpu=4,
    async_sample=True,
    gpu_per_run=1,
    sample_gpu_per_run=2,
    # hyperthread_offset=24,
    # optim_sample_share_gpu=True,
    n_socket=1,  # Force this.
    alternating=True,
)
runs_per_setting = 2
experiment_title = "atari_r2d1_async_alt"
variant_levels = list()

games = ["pong", "seaquest", "qbert", "chopper_command"]
values = list(zip(games))
dir_names = ["{}".format(*v) for v in values]
keys = [("env", "game")]
variant_levels.append(VariantLevel(keys, values, dir_names))

variants, log_dirs = make_variants(*variant_levels)

default_config_key = "async_alt_dgx"

run_experiments(
    script=script,
    affinity_code=affinity_code,
    experiment_title=experiment_title,
    runs_per_setting=runs_per_setting,
    variants=variants,
    log_dirs=log_dirs,
    common_args=(default_config_key,),
)
--launch_atari_r2d1_async_alt_seaquest.py
from rlpyt.utils.launching.affinity import encode_affinity
from rlpyt.utils.launching.exp_launcher import run_experiments
from rlpyt.utils.launching.variant import make_variants, VariantLevel

script = "rlpyt/experiments/scripts/atari/dqn/train/atari_r2d1_async_alt.py"
affinity_code = encode_affinity(
    n_cpu_core=40,
    n_gpu=4,
    async_sample=True,
    gpu_per_run=1,
    sample_gpu_per_run=2,
    # hyperthread_offset=24,
    # optim_sample_share_gpu=True,
    n_socket=1,  # Force this.
    alternating=True,
)
runs_per_setting = 1
experiment_title = "atari_r2d1_async_alt"
variant_levels = list()

games = ["seaquest"]
values = list(zip(games))
dir_names = ["{}".format(*v) for v in values]
keys = [("env", "game")]
variant_levels.append(VariantLevel(keys, values, dir_names))

variants, log_dirs = make_variants(*variant_levels)

default_config_key = "async_alt_dgx"

run_experiments(
    script=script,
    affinity_code=affinity_code,
    experiment_title=experiment_title,
    runs_per_setting=runs_per_setting,
    variants=variants,
    log_dirs=log_dirs,
    common_args=(default_config_key,),
)
--launch_atari_r2d1_async_gpu.py
from rlpyt.utils.launching.affinity import encode_affinity
from rlpyt.utils.launching.exp_launcher import run_experiments
from rlpyt.utils.launching.variant import make_variants, VariantLevel

script = "rlpyt/experiments/scripts/atari/dqn/train/atari_r2d1_async_gpu.py"
affinity_code = encode_affinity(
    n_cpu_core=40,
    n_gpu=4,
    async_sample=True,
    gpu_per_run=1,
    sample_gpu_per_run=3,
    # hyperthread_offset=24,
    # optim_sample_share_gpu=True,
    n_socket=1,  # Force this.
    alternating=False,
)
runs_per_setting = 1
experiment_title = "atari_r2d1_async_gpu"
variant_levels = list()

games = ["seaquest"]
values = list(zip(games))
dir_names = ["{}".format(*v) for v in values]
keys = [("env", "game")]
variant_levels.append(VariantLevel(keys, values, dir_names))

variants, log_dirs = make_variants(*variant_levels)

default_config_key = "async_gpu_dgx"

run_experiments(
    script=script,
    affinity_code=affinity_code,
    experiment_title=experiment_title,
    runs_per_setting=runs_per_setting,
    variants=variants,
    log_dirs=log_dirs,
    common_args=(default_config_key,),
)
--got
--launch_atari_r2d1_async_alt.py
--launch_atari_r2d1_async_alt_amidar.py
--launch_atari_r2d1_async_alt_gravitar.py
--launch_atari_r2d1_async_alt_pong.py
--launch_atari_r2d1_async_alt_seaquest.py
--launch_atari_r2d1_async_gpu.py
--launch_atari_r2d1_async_gpu_amidar.py
--launch_atari_r2d1_async_gpu_test.py
--launch_atari_r2d1_long_4tr_gravitar.py
--pabti
--launch_atari_r2d1_async_alt.py
--launch_atari_r2d1_async_alt_chopper_command.py
--launch_atari_r2d1_async_alt_gravitar.py
--launch_atari_r2d1_async_alt_qbert.py
--launch_atari_r2d1_async_gpu_qbert.py
--launch_atari_r2d1_long_4tr_asteroids.py
--launch_atari_r2d1_long_4tr_chopper_command.py
--launch_atari_r2d1_long_4tr_gravitar.py
--launch_atari_r2d1_long_4tr_seaquest.py
--launch_atari_r2d1_long_gt_ad.py
--launch_atari_r2d1_long_sq_cc.py
--launch_atari_r2d1_gpu_basic.py
from rlpyt.utils.launching.affinity import encode_affinity
from rlpyt.utils.launching.exp_launcher import run_experiments
from rlpyt.utils.launching.variant import make_variants, VariantLevel

script = "rlpyt/experiments/scripts/atari/dqn/train/atari_r2d1_gpu.py"
affinity_code = encode_affinity(
    n_cpu_core=4,
    n_gpu=1,
    hyperthread_offset=8,
    n_socket=1,
    # cpu_per_run=2,
)
runs_per_setting = 2
experiment_title = "atari_r2d1_basic"
variant_levels = list()

games = ["pong", "seaquest", "qbert", "chopper_command"]
values = list(zip(games))
dir_names = ["{}".format(*v) for v in values]
keys = [("env", "game")]
variant_levels.append(VariantLevel(keys, values, dir_names))

variants, log_dirs = make_variants(*variant_levels)

default_config_key = "r2d1"

run_experiments(
    script=script,
    affinity_code=affinity_code,
    experiment_title=experiment_title,
    runs_per_setting=runs_per_setting,
    variants=variants,
    log_dirs=log_dirs,
    common_args=(default_config_key,),
)
--train
--atari_r2d1_async_alt.py
import sys

from rlpyt.utils.launching.affinity import affinity_from_code
from rlpyt.samplers.async_.alternating_sampler import AsyncAlternatingSampler
from rlpyt.samplers.async_.collectors import DbGpuResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv, AtariTrajInfo
from rlpyt.algos.dqn.r2d1 import R2D1
from rlpyt.agents.dqn.atari.atari_r2d1_agent import AtariR2d1AlternatingAgent
from rlpyt.runners.async_rl import AsyncRlEval
from rlpyt.utils.logging.context import logger_context
from rlpyt.utils.launching.variant import load_variant, update_config

from rlpyt.experiments.configs.atari.dqn.atari_r2d1 import configs


def build_and_train(slot_affinity_code, log_dir, run_ID, config_key):
    affinity = affinity_from_code(slot_affinity_code)
    config = configs[config_key]
    variant = load_variant(log_dir)
    config = update_config(config, variant)
    config["eval_env"]["game"] = config["env"]["game"]

    sampler = AsyncAlternatingSampler(
        EnvCls=AtariEnv,
        env_kwargs=config["env"],
        CollectorCls=DbGpuResetCollector,
        TrajInfoCls=AtariTrajInfo,
        eval_env_kwargs=config["eval_env"],
        **config["sampler"]
    )
    algo = R2D1(optim_kwargs=config["optim"], **config["algo"])
    agent = AtariR2d1AlternatingAgent(model_kwargs=config["model"], **config["agent"])
    runner = AsyncRlEval(
        algo=algo,
        agent=agent,
        sampler=sampler,
        affinity=affinity,
        **config["runner"]
    )
    name = "async_alt_" + config["env"]["game"]
    with logger_context(log_dir, run_ID, name, config):
        runner.train()


if __name__ == "__main__":
    build_and_train(*sys.argv[1:])
--atari_r2d1_async_gpu.py
import sys

from rlpyt.utils.launching.affinity import affinity_from_code
from rlpyt.samplers.async_.gpu_sampler import AsyncGpuSampler
from rlpyt.samplers.async_.collectors import DbGpuResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv, AtariTrajInfo
from rlpyt.algos.dqn.r2d1 import R2D1
from rlpyt.agents.dqn.atari.atari_r2d1_agent import AtariR2d1Agent
from rlpyt.runners.async_rl import AsyncRlEval
from rlpyt.utils.logging.context import logger_context
from rlpyt.utils.launching.variant import load_variant, update_config

from rlpyt.experiments.configs.atari.dqn.atari_r2d1 import configs


def build_and_train(slot_affinity_code, log_dir, run_ID, config_key):
    affinity = affinity_from_code(slot_affinity_code)
    config = configs[config_key]
    variant = load_variant(log_dir)
    config = update_config(config, variant)
    config["eval_env"]["game"] = config["env"]["game"]

    sampler = AsyncGpuSampler(
        EnvCls=AtariEnv,
        env_kwargs=config["env"],
        CollectorCls=DbGpuResetCollector,
        TrajInfoCls=AtariTrajInfo,
        eval_env_kwargs=config["eval_env"],
        **config["sampler"]
    )
    algo = R2D1(optim_kwargs=config["optim"], **config["algo"])
    agent = AtariR2d1Agent(model_kwargs=config["model"], **config["agent"])
    runner = AsyncRlEval(
        algo=algo,
        agent=agent,
        sampler=sampler,
        affinity=affinity,
        **config["runner"]
    )
    name = "async_gpu_" + config["env"]["game"]
    with logger_context(log_dir, run_ID, name, config):
        runner.train()


if __name__ == "__main__":
    build_and_train(*sys.argv[1:])
--atari_r2d1_gpu.py
import sys

from rlpyt.utils.launching.affinity import affinity_from_code
from rlpyt.samplers.parallel.gpu.sampler import GpuSampler
from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv, AtariTrajInfo
from rlpyt.algos.dqn.r2d1 import R2D1
from rlpyt.agents.dqn.atari.atari_r2d1_agent import AtariR2d1Agent
from rlpyt.runners.minibatch_rl import MinibatchRlEval
from rlpyt.utils.logging.context import logger_context
from rlpyt.utils.launching.variant import load_variant, update_config

from rlpyt.experiments.configs.atari.dqn.atari_r2d1 import configs


def build_and_train(slot_affinity_code, log_dir, run_ID, config_key):
    affinity = affinity_from_code(slot_affinity_code)
    config = configs[config_key]
    variant = load_variant(log_dir)
    config = update_config(config, variant)
    config["eval_env"]["game"] = config["env"]["game"]

    sampler = GpuSampler(
        EnvCls=AtariEnv,
        env_kwargs=config["env"],
        CollectorCls=GpuWaitResetCollector,
        TrajInfoCls=AtariTrajInfo,
        eval_env_kwargs=config["eval_env"],
        **config["sampler"]
    )
    algo = R2D1(optim_kwargs=config["optim"], **config["algo"])
    agent = AtariR2d1Agent(model_kwargs=config["model"], **config["agent"])
    runner = MinibatchRlEval(
        algo=algo,
        agent=agent,
        sampler=sampler,
        affinity=affinity,
        **config["runner"]
    )
    name = config["env"]["game"]
    with logger_context(log_dir, run_ID, name, config):
        runner.train()


if __name__ == "__main__":
    build_and_train(*sys.argv[1:])
--models/dqn/atari_r2d1_model.py
import torch

from rlpyt.utils.tensor import infer_leading_dims, restore_leading_dims
from rlpyt.utils.collections import namedarraytuple
from rlpyt.models.conv2d import Conv2dHeadModel
from rlpyt.models.mlp import MlpModel
from rlpyt.models.dqn.dueling import DuelingHeadModel

RnnState = namedarraytuple("RnnState", ["h", "c"])


class AtariR2d1Model(torch.nn.Module):
    """2D convolutional neural network (for multiple video frames per
    observation) feeding into an LSTM and MLP output for Q-value outputs for
    the action set."""

    def __init__(
            self,
            image_shape,
            output_size,
            fc_size=512,  # Between conv and lstm.
            lstm_size=512,
            head_size=512,
            dueling=False,
            use_maxpool=False,
            channels=None,  # None uses default.
            kernel_sizes=None,
            strides=None,
            paddings=None,
            ):
        """Instantiates the neural network according to arguments; network
        defaults stored within this method."""
        super().__init__()
        self.dueling = dueling
        self.conv = Conv2dHeadModel(
            image_shape=image_shape,
            channels=channels or [32, 64, 64],
            kernel_sizes=kernel_sizes or [8, 4, 3],
            strides=strides or [4, 2, 1],
            paddings=paddings or [0, 1, 1],
            use_maxpool=use_maxpool,
            hidden_sizes=fc_size,  # ReLU applied here (Steven).
        )
        self.lstm = torch.nn.LSTM(self.conv.output_size + output_size + 1, lstm_size)
        if dueling:
            self.head = DuelingHeadModel(lstm_size, head_size, output_size)
        else:
            self.head = MlpModel(lstm_size, head_size, output_size=output_size)

    def forward(self, observation, prev_action, prev_reward, init_rnn_state):
        """Feedforward layers process as [T*B,H].  Return same leading dims as
        input, can be [T,B], [B], or []."""
        img = observation.type(torch.float)  # Expect torch.uint8 inputs
        img = img.mul_(1. / 255)  # From [0-255] to [0-1], in place.

        # Infer (presence of) leading dimensions: [T,B], [B], or [].
        lead_dim, T, B, img_shape = infer_leading_dims(img, 3)

        conv_out = self.conv(img.view(T * B, *img_shape))  # Fold if T dimension.
        lstm_input = torch.cat([
            conv_out.view(T, B, -1),
            prev_action.view(T, B, -1),  # Assumed onehot.
            prev_reward.view(T, B, 1),
            ], dim=2)
        init_rnn_state = None if init_rnn_state is None else tuple(init_rnn_state)
        lstm_out, (hn, cn) = self.lstm(lstm_input, init_rnn_state)
        q = self.head(lstm_out.view(T * B, -1))

        # Restore leading dimensions: [T,B], [B], or [], as input.
        q = restore_leading_dims(q, lead_dim, T, B)
        # Model should always leave B-dimension in rnn state: [N,B,H].
        next_rnn_state = RnnState(h=hn, c=cn)
        return q, next_rnn_state
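A minimal shape check of this model (illustrative sizes; in training, the sampler and agent supply real observations and one-hot previous actions):

import torch

from rlpyt.models.dqn.atari_r2d1_model import AtariR2d1Model

T, B, A = 5, 4, 6                                  # time steps, batch size, number of actions
model = AtariR2d1Model(image_shape=(4, 104, 80), output_size=A)
obs = torch.zeros(T, B, 4, 104, 80, dtype=torch.uint8)
prev_action = torch.zeros(T, B, A)                 # one-hot previous actions
prev_reward = torch.zeros(T, B)
q, rnn_state = model(obs, prev_action, prev_reward, init_rnn_state=None)
print(q.shape)            # [T, B, A]
print(rnn_state.h.shape)  # [N, B, H] = [1, 4, 512], B-dimension kept in the rnn state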