On the data processing and sampling methods in PPO for the two network architectures, MLP and RNN

In RL, data has to be sampled, so how to construct the data (or data blocks) so that they can be sampled is an issue worth paying attention to:

if self.actor_critic.is_recurrent:
    data_generator = rollouts.recurrent_generator(
        advantages, self.num_mini_batch)
else:
    data_generator = rollouts.feed_forward_generator(
        advantages, self.num_mini_batch)

for sample in data_generator:
    obs_batch, recurrent_hidden_states_batch, actions_batch, \
        value_preds_batch, return_batch, masks_batch, old_action_log_probs_batch, \
        adv_targ = sample

    # Reshape to do in a single forward pass for all steps
    values, action_log_probs, dist_entropy, _ = self.actor_critic.evaluate_actions(
        obs_batch, recurrent_hidden_states_batch, masks_batch,
        actions_batch)
(The code above comes from ppo_atari and pytorch-a2c-ppo-acktr-gail-master; the code in these two projects is identical and covers data generation and sampling for MLP/CNN/RNN networks.)
As shown above, the data is regenerated and then sampled: data_generator yields the regenerated data, and it uses a different generation strategy depending on whether self.actor_critic is a recurrent network, while the mini-batches it produces are consumed in the same way in both cases.

What PPO stores is likewise a block of 128 steps from each of 8 parallel processes, i.e. the buffer holds 8 * 128 transitions. If self.actor_critic is not an RNN, the data can be shuffled and sampled freely, without worrying about the temporal dependency between individual tuples. If self.actor_critic is an RNN, however, the temporal correlation between samples has to be taken into account, so the temporal relationships in the data must be preserved when it is processed and sampled.
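
For reference, below is a minimal sketch of the buffer layout assumed in this post (num_steps = 128, num_processes = 8; obs_dim and hidden_size are placeholder values). The tensor names and shapes are inferred from how the two generators below index them (for example, self.obs[:-1] implies that obs carries one extra time step), so treat this as an illustration rather than the repository's exact RolloutStorage definition.

import torch

# Sketch of the rollout buffer layout assumed in this post (not the
# repository's actual storage class): 128 steps from 8 parallel processes.
# obs / recurrent_hidden_states / value_preds / returns / masks keep one
# extra time step (num_steps + 1), which is why the generators below
# index them with [:-1] or [0:1].
num_steps, num_processes, obs_dim, hidden_size = 128, 8, 4, 64

obs = torch.zeros(num_steps + 1, num_processes, obs_dim)
recurrent_hidden_states = torch.zeros(num_steps + 1, num_processes, hidden_size)
rewards = torch.zeros(num_steps, num_processes, 1)
actions = torch.zeros(num_steps, num_processes, 1)
value_preds = torch.zeros(num_steps + 1, num_processes, 1)
returns = torch.zeros(num_steps + 1, num_processes, 1)
masks = torch.ones(num_steps + 1, num_processes, 1)
action_log_probs = torch.zeros(num_steps, num_processes, 1)

print(num_steps * num_processes)  # 1024 transitions per PPO update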

(1) When self.actor_critic is not an RNN:
    data_generator = rollouts.feed_forward_generator(
        advantages, self.num_mini_batch)

The generator is defined as follows:
def feed_forward_generator(self,
                           advantages,
                           num_mini_batch=None,
                           mini_batch_size=None):
    num_steps, num_processes = self.rewards.size()[0:2]
    batch_size = num_processes * num_steps

    if mini_batch_size is None:
        assert batch_size >= num_mini_batch, (
            "PPO requires the number of processes ({}) "
            "* number of steps ({}) = {} "
            "to be greater than or equal to the number of PPO mini batches ({})."
            "".format(num_processes, num_steps, num_processes * num_steps,
                      num_mini_batch))
        mini_batch_size = batch_size // num_mini_batch

    sampler = BatchSampler(
        SubsetRandomSampler(range(batch_size)),
        mini_batch_size,
        drop_last=True)

    for indices in sampler:
        obs_batch = self.obs[:-1].view(-1, *self.obs.size()[2:])[indices]
        recurrent_hidden_states_batch = self.recurrent_hidden_states[:-1].view(
            -1, self.recurrent_hidden_states.size(-1))[indices]
        actions_batch = self.actions.view(-1,
                                          self.actions.size(-1))[indices]
        value_preds_batch = self.value_preds[:-1].view(-1, 1)[indices]
        return_batch = self.returns[:-1].view(-1, 1)[indices]
        masks_batch = self.masks[:-1].view(-1, 1)[indices]
        old_action_log_probs_batch = self.action_log_probs.view(-1,
                                                                1)[indices]

        if advantages is None:
            adv_targ = None
        else:
            adv_targ = advantages.view(-1, 1)[indices]

        yield obs_batch, recurrent_hidden_states_batch, actions_batch, \
            value_preds_batch, return_batch, masks_batch, old_action_log_probs_batch, adv_targ

Here,

    sampler = BatchSampler(
        SubsetRandomSampler(range(batch_size)),
        mini_batch_size,
        drop_last=True)

shuffles the flattened indices of all batch_size transitions (8 * 128 = 1024 in this post) and splits them into num_mini_batch (4 in this post) mini-batches of mini_batch_size = batch_size // num_mini_batch (256 in this post) transitions each. During the subsequent training, each data_generator therefore yields 4 mini-batches of 256 transitions, with no transition shared between them.
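
As a quick sanity check, the following standalone snippet uses the same torch.utils.data samplers with the values assumed in this post (batch_size = 8 * 128 = 1024, num_mini_batch = 4) and shows that one pass over the sampler yields 4 disjoint mini-batches of 256 indices which together cover the whole buffer:

from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler

# Values assumed in this post: 8 processes * 128 steps, split into 4 mini-batches.
batch_size = 8 * 128
num_mini_batch = 4
mini_batch_size = batch_size // num_mini_batch  # 256

sampler = BatchSampler(
    SubsetRandomSampler(range(batch_size)),  # a random permutation of all flat indices
    mini_batch_size,
    drop_last=True)

all_indices = []
for i, indices in enumerate(sampler):
    print(i, len(indices))   # 4 mini-batches of 256 indices each
    all_indices.extend(indices)

# Every transition is used exactly once, so one pass covers the whole buffer.
assert sorted(all_indices) == list(range(batch_size))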


(2) When self.actor_critic is an RNN:
    data_generator = rollouts.recurrent_generator(
        advantages, self.num_mini_batch)

def recurrent_generator(self, advantages, num_mini_batch):
    num_processes = self.rewards.size(1)
    assert num_processes >= num_mini_batch, (
        "PPO requires the number of processes ({}) "
        "to be greater than or equal to the number of "
        "PPO mini batches ({}).".format(num_processes, num_mini_batch))
    num_envs_per_batch = num_processes // num_mini_batch
    perm = torch.randperm(num_processes)

    for start_ind in range(0, num_processes, num_envs_per_batch):
        obs_batch = []
        recurrent_hidden_states_batch = []
        actions_batch = []
        value_preds_batch = []
        return_batch = []
        masks_batch = []
        old_action_log_probs_batch = []
        adv_targ = []

        for offset in range(num_envs_per_batch):
            ind = perm[start_ind + offset]

            obs_batch.append(self.obs[:-1, ind])
            recurrent_hidden_states_batch.append(
                self.recurrent_hidden_states[0:1, ind])
            actions_batch.append(self.actions[:, ind])
            value_preds_batch.append(self.value_preds[:-1, ind])
            return_batch.append(self.returns[:-1, ind])
            masks_batch.append(self.masks[:-1, ind])
            old_action_log_probs_batch.append(
                self.action_log_probs[:, ind])
            adv_targ.append(advantages[:, ind])

        T, N = self.num_steps, num_envs_per_batch

        # These are all tensors of size (T, N, -1)
        obs_batch = torch.stack(obs_batch, 1)
        actions_batch = torch.stack(actions_batch, 1)
        value_preds_batch = torch.stack(value_preds_batch, 1)
        return_batch = torch.stack(return_batch, 1)
        masks_batch = torch.stack(masks_batch, 1)
        old_action_log_probs_batch = torch.stack(
            old_action_log_probs_batch, 1)
        adv_targ = torch.stack(adv_targ, 1)

        # States is just a (N, -1) tensor
        recurrent_hidden_states_batch = torch.stack(
            recurrent_hidden_states_batch, 1).view(N, -1)

        # Flatten the (T, N, ...) tensors to (T * N, ...)
        obs_batch = _flatten_helper(T, N, obs_batch)
        actions_batch = _flatten_helper(T, N, actions_batch)
        value_preds_batch = _flatten_helper(T, N, value_preds_batch)
        return_batch = _flatten_helper(T, N, return_batch)
        masks_batch = _flatten_helper(T, N, masks_batch)
        old_action_log_probs_batch = _flatten_helper(T, N,
                                                     old_action_log_probs_batch)
        adv_targ = _flatten_helper(T, N, adv_targ)

        yield obs_batch, recurrent_hidden_states_batch, actions_batch, \
            value_preds_batch, return_batch, masks_batch, old_action_log_probs_batch, adv_targ
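
The helper _flatten_helper is used above but not shown. Judging from how it is called (collapsing a (T, N, ...) tensor into (T * N, ...)), a minimal implementation consistent with that usage would look like this:

def _flatten_helper(T, N, _tensor):
    # Merge the time and environment dimensions, keeping the remaining ones.
    return _tensor.view(T * N, *_tensor.size()[2:])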

Here the buffer consists of 8 per-process trajectories of 128 steps each. These 8 trajectories are first shuffled as whole blocks, and then 2 of them at a time are combined into a 128 * 2 data block. This preserves the temporal order within each trajectory while still splitting the data into 4 mini-batches, so that 4 rounds of sampling and training are performed and every PPO train step updates on all the data currently in the buffer.
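
To make the block-level shuffling concrete, here is a small standalone sketch (assuming num_processes = 8 and num_mini_batch = 4, as in this post) of how whole trajectories are permuted and then grouped two at a time into mini-batches:

import torch

num_processes, num_mini_batch = 8, 4
num_envs_per_batch = num_processes // num_mini_batch  # 2 trajectories per mini-batch

perm = torch.randperm(num_processes)  # shuffle whole trajectories, not individual steps
for start_ind in range(0, num_processes, num_envs_per_batch):
    inds = [perm[start_ind + offset].item() for offset in range(num_envs_per_batch)]
    # Each mini-batch keeps 2 complete 128-step trajectories intact,
    # so the temporal order inside every trajectory is preserved.
    print(inds)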


