Machine Learning Engineer - Udacity Reinforcement Learning, Part Nine

Deep Q-Learning: TensorFlow Implementation


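In this notebook, we will build a neural network that learns to play a game through reinforcement learning. Specifically, we will use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move left and right, and the goal is to keep the pole upright for as long as possible.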

We can simulate this game using OpenAI Gym. First, we will look at how OpenAI Gym works. Then, we will train an agent to play the Cart-Pole game.

import gym
import numpy as np

# Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

# Number of possible actions
print('Number of possible actions:', env.action_space.n)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Number of possible actions: 2

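We interact with the simulation through env. You can see how many actions are possible with env.action_space.n and get a random action with env.action_space.sample(). Passing an action (as an integer) to env.step advances the simulation by one step. Essentially all Gym games work this way.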

Cart-Pole has two possible actions: moving the cart left or moving it right. They are encoded as 0 and 1.

Run the code below to interact with the environment.

actions = [] # actions that the agent selects
rewards = [] # obtained rewards
state = env.reset()

while True:
    action = env.action_space.sample()  # choose a random action
    state, reward, done, _ = env.step(action) 
    rewards.append(reward)
    actions.append(action)
    if done:
        break

We can look at the actions and rewards:

print('Actions:', actions)
print('Rewards:', rewards)
Actions: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
Rewards: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

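The game resets once the pole tilts beyond a certain angle. While the game is still running, every step returns a reward of 1.0, so the longer the game runs, the more reward we collect. The network's goal is to maximize reward by keeping the pole vertical, which it does by moving the cart left and right.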

Q-Network

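To keep track of the action values, we will use a neural network that takes the state $s$ as input. The output will be the Q-value of each possible action, that is, all the action values $Q(s, a)$ corresponding to the input state $s$.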


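For this Cart-Pole game, the state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. The network therefore has four inputs, one for each value in the state, and two outputs, one for each possible action.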

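As discussed in the lesson, to obtain the training target we first use the context provided by the state $s$ to choose an action $a$, then simulate the game using that action. This gives us the next state $s'$ and the reward $r$. With these we can compute $\hat{Q}(s, a) = r + \gamma \max_{a'} Q(s', a')$. We then update the weights by minimizing $(\hat{Q}(s, a) - Q(s, a))^2$.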
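
As a worked instance of the target and loss above, the following sketch plugs in made-up numbers for a single transition. The reward, discount factor, and Q-values are illustrative assumptions, not values taken from this notebook.

```python
# Worked example of the Q-learning target and squared error for one transition.
# All numbers below are made up for illustration.
gamma = 0.99           # discount factor (assumed)
reward = 1.0           # r, the reward observed after taking action a
q_next = [0.5, 2.0]    # Q(s', a') for each action a' (hypothetical)
q_current = 2.5        # Q(s, a) predicted for the action taken (hypothetical)

# Target: Q_hat(s, a) = r + gamma * max_a' Q(s', a')
q_target = reward + gamma * max(q_next)

# Loss: (Q_hat(s, a) - Q(s, a))^2
loss = (q_target - q_current) ** 2

print(q_target, loss)
```

Here the target is built from the larger of the two next-state values, and the loss measures how far the current prediction is from that target.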

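Below is one implementation of the Q-network. It uses two fully connected layers with ReLU activations. Two layers seem to work well and three might work better, so feel free to experiment.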

import tensorflow as tf  # written against the TensorFlow 1.x graph API (tf.placeholder, tf.contrib)

class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4,
                 action_size=2, hidden_size=10,
                 name='QNetwork'):
        with tf.variable_scope(name):
            # State inputs to the Q-network
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            # Actions taken, one-hot encoded to later select the Q-value for each action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size,
                                                            activation_fn=None)
            # Train with loss (targetQ - Q)^2.
            # output has one column per action (two here). The next line picks
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
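
The step that multiplies the network output by the one-hot actions and sums over the action axis can be hard to picture. The following NumPy sketch reproduces that selection outside TensorFlow; the output values and actions are hypothetical, and the direct-indexing line is only there for comparison.

```python
import numpy as np

# Hypothetical network outputs for a batch of 3 states and 2 actions.
output = np.array([[0.1, 0.9],
                   [1.5, -0.3],
                   [0.4, 0.6]])
actions = np.array([1, 0, 0])  # action taken in each state (assumed)

# One-hot encode the actions: row i has a 1 in column actions[i].
one_hot_actions = np.eye(output.shape[1])[actions]

# Multiply and sum over the action axis: only the chosen action's Q-value survives.
q_selected = np.sum(output * one_hot_actions, axis=1)

# Equivalent direct indexing, shown for comparison.
q_indexed = output[np.arange(len(actions)), actions]

print(q_selected)
```

Both expressions pick one Q-value per row, so the loss only compares the target against the value of the action that was actually taken.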

Experience Replay

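Reinforcement learning algorithms can suffer from stability problems because consecutive states are correlated. To reduce this correlation during training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.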

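In the code cell below, we create a Memory object to store our experiences, that is, the transitions $\langle s, a, r, s' \rangle$. The memory has a maximum capacity, so that newer experiences are kept and the oldest ones are discarded. We then sample a random mini-batch of transitions $\langle s, a, r, s' \rangle$ and use it to train the agent.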

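The Memory object is implemented below. If you are not familiar with deque, it is a double-ended queue. Think of it as a tube that is open at both ends: you can push items in from either end, but once it is full, adding another item pushes one out of the other end. This makes it a good fit for a memory buffer.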

from collections import deque

class Memory():
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)
    def add(self, experience):
        self.buffer.append(experience)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
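
The eviction behaviour of a bounded deque described above can be checked with a few lines. This is a minimal sketch using placeholder strings instead of real transitions.

```python
from collections import deque

# A buffer that holds at most 3 items.
buffer = deque(maxlen=3)
for experience in ['e1', 'e2', 'e3', 'e4', 'e5']:
    buffer.append(experience)  # appending to a full deque drops the oldest item

print(list(buffer))  # → ['e3', 'e4', 'e5']
```

Because items are appended on the right, the oldest experiences fall off the left end once the capacity is reached.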

Q-Learning Training Algorithm

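We will use the algorithm below to train the network. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. Note that 195 is the solve threshold Gym defines for CartPole-v0, whose episodes are capped at 200 steps, whereas CartPole-v1 allows up to 500 steps per episode. The game ends if the pole tilts too far or if the cart moves too far to the left or right, and when it ends we can also start a new episode. Now, to train the agent: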

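  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode $\leftarrow 1$ to $M$ do
    • Observe $s_0$
    • For $t \leftarrow 0$ to $T-1$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a)$
      • Execute action $a_t$ in the simulator and observe the reward $r_{t+1}$ and the new state $s_{t+1}$
      • Store the transition $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$ in memory $D$
      • Sample a random mini-batch of transitions $\langle s_j, a_j, r_j, s'_j \rangle$ from $D$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'} Q(s'_j, a')$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor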
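
Two steps of the algorithm above, the epsilon-greedy action choice and the computation of the targets for a mini-batch, can be sketched in NumPy as follows. The function names, the seeded random generator, and the explicit `dones` array are assumptions made for this illustration; the notebook itself marks terminal transitions with an all-zero next state instead of a flag.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed here for reproducibility

def select_action(q_values, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def compute_targets(rewards, next_q, dones, gamma):
    """Targets are r_j for terminal transitions, r_j + gamma * max_a' Q(s'_j, a') otherwise."""
    max_next_q = np.max(next_q, axis=1)
    return rewards + gamma * max_next_q * (1.0 - dones)

# Hypothetical mini-batch of three transitions; the last one ends its episode.
rewards = np.array([1.0, 1.0, 1.0])
next_q = np.array([[0.2, 0.8],
                   [1.0, 0.5],
                   [3.0, 4.0]])
dones = np.array([0.0, 0.0, 1.0])

targets = compute_targets(rewards, next_q, dones, gamma=0.99)
greedy_action = select_action(np.array([0.1, 0.7]), epsilon=0.0)

print(targets, greedy_action)
```

With `epsilon=0.0` the choice is always greedy, and the terminal transition's target reduces to its reward alone.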

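We encourage you to take the time to extend this code with some of the improvements discussed in the lesson, such as fixed Q targets, double DQN, prioritized replay, and/or dueling networks.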
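
As a possible starting point for one of these extensions, here is a minimal NumPy sketch of how a double DQN target differs from a target computed with a single max. It assumes two sets of Q-value estimates for the next states, one from the online network and one from a separate, periodically updated target network. All arrays and names are hypothetical, and none of this is part of the notebook's code.

```python
import numpy as np

gamma = 0.99

# Hypothetical Q-values for the next states of a batch of 2 transitions.
q_online_next = np.array([[1.0, 2.0],   # online network: used to pick the action
                          [0.5, 0.1]])
q_target_next = np.array([[0.7, 1.5],   # target network: used to evaluate it
                          [0.9, 1.3]])
rewards = np.array([1.0, 1.0])

# Fixed-target DQN: take the max over the target network's own values.
dqn_targets = rewards + gamma * np.max(q_target_next, axis=1)

# Double DQN: select the action with the online network,
# then evaluate that action with the target network.
best_actions = np.argmax(q_online_next, axis=1)
double_dqn_targets = rewards + gamma * q_target_next[np.arange(len(rewards)), best_actions]

print(dqn_targets, double_dqn_targets)
```

In the second transition the two networks disagree about the best action, so the double DQN target is lower than the plain max, which is the effect that reduces overestimation.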

Hyperparameters

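One of the harder aspects of reinforcement learning is the large number of hyperparameters. We have to tune not only the network but also the simulation.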

train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob
# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate
# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)
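
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay formula that the training loop in this notebook applies to the exploration probability, using the values of explore_start, explore_stop, and decay_rate listed above. The helper name `explore_probability` and the sample step counts are assumptions for illustration.

```python
import numpy as np

explore_start = 1.0
explore_stop = 0.01
decay_rate = 0.0001

def explore_probability(step):
    """Exponentially decay the exploration probability with the global step count."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

for step in [0, 1000, 10000, 50000]:
    print(step, round(float(explore_probability(step)), 4))
```

With a decay rate this small, the agent still explores most of the time after a thousand steps and only approaches the explore_stop floor after tens of thousands of steps.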

Populating the Experience Memory

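Below we re-initialize the simulation and pre-populate the memory. The agent takes random actions and stores the resulting transitions in memory, which helps it explore the game.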

# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we train the agent.

# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    step = 0
    loss = float('nan')  # placeholder so the episode log works before the first training step
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
    saver.save(sess, "checkpoints/cartpole.ckpt")
Episode: 1 Total reward: 19.0 Training loss: 1.1389 Explore P: 0.9981
Episode: 2 Total reward: 27.0 Training loss: 1.0941 Explore P: 0.9955
Episode: 3 Total reward: 10.0 Training loss: 1.2043 Explore P: 0.9945
Episode: 4 Total reward: 19.0 Training loss: 1.3239 Explore P: 0.9926
Episode: 5 Total reward: 24.0 Training loss: 1.2904 Explore P: 0.9902
Episode: 6 Total reward: 18.0 Training loss: 1.0776 Explore P: 0.9885
Episode: 7 Total reward: 50.0 Training loss: 1.3426 Explore P: 0.9836
Episode: 8 Total reward: 15.0 Training loss: 1.2110 Explore P: 0.9821
Episode: 9 Total reward: 22.0 Training loss: 1.2152 Explore P: 0.9800
Episode: 10 Total reward: 21.0 Training loss: 1.5765 Explore P: 0.9780
Episode: 11 Total reward: 12.0 Training loss: 1.4658 Explore P: 0.9768
Episode: 12 Total reward: 28.0 Training loss: 1.3410 Explore P: 0.9741
Episode: 13 Total reward: 11.0 Training loss: 1.3264 Explore P: 0.9730
Episode: 14 Total reward: 7.0 Training loss: 1.5729 Explore P: 0.9724
Episode: 15 Total reward: 46.0 Training loss: 2.1239 Explore P: 0.9680
Episode: 16 Total reward: 14.0 Training loss: 1.6112 Explore P: 0.9666
Episode: 17 Total reward: 16.0 Training loss: 1.9459 Explore P: 0.9651
Episode: 18 Total reward: 18.0 Training loss: 2.2683 Explore P: 0.9634
Episode: 19 Total reward: 42.0 Training loss: 1.8491 Explore P: 0.9594
Episode: 20 Total reward: 9.0 Training loss: 2.1436 Explore P: 0.9585
Episode: 21 Total reward: 15.0 Training loss: 2.9827 Explore P: 0.9571
Episode: 22 Total reward: 25.0 Training loss: 2.7261 Explore P: 0.9547
Episode: 23 Total reward: 12.0 Training loss: 2.3194 Explore P: 0.9536
Episode: 24 Total reward: 26.0 Training loss: 3.3895 Explore P: 0.9512
Episode: 25 Total reward: 25.0 Training loss: 8.4542 Explore P: 0.9488
Episode: 26 Total reward: 36.0 Training loss: 4.5023 Explore P: 0.9454
Episode: 27 Total reward: 44.0 Training loss: 10.8840 Explore P: 0.9413
Episode: 28 Total reward: 12.0 Training loss: 9.0166 Explore P: 0.9402
Episode: 29 Total reward: 25.0 Training loss: 6.8834 Explore P: 0.9379
Episode: 30 Total reward: 34.0 Training loss: 4.7544 Explore P: 0.9347
Episode: 31 Total reward: 21.0 Training loss: 14.7541 Explore P: 0.9328
Episode: 32 Total reward: 12.0 Training loss: 13.9110 Explore P: 0.9317
Episode: 33 Total reward: 14.0 Training loss: 5.4864 Explore P: 0.9304
Episode: 34 Total reward: 15.0 Training loss: 6.1649 Explore P: 0.9290
Episode: 35 Total reward: 14.0 Training loss: 16.4675 Explore P: 0.9277
Episode: 36 Total reward: 9.0 Training loss: 7.2353 Explore P: 0.9269
Episode: 37 Total reward: 30.0 Training loss: 18.3005 Explore P: 0.9242
Episode: 38 Total reward: 8.0 Training loss: 6.8947 Explore P: 0.9234
Episode: 39 Total reward: 17.0 Training loss: 3.6688 Explore P: 0.9219
Episode: 40 Total reward: 13.0 Training loss: 11.4010 Explore P: 0.9207
Episode: 41 Total reward: 15.0 Training loss: 37.5365 Explore P: 0.9193
Episode: 42 Total reward: 25.0 Training loss: 7.8100 Explore P: 0.9171
Episode: 43 Total reward: 16.0 Training loss: 30.1224 Explore P: 0.9156
Episode: 44 Total reward: 13.0 Training loss: 12.0979 Explore P: 0.9144
Episode: 45 Total reward: 23.0 Training loss: 79.7526 Explore P: 0.9124
Episode: 46 Total reward: 9.0 Training loss: 19.1093 Explore P: 0.9115
Episode: 47 Total reward: 12.0 Training loss: 48.1066 Explore P: 0.9105
Episode: 48 Total reward: 18.0 Training loss: 37.4678 Explore P: 0.9088
Episode: 49 Total reward: 10.0 Training loss: 30.0274 Explore P: 0.9079
Episode: 50 Total reward: 13.0 Training loss: 11.4602 Explore P: 0.9068
Episode: 51 Total reward: 18.0 Training loss: 16.1100 Explore P: 0.9052
Episode: 52 Total reward: 25.0 Training loss: 31.1109 Explore P: 0.9029
Episode: 53 Total reward: 27.0 Training loss: 11.9219 Explore P: 0.9005
Episode: 54 Total reward: 10.0 Training loss: 17.6061 Explore P: 0.8996
Episode: 55 Total reward: 8.0 Training loss: 69.7947 Explore P: 0.8989
Episode: 56 Total reward: 10.0 Training loss: 22.9729 Explore P: 0.8980
Episode: 57 Total reward: 58.0 Training loss: 166.3847 Explore P: 0.8929
Episode: 58 Total reward: 13.0 Training loss: 47.1774 Explore P: 0.8917
Episode: 59 Total reward: 23.0 Training loss: 99.2384 Explore P: 0.8897
Episode: 60 Total reward: 15.0 Training loss: 188.8578 Explore P: 0.8884
Episode: 61 Total reward: 11.0 Training loss: 55.4827 Explore P: 0.8874
Episode: 62 Total reward: 34.0 Training loss: 37.8267 Explore P: 0.8845
Episode: 63 Total reward: 29.0 Training loss: 21.7541 Explore P: 0.8819
Episode: 64 Total reward: 13.0 Training loss: 174.5038 Explore P: 0.8808
Episode: 65 Total reward: 11.0 Training loss: 51.1919 Explore P: 0.8798
Episode: 66 Total reward: 16.0 Training loss: 47.8301 Explore P: 0.8784
Episode: 67 Total reward: 24.0 Training loss: 155.9641 Explore P: 0.8764
Episode: 68 Total reward: 18.0 Training loss: 64.6396 Explore P: 0.8748
Episode: 69 Total reward: 33.0 Training loss: 39.7262 Explore P: 0.8720
Episode: 70 Total reward: 36.0 Training loss: 368.6354 Explore P: 0.8689
Episode: 71 Total reward: 49.0 Training loss: 61.7042 Explore P: 0.8647
Episode: 72 Total reward: 51.0 Training loss: 41.2167 Explore P: 0.8603
Episode: 73 Total reward: 11.0 Training loss: 58.3102 Explore P: 0.8594
Episode: 74 Total reward: 21.0 Training loss: 89.0633 Explore P: 0.8576
Episode: 75 Total reward: 9.0 Training loss: 151.4177 Explore P: 0.8568
Episode: 76 Total reward: 17.0 Training loss: 123.7854 Explore P: 0.8554
Episode: 77 Total reward: 30.0 Training loss: 237.1334 Explore P: 0.8529
Episode: 78 Total reward: 27.0 Training loss: 161.7380 Explore P: 0.8506
Episode: 79 Total reward: 12.0 Training loss: 78.3928 Explore P: 0.8496
Episode: 80 Total reward: 14.0 Training loss: 40.9375 Explore P: 0.8484
Episode: 81 Total reward: 41.0 Training loss: 50.7743 Explore P: 0.8450
Episode: 82 Total reward: 20.0 Training loss: 47.9131 Explore P: 0.8433
Episode: 83 Total reward: 10.0 Training loss: 125.2967 Explore P: 0.8425
Episode: 84 Total reward: 28.0 Training loss: 65.5926 Explore P: 0.8401
Episode: 85 Total reward: 9.0 Training loss: 381.4973 Explore P: 0.8394
Episode: 86 Total reward: 29.0 Training loss: 185.7411 Explore P: 0.8370
Episode: 87 Total reward: 9.0 Training loss: 1201.2051 Explore P: 0.8363
Episode: 88 Total reward: 12.0 Training loss: 129.6992 Explore P: 0.8353
Episode: 89 Total reward: 21.0 Training loss: 181.3655 Explore P: 0.8335
Episode: 90 Total reward: 12.0 Training loss: 49.1045 Explore P: 0.8325
Episode: 91 Total reward: 36.0 Training loss: 39.3167 Explore P: 0.8296
Episode: 92 Total reward: 7.0 Training loss: 573.7310 Explore P: 0.8290
Episode: 93 Total reward: 9.0 Training loss: 193.5926 Explore P: 0.8283
Episode: 94 Total reward: 10.0 Training loss: 53.2681 Explore P: 0.8275
Episode: 95 Total reward: 24.0 Training loss: 666.5759 Explore P: 0.8255
Episode: 96 Total reward: 10.0 Training loss: 688.6807 Explore P: 0.8247
Episode: 97 Total reward: 11.0 Training loss: 59.9440 Explore P: 0.8238
Episode: 98 Total reward: 10.0 Training loss: 252.5545 Explore P: 0.8230
Episode: 99 Total reward: 15.0 Training loss: 743.6579 Explore P: 0.8218
Episode: 100 Total reward: 21.0 Training loss: 52.8670 Explore P: 0.8201
Episode: 101 Total reward: 13.0 Training loss: 56.7770 Explore P: 0.8190
Episode: 102 Total reward: 18.0 Training loss: 67.9279 Explore P: 0.8175
Episode: 103 Total reward: 19.0 Training loss: 93.5078 Explore P: 0.8160
Episode: 104 Total reward: 24.0 Training loss: 825.1201 Explore P: 0.8141
Episode: 105 Total reward: 11.0 Training loss: 469.6794 Explore P: 0.8132
Episode: 106 Total reward: 16.0 Training loss: 172.7854 Explore P: 0.8119
Episode: 107 Total reward: 16.0 Training loss: 150.6912 Explore P: 0.8106
Episode: 108 Total reward: 16.0 Training loss: 791.6756 Explore P: 0.8094
Episode: 109 Total reward: 11.0 Training loss: 1603.6362 Explore P: 0.8085
Episode: 110 Total reward: 22.0 Training loss: 1546.7262 Explore P: 0.8067
Episode: 111 Total reward: 9.0 Training loss: 650.3567 Explore P: 0.8060
Episode: 112 Total reward: 15.0 Training loss: 408.9959 Explore P: 0.8048
Episode: 113 Total reward: 12.0 Training loss: 648.9672 Explore P: 0.8039
Episode: 114 Total reward: 15.0 Training loss: 60.1556 Explore P: 0.8027
Episode: 115 Total reward: 11.0 Training loss: 149.0513 Explore P: 0.8018
Episode: 116 Total reward: 20.0 Training loss: 66.6188 Explore P: 0.8002
Episode: 117 Total reward: 23.0 Training loss: 1193.2603 Explore P: 0.7984
Episode: 118 Total reward: 44.0 Training loss: 523.1418 Explore P: 0.7949
Episode: 119 Total reward: 16.0 Training loss: 208.3392 Explore P: 0.7937
Episode: 120 Total reward: 13.0 Training loss: 79.3734 Explore P: 0.7927
Episode: 121 Total reward: 15.0 Training loss: 58.2415 Explore P: 0.7915
Episode: 122 Total reward: 14.0 Training loss: 51.8719 Explore P: 0.7904
Episode: 123 Total reward: 16.0 Training loss: 72.1882 Explore P: 0.7892
Episode: 124 Total reward: 16.0 Training loss: 687.3945 Explore P: 0.7879
Episode: 125 Total reward: 11.0 Training loss: 55.8961 Explore P: 0.7871
Episode: 126 Total reward: 12.0 Training loss: 225.4171 Explore P: 0.7861
Episode: 127 Total reward: 51.0 Training loss: 640.1474 Explore P: 0.7822
Episode: 128 Total reward: 13.0 Training loss: 59.6254 Explore P: 0.7812
Episode: 129 Total reward: 21.0 Training loss: 630.3236 Explore P: 0.7795
Episode: 130 Total reward: 14.0 Training loss: 60.7224 Explore P: 0.7785
Episode: 131 Total reward: 11.0 Training loss: 204.4186 Explore P: 0.7776
Episode: 132 Total reward: 19.0 Training loss: 250.1553 Explore P: 0.7762
Episode: 133 Total reward: 10.0 Training loss: 81.5859 Explore P: 0.7754
Episode: 134 Total reward: 29.0 Training loss: 220.5163 Explore P: 0.7732
Episode: 135 Total reward: 23.0 Training loss: 958.7322 Explore P: 0.7714
Episode: 136 Total reward: 12.0 Training loss: 555.2710 Explore P: 0.7705
Episode: 137 Total reward: 12.0 Training loss: 494.0706 Explore P: 0.7696
Episode: 138 Total reward: 33.0 Training loss: 625.9376 Explore P: 0.7671
Episode: 139 Total reward: 19.0 Training loss: 377.5651 Explore P: 0.7657
Episode: 140 Total reward: 12.0 Training loss: 313.5325 Explore P: 0.7648
Episode: 141 Total reward: 13.0 Training loss: 437.6555 Explore P: 0.7638
Episode: 142 Total reward: 10.0 Training loss: 261.3405 Explore P: 0.7630
Episode: 143 Total reward: 13.0 Training loss: 526.2977 Explore P: 0.7621
Episode: 144 Total reward: 15.0 Training loss: 57.1256 Explore P: 0.7609
Episode: 145 Total reward: 23.0 Training loss: 42.2737 Explore P: 0.7592
Episode: 146 Total reward: 18.0 Training loss: 1677.1479 Explore P: 0.7579
Episode: 147 Total reward: 18.0 Training loss: 503.4913 Explore P: 0.7565
Episode: 148 Total reward: 75.0 Training loss: 530.5206 Explore P: 0.7509
Episode: 149 Total reward: 26.0 Training loss: 464.7852 Explore P: 0.7490
Episode: 150 Total reward: 17.0 Training loss: 1177.1429 Explore P: 0.7477
Episode: 151 Total reward: 11.0 Training loss: 30.3916 Explore P: 0.7469
Episode: 152 Total reward: 51.0 Training loss: 1696.0283 Explore P: 0.7432
Episode: 153 Total reward: 22.0 Training loss: 357.7746 Explore P: 0.7416
Episode: 154 Total reward: 20.0 Training loss: 262.0337 Explore P: 0.7401
Episode: 155 Total reward: 15.0 Training loss: 1049.1298 Explore P: 0.7390
Episode: 156 Total reward: 10.0 Training loss: 725.1987 Explore P: 0.7383
Episode: 157 Total reward: 27.0 Training loss: 261.7267 Explore P: 0.7363
Episode: 158 Total reward: 29.0 Training loss: 19.1248 Explore P: 0.7342
Episode: 159 Total reward: 8.0 Training loss: 18.3619 Explore P: 0.7336
Episode: 160 Total reward: 32.0 Training loss: 264.7924 Explore P: 0.7313
Episode: 161 Total reward: 31.0 Training loss: 26.7975 Explore P: 0.7291
Episode: 162 Total reward: 34.0 Training loss: 20.7654 Explore P: 0.7267
Episode: 163 Total reward: 14.0 Training loss: 946.9641 Explore P: 0.7257
Episode: 164 Total reward: 8.0 Training loss: 973.4227 Explore P: 0.7251
Episode: 165 Total reward: 20.0 Training loss: 25.1839 Explore P: 0.7237
Episode: 166 Total reward: 14.0 Training loss: 19.6044 Explore P: 0.7227
Episode: 167 Total reward: 11.0 Training loss: 17.0158 Explore P: 0.7219
Episode: 168 Total reward: 15.0 Training loss: 313.3605 Explore P: 0.7208
Episode: 169 Total reward: 15.0 Training loss: 788.2931 Explore P: 0.7197
Episode: 170 Total reward: 23.0 Training loss: 892.7424 Explore P: 0.7181
Episode: 171 Total reward: 14.0 Training loss: 846.1658 Explore P: 0.7171
Episode: 172 Total reward: 36.0 Training loss: 733.0235 Explore P: 0.7146
Episode: 173 Total reward: 32.0 Training loss: 375.6190 Explore P: 0.7123
Episode: 174 Total reward: 22.0 Training loss: 20.2878 Explore P: 0.7108
Episode: 175 Total reward: 24.0 Training loss: 266.7317 Explore P: 0.7091
Episode: 176 Total reward: 11.0 Training loss: 15.4090 Explore P: 0.7083
Episode: 177 Total reward: 12.0 Training loss: 833.2009 Explore P: 0.7075
Episode: 178 Total reward: 23.0 Training loss: 305.5236 Explore P: 0.7059
Episode: 179 Total reward: 50.0 Training loss: 273.9768 Explore P: 0.7024
Episode: 180 Total reward: 24.0 Training loss: 382.5863 Explore P: 0.7008
Episode: 181 Total reward: 13.0 Training loss: 8.8325 Explore P: 0.6999
Episode: 182 Total reward: 26.0 Training loss: 11.1730 Explore P: 0.6981
Episode: 183 Total reward: 10.0 Training loss: 8.0565 Explore P: 0.6974
Episode: 184 Total reward: 14.0 Training loss: 331.3760 Explore P: 0.6964
Episode: 185 Total reward: 7.0 Training loss: 595.5818 Explore P: 0.6960
Episode: 186 Total reward: 21.0 Training loss: 12.9129 Explore P: 0.6945
Episode: 187 Total reward: 34.0 Training loss: 384.9835 Explore P: 0.6922
Episode: 188 Total reward: 10.0 Training loss: 525.0700 Explore P: 0.6915
Episode: 189 Total reward: 8.0 Training loss: 9.4292 Explore P: 0.6910
Episode: 190 Total reward: 26.0 Training loss: 10.6541 Explore P: 0.6892
Episode: 191 Total reward: 12.0 Training loss: 9.2568 Explore P: 0.6884
Episode: 192 Total reward: 12.0 Training loss: 324.2149 Explore P: 0.6876
Episode: 193 Total reward: 21.0 Training loss: 10.4516 Explore P: 0.6861
Episode: 194 Total reward: 15.0 Training loss: 613.4329 Explore P: 0.6851
Episode: 195 Total reward: 15.0 Training loss: 12.0188 Explore P: 0.6841
Episode: 196 Total reward: 43.0 Training loss: 6.3281 Explore P: 0.6812
Episode: 197 Total reward: 18.0 Training loss: 537.2289 Explore P: 0.6800
Episode: 198 Total reward: 17.0 Training loss: 222.4585 Explore P: 0.6789
Episode: 199 Total reward: 15.0 Training loss: 4.3755 Explore P: 0.6779
Episode: 200 Total reward: 10.0 Training loss: 318.6465 Explore P: 0.6772
Episode: 201 Total reward: 9.0 Training loss: 7.5589 Explore P: 0.6766
Episode: 202 Total reward: 13.0 Training loss: 552.2013 Explore P: 0.6757
Episode: 203 Total reward: 9.0 Training loss: 5.4712 Explore P: 0.6751
Episode: 204 Total reward: 19.0 Training loss: 418.2550 Explore P: 0.6739
Episode: 205 Total reward: 14.0 Training loss: 242.6860 Explore P: 0.6730
Episode: 206 Total reward: 12.0 Training loss: 701.7039 Explore P: 0.6722
Episode: 207 Total reward: 12.0 Training loss: 694.5750 Explore P: 0.6714
Episode: 208 Total reward: 10.0 Training loss: 232.5367 Explore P: 0.6707
Episode: 209 Total reward: 17.0 Training loss: 229.9765 Explore P: 0.6696
Episode: 210 Total reward: 14.0 Training loss: 252.4525 Explore P: 0.6687
Episode: 211 Total reward: 20.0 Training loss: 200.7635 Explore P: 0.6673
Episode: 212 Total reward: 11.0 Training loss: 236.7219 Explore P: 0.6666
Episode: 213 Total reward: 16.0 Training loss: 659.2346 Explore P: 0.6656
Episode: 214 Total reward: 22.0 Training loss: 579.0916 Explore P: 0.6641
Episode: 215 Total reward: 15.0 Training loss: 5.0113 Explore P: 0.6631
Episode: 216 Total reward: 19.0 Training loss: 550.9601 Explore P: 0.6619
Episode: 217 Total reward: 11.0 Training loss: 6.6383 Explore P: 0.6612
Episode: 218 Total reward: 11.0 Training loss: 202.1263 Explore P: 0.6605
Episode: 219 Total reward: 11.0 Training loss: 160.7769 Explore P: 0.6598
Episode: 220 Total reward: 12.0 Training loss: 179.7851 Explore P: 0.6590
Episode: 221 Total reward: 12.0 Training loss: 523.3667 Explore P: 0.6582
Episode: 222 Total reward: 15.0 Training loss: 185.3723 Explore P: 0.6572
Episode: 223 Total reward: 17.0 Training loss: 178.1803 Explore P: 0.6561
Episode: 224 Total reward: 15.0 Training loss: 2.4123 Explore P: 0.6552
Episode: 225 Total reward: 17.0 Training loss: 2.1791 Explore P: 0.6541
Episode: 226 Total reward: 19.0 Training loss: 161.2964 Explore P: 0.6528
Episode: 227 Total reward: 7.0 Training loss: 1.7880 Explore P: 0.6524
Episode: 228 Total reward: 17.0 Training loss: 1.3397 Explore P: 0.6513
Episode: 229 Total reward: 11.0 Training loss: 420.3036 Explore P: 0.6506
Episode: 230 Total reward: 10.0 Training loss: 134.4979 Explore P: 0.6500
Episode: 231 Total reward: 15.0 Training loss: 225.6664 Explore P: 0.6490
Episode: 232 Total reward: 18.0 Training loss: 1.6078 Explore P: 0.6479
Episode: 233 Total reward: 28.0 Training loss: 2.6723 Explore P: 0.6461
Episode: 234 Total reward: 15.0 Training loss: 3.2360 Explore P: 0.6451
Episode: 235 Total reward: 17.0 Training loss: 249.5189 Explore P: 0.6440
Episode: 236 Total reward: 9.0 Training loss: 2.9458 Explore P: 0.6435
Episode: 237 Total reward: 12.0 Training loss: 2.2466 Explore P: 0.6427
Episode: 238 Total reward: 12.0 Training loss: 225.7894 Explore P: 0.6419
Episode: 239 Total reward: 14.0 Training loss: 177.5147 Explore P: 0.6411
Episode: 240 Total reward: 16.0 Training loss: 2.5767 Explore P: 0.6401
Episode: 241 Total reward: 15.0 Training loss: 4.4297 Explore P: 0.6391
Episode: 242 Total reward: 9.0 Training loss: 403.2294 Explore P: 0.6385
Episode: 243 Total reward: 10.0 Training loss: 137.8289 Explore P: 0.6379
Episode: 244 Total reward: 11.0 Training loss: 169.0499 Explore P: 0.6372
Episode: 245 Total reward: 9.0 Training loss: 3.8267 Explore P: 0.6367
Episode: 246 Total reward: 9.0 Training loss: 225.4438 Explore P: 0.6361
Episode: 247 Total reward: 13.0 Training loss: 454.1537 Explore P: 0.6353
Episode: 248 Total reward: 10.0 Training loss: 124.2338 Explore P: 0.6347
Episode: 249 Total reward: 12.0 Training loss: 120.7217 Explore P: 0.6339
Episode: 250 Total reward: 29.0 Training loss: 114.0817 Explore P: 0.6321
Episode: 251 Total reward: 8.0 Training loss: 116.2312 Explore P: 0.6316
Episode: 252 Total reward: 12.0 Training loss: 2.6206 Explore P: 0.6309
Episode: 253 Total reward: 10.0 Training loss: 112.9861 Explore P: 0.6302
Episode: 254 Total reward: 8.0 Training loss: 136.0334 Explore P: 0.6297
Episode: 255 Total reward: 9.0 Training loss: 3.1004 Explore P: 0.6292
Episode: 256 Total reward: 9.0 Training loss: 125.0454 Explore P: 0.6286
Episode: 257 Total reward: 17.0 Training loss: 184.0612 Explore P: 0.6276
Episode: 258 Total reward: 8.0 Training loss: 90.1252 Explore P: 0.6271
Episode: 259 Total reward: 9.0 Training loss: 317.5232 Explore P: 0.6265
Episode: 260 Total reward: 7.0 Training loss: 3.0928 Explore P: 0.6261
Episode: 261 Total reward: 17.0 Training loss: 100.7538 Explore P: 0.6251
Episode: 262 Total reward: 9.0 Training loss: 2.9335 Explore P: 0.6245
Episode: 263 Total reward: 18.0 Training loss: 192.3180 Explore P: 0.6234
Episode: 264 Total reward: 8.0 Training loss: 4.3705 Explore P: 0.6229
Episode: 265 Total reward: 18.0 Training loss: 200.3586 Explore P: 0.6218
Episode: 266 Total reward: 16.0 Training loss: 91.3879 Explore P: 0.6208
Episode: 267 Total reward: 26.0 Training loss: 92.0807 Explore P: 0.6192
Episode: 268 Total reward: 18.0 Training loss: 3.7110 Explore P: 0.6181
Episode: 269 Total reward: 9.0 Training loss: 609.8471 Explore P: 0.6176
Episode: 270 Total reward: 15.0 Training loss: 1.8198 Explore P: 0.6167
Episode: 271 Total reward: 12.0 Training loss: 2.1185 Explore P: 0.6160
Episode: 272 Total reward: 24.0 Training loss: 1.9744 Explore P: 0.6145
Episode: 273 Total reward: 24.0 Training loss: 249.7179 Explore P: 0.6131
Episode: 274 Total reward: 15.0 Training loss: 193.5612 Explore P: 0.6121
Episode: 275 Total reward: 10.0 Training loss: 6.9737 Explore P: 0.6115
Episode: 276 Total reward: 16.0 Training loss: 2.4565 Explore P: 0.6106
Episode: 277 Total reward: 16.0 Training loss: 525.9277 Explore P: 0.6096
Episode: 278 Total reward: 12.0 Training loss: 83.4610 Explore P: 0.6089
Episode: 279 Total reward: 11.0 Training loss: 77.5233 Explore P: 0.6082
Episode: 280 Total reward: 18.0 Training loss: 4.9265 Explore P: 0.6072
Episode: 281 Total reward: 9.0 Training loss: 269.1309 Explore P: 0.6066
Episode: 282 Total reward: 18.0 Training loss: 2.8348 Explore P: 0.6056
Episode: 283 Total reward: 11.0 Training loss: 601.5049 Explore P: 0.6049
Episode: 284 Total reward: 8.0 Training loss: 161.8174 Explore P: 0.6044
Episode: 285 Total reward: 14.0 Training loss: 82.9931 Explore P: 0.6036
Episode: 286 Total reward: 7.0 Training loss: 196.3243 Explore P: 0.6032
Episode: 287 Total reward: 9.0 Training loss: 74.6766 Explore P: 0.6027
Episode: 288 Total reward: 10.0 Training loss: 306.0176 Explore P: 0.6021
Episode: 289 Total reward: 10.0 Training loss: 208.0414 Explore P: 0.6015
Episode: 290 Total reward: 11.0 Training loss: 4.2988 Explore P: 0.6008
Episode: 291 Total reward: 19.0 Training loss: 66.9396 Explore P: 0.5997
Episode: 292 Total reward: 9.0 Training loss: 4.1911 Explore P: 0.5992
Episode: 293 Total reward: 9.0 Training loss: 201.6795 Explore P: 0.5986
Episode: 294 Total reward: 15.0 Training loss: 6.0142 Explore P: 0.5978
Episode: 295 Total reward: 10.0 Training loss: 135.2905 Explore P: 0.5972
Episode: 296 Total reward: 13.0 Training loss: 134.8062 Explore P: 0.5964
Episode: 297 Total reward: 9.0 Training loss: 6.2818 Explore P: 0.5959
Episode: 298 Total reward: 42.0 Training loss: 179.0860 Explore P: 0.5934
Episode: 299 Total reward: 16.0 Training loss: 118.0448 Explore P: 0.5925
Episode: 300 Total reward: 13.0 Training loss: 65.8207 Explore P: 0.5917
Episode: 301 Total reward: 9.0 Training loss: 6.0449 Explore P: 0.5912
Episode: 302 Total reward: 11.0 Training loss: 186.6886 Explore P: 0.5906
Episode: 303 Total reward: 32.0 Training loss: 2.0367 Explore P: 0.5887
Episode: 304 Total reward: 16.0 Training loss: 7.6165 Explore P: 0.5878
Episode: 305 Total reward: 10.0 Training loss: 89.1637 Explore P: 0.5872
Episode: 306 Total reward: 21.0 Training loss: 57.3157 Explore P: 0.5860
Episode: 307 Total reward: 13.0 Training loss: 4.0560 Explore P: 0.5853
Episode: 308 Total reward: 11.0 Training loss: 231.8216 Explore P: 0.5846
Episode: 309 Total reward: 12.0 Training loss: 67.1159 Explore P: 0.5839
Episode: 310 Total reward: 15.0 Training loss: 5.7401 Explore P: 0.5831
Episode: 311 Total reward: 12.0 Training loss: 61.6580 Explore P: 0.5824
Episode: 312 Total reward: 14.0 Training loss: 5.1575 Explore P: 0.5816
Episode: 313 Total reward: 9.0 Training loss: 8.3886 Explore P: 0.5811
Episode: 314 Total reward: 10.0 Training loss: 236.4287 Explore P: 0.5805
Episode: 315 Total reward: 19.0 Training loss: 2.6279 Explore P: 0.5794
Episode: 316 Total reward: 18.0 Training loss: 287.5316 Explore P: 0.5784
Episode: 317 Total reward: 9.0 Training loss: 60.8404 Explore P: 0.5779
Episode: 318 Total reward: 16.0 Training loss: 185.3566 Explore P: 0.5770
Episode: 319 Total reward: 17.0 Training loss: 151.8662 Explore P: 0.5760
Episode: 320 Total reward: 26.0 Training loss: 59.6630 Explore P: 0.5745
Episode: 321 Total reward: 19.0 Training loss: 361.0844 Explore P: 0.5735
Episode: 322 Total reward: 15.0 Training loss: 54.1653 Explore P: 0.5726
Episode: 323 Total reward: 21.0 Training loss: 48.1900 Explore P: 0.5714
Episode: 324 Total reward: 11.0 Training loss: 64.6456 Explore P: 0.5708
Episode: 325 Total reward: 10.0 Training loss: 163.7968 Explore P: 0.5703
Episode: 326 Total reward: 13.0 Training loss: 57.2604 Explore P: 0.5695
Episode: 327 Total reward: 36.0 Training loss: 3.2904 Explore P: 0.5675
Episode: 328 Total reward: 15.0 Training loss: 51.3793 Explore P: 0.5667
Episode: 329 Total reward: 13.0 Training loss: 5.0598 Explore P: 0.5660
Episode: 330 Total reward: 9.0 Training loss: 5.3937 Explore P: 0.5655
Episode: 331 Total reward: 8.0 Training loss: 184.5423 Explore P: 0.5650
Episode: 332 Total reward: 11.0 Training loss: 57.2458 Explore P: 0.5644
Episode: 333 Total reward: 15.0 Training loss: 143.3090 Explore P: 0.5636
Episode: 334 Total reward: 8.0 Training loss: 92.1967 Explore P: 0.5631
Episode: 335 Total reward: 24.0 Training loss: 57.2746 Explore P: 0.5618
Episode: 336 Total reward: 13.0 Training loss: 677.0651 Explore P: 0.5611
Episode: 337 Total reward: 10.0 Training loss: 203.0349 Explore P: 0.5605
Episode: 338 Total reward: 16.0 Training loss: 247.5807 Explore P: 0.5597
Episode: 339 Total reward: 11.0 Training loss: 146.1578 Explore P: 0.5591
Episode: 340 Total reward: 21.0 Training loss: 237.6844 Explore P: 0.5579
Episode: 341 Total reward: 13.0 Training loss: 64.5212 Explore P: 0.5572
Episode: 342 Total reward: 17.0 Training loss: 54.5668 Explore P: 0.5563
Episode: 343 Total reward: 18.0 Training loss: 182.6976 Explore P: 0.5553
Episode: 344 Total reward: 17.0 Training loss: 52.1114 Explore P: 0.5544
Episode: 345 Total reward: 13.0 Training loss: 120.9219 Explore P: 0.5536
Episode: 346 Total reward: 9.0 Training loss: 51.8384 Explore P: 0.5532
Episode: 347 Total reward: 10.0 Training loss: 100.0393 Explore P: 0.5526
Episode: 348 Total reward: 25.0 Training loss: 40.9022 Explore P: 0.5513
Episode: 349 Total reward: 11.0 Training loss: 304.7490 Explore P: 0.5507
Episode: 350 Total reward: 10.0 Training loss: 171.5694 Explore P: 0.5501
Episode: 351 Total reward: 13.0 Training loss: 4.8219 Explore P: 0.5494
Episode: 352 Total reward: 10.0 Training loss: 158.3436 Explore P: 0.5489
Episode: 353 Total reward: 13.0 Training loss: 57.5901 Explore P: 0.5482
Episode: 354 Total reward: 14.0 Training loss: 191.3822 Explore P: 0.5474
Episode: 355 Total reward: 15.0 Training loss: 127.2292 Explore P: 0.5466
Episode: 356 Total reward: 20.0 Training loss: 44.8374 Explore P: 0.5456
Episode: 357 Total reward: 15.0 Training loss: 42.9976 Explore P: 0.5448
Episode: 358 Total reward: 22.0 Training loss: 4.7700 Explore P: 0.5436
Episode: 359 Total reward: 27.0 Training loss: 333.4543 Explore P: 0.5421
Episode: 360 Total reward: 14.0 Training loss: 3.5138 Explore P: 0.5414
Episode: 361 Total reward: 11.0 Training loss: 3.6784 Explore P: 0.5408
Episode: 362 Total reward: 13.0 Training loss: 4.2295 Explore P: 0.5401
Episode: 363 Total reward: 12.0 Training loss: 39.8406 Explore P: 0.5395
Episode: 364 Total reward: 12.0 Training loss: 47.8375 Explore P: 0.5389
Episode: 365 Total reward: 14.0 Training loss: 2.7068 Explore P: 0.5381
Episode: 366 Total reward: 11.0 Training loss: 35.2376 Explore P: 0.5375
Episode: 367 Total reward: 8.0 Training loss: 150.9313 Explore P: 0.5371
Episode: 368 Total reward: 17.0 Training loss: 53.7887 Explore P: 0.5362
Episode: 369 Total reward: 7.0 Training loss: 4.1688 Explore P: 0.5358
Episode: 370 Total reward: 13.0 Training loss: 104.9504 Explore P: 0.5352
Episode: 371 Total reward: 17.0 Training loss: 2.3102 Explore P: 0.5343
Episode: 372 Total reward: 15.0 Training loss: 168.4454 Explore P: 0.5335
Episode: 373 Total reward: 17.0 Training loss: 4.7997 Explore P: 0.5326
Episode: 374 Total reward: 18.0 Training loss: 41.2610 Explore P: 0.5317
Episode: 375 Total reward: 9.0 Training loss: 40.6870 Explore P: 0.5312
Episode: 376 Total reward: 16.0 Training loss: 118.0745 Explore P: 0.5304
Episode: 377 Total reward: 14.0 Training loss: 3.9151 Explore P: 0.5296
Episode: 378 Total reward: 21.0 Training loss: 133.7375 Explore P: 0.5285
Episode: 379 Total reward: 7.0 Training loss: 4.1724 Explore P: 0.5282
Episode: 380 Total reward: 9.0 Training loss: 43.2680 Explore P: 0.5277
Episode: 381 Total reward: 24.0 Training loss: 3.9858 Explore P: 0.5265
Episode: 382 Total reward: 10.0 Training loss: 2.5373 Explore P: 0.5259
Episode: 383 Total reward: 8.0 Training loss: 5.8194 Explore P: 0.5255
Episode: 384 Total reward: 11.0 Training loss: 3.9965 Explore P: 0.5250
Episode: 385 Total reward: 21.0 Training loss: 103.4828 Explore P: 0.5239
Episode: 386 Total reward: 14.0 Training loss: 4.5170 Explore P: 0.5232
Episode: 387 Total reward: 11.0 Training loss: 143.1322 Explore P: 0.5226
Episode: 388 Total reward: 12.0 Training loss: 142.0182 Explore P: 0.5220
Episode: 389 Total reward: 9.0 Training loss: 3.2205 Explore P: 0.5215
Episode: 390 Total reward: 18.0 Training loss: 198.6030 Explore P: 0.5206
Episode: 391 Total reward: 20.0 Training loss: 1.9397 Explore P: 0.5196
Episode: 392 Total reward: 19.0 Training loss: 32.7865 Explore P: 0.5186
Episode: 393 Total reward: 25.0 Training loss: 119.3021 Explore P: 0.5174
Episode: 394 Total reward: 10.0 Training loss: 108.6656 Explore P: 0.5168
Episode: 395 Total reward: 12.0 Training loss: 83.9824 Explore P: 0.5162
Episode: 396 Total reward: 12.0 Training loss: 2.5211 Explore P: 0.5156
Episode: 397 Total reward: 8.0 Training loss: 37.9082 Explore P: 0.5152
Episode: 398 Total reward: 13.0 Training loss: 2.9801 Explore P: 0.5146
Episode: 399 Total reward: 12.0 Training loss: 39.7969 Explore P: 0.5140
Episode: 400 Total reward: 10.0 Training loss: 70.4514 Explore P: 0.5135
Episode: 401 Total reward: 37.0 Training loss: 161.0591 Explore P: 0.5116
Episode: 402 Total reward: 13.0 Training loss: 38.3266 Explore P: 0.5109
Episode: 403 Total reward: 12.0 Training loss: 257.2733 Explore P: 0.5103
Episode: 404 Total reward: 14.0 Training loss: 175.0571 Explore P: 0.5096
Episode: 405 Total reward: 13.0 Training loss: 359.1575 Explore P: 0.5090
Episode: 406 Total reward: 13.0 Training loss: 224.0672 Explore P: 0.5084
Episode: 407 Total reward: 11.0 Training loss: 5.6231 Explore P: 0.5078
Episode: 408 Total reward: 11.0 Training loss: 188.7537 Explore P: 0.5073
Episode: 409 Total reward: 8.0 Training loss: 5.3791 Explore P: 0.5069
Episode: 410 Total reward: 9.0 Training loss: 139.6608 Explore P: 0.5064
Episode: 411 Total reward: 11.0 Training loss: 4.8168 Explore P: 0.5059
Episode: 412 Total reward: 27.0 Training loss: 156.8561 Explore P: 0.5045
Episode: 413 Total reward: 14.0 Training loss: 2.3906 Explore P: 0.5038
Episode: 414 Total reward: 21.0 Training loss: 267.5691 Explore P: 0.5028
Episode: 415 Total reward: 11.0 Training loss: 5.2903 Explore P: 0.5023
Episode: 416 Total reward: 15.0 Training loss: 4.3663 Explore P: 0.5015
Episode: 417 Total reward: 10.0 Training loss: 243.4221 Explore P: 0.5010
Episode: 418 Total reward: 13.0 Training loss: 30.9696 Explore P: 0.5004
Episode: 419 Total reward: 14.0 Training loss: 47.1488 Explore P: 0.4997
Episode: 420 Total reward: 10.0 Training loss: 37.0084 Explore P: 0.4992
Episode: 421 Total reward: 9.0 Training loss: 167.2498 Explore P: 0.4988
Episode: 422 Total reward: 13.0 Training loss: 143.5638 Explore P: 0.4981
Episode: 423 Total reward: 14.0 Training loss: 176.3878 Explore P: 0.4975
Episode: 424 Total reward: 15.0 Training loss: 36.0116 Explore P: 0.4967
Episode: 425 Total reward: 11.0 Training loss: 122.6848 Explore P: 0.4962
Episode: 426 Total reward: 16.0 Training loss: 305.8708 Explore P: 0.4954
Episode: 427 Total reward: 11.0 Training loss: 86.0788 Explore P: 0.4949
Episode: 428 Total reward: 11.0 Training loss: 1.7877 Explore P: 0.4943
Episode: 429 Total reward: 11.0 Training loss: 1.9195 Explore P: 0.4938
Episode: 430 Total reward: 22.0 Training loss: 149.6266 Explore P: 0.4928
Episode: 431 Total reward: 9.0 Training loss: 2.9920 Explore P: 0.4923
Episode: 432 Total reward: 16.0 Training loss: 81.0064 Explore P: 0.4915
Episode: 433 Total reward: 32.0 Training loss: 3.6627 Explore P: 0.4900
Episode: 434 Total reward: 14.0 Training loss: 1.4875 Explore P: 0.4893
Episode: 435 Total reward: 12.0 Training loss: 129.9571 Explore P: 0.4888
Episode: 436 Total reward: 14.0 Training loss: 35.6207 Explore P: 0.4881
Episode: 437 Total reward: 11.0 Training loss: 27.4855 Explore P: 0.4876
Episode: 438 Total reward: 11.0 Training loss: 25.9465 Explore P: 0.4870
Episode: 439 Total reward: 9.0 Training loss: 147.8021 Explore P: 0.4866
Episode: 440 Total reward: 8.0 Training loss: 104.9937 Explore P: 0.4862
Episode: 441 Total reward: 15.0 Training loss: 130.6971 Explore P: 0.4855
Episode: 442 Total reward: 17.0 Training loss: 30.2268 Explore P: 0.4847
Episode: 443 Total reward: 10.0 Training loss: 76.8926 Explore P: 0.4842
Episode: 444 Total reward: 22.0 Training loss: 110.9827 Explore P: 0.4832
Episode: 445 Total reward: 33.0 Training loss: 1.3407 Explore P: 0.4816
Episode: 446 Total reward: 13.0 Training loss: 100.3155 Explore P: 0.4810
Episode: 447 Total reward: 18.0 Training loss: 29.0237 Explore P: 0.4802
Episode: 448 Total reward: 16.0 Training loss: 2.1154 Explore P: 0.4794
Episode: 449 Total reward: 9.0 Training loss: 25.3996 Explore P: 0.4790
Episode: 450 Total reward: 11.0 Training loss: 2.0962 Explore P: 0.4785
Episode: 451 Total reward: 35.0 Training loss: 3.5105 Explore P: 0.4768
Episode: 452 Total reward: 16.0 Training loss: 58.5793 Explore P: 0.4761
Episode: 453 Total reward: 9.0 Training loss: 92.8209 Explore P: 0.4757
Episode: 454 Total reward: 9.0 Training loss: 32.9521 Explore P: 0.4753
Episode: 455 Total reward: 14.0 Training loss: 1.6682 Explore P: 0.4746
Episode: 456 Total reward: 9.0 Training loss: 25.1808 Explore P: 0.4742
Episode: 457 Total reward: 10.0 Training loss: 70.7330 Explore P: 0.4737
Episode: 458 Total reward: 12.0 Training loss: 63.4716 Explore P: 0.4732
Episode: 459 Total reward: 15.0 Training loss: 1.2367 Explore P: 0.4725
Episode: 460 Total reward: 9.0 Training loss: 24.6579 Explore P: 0.4721
Episode: 461 Total reward: 9.0 Training loss: 1.4992 Explore P: 0.4716
Episode: 462 Total reward: 10.0 Training loss: 53.9337 Explore P: 0.4712
Episode: 463 Total reward: 14.0 Training loss: 80.9779 Explore P: 0.4705
Episode: 464 Total reward: 18.0 Training loss: 161.0876 Explore P: 0.4697
Episode: 465 Total reward: 13.0 Training loss: 77.2158 Explore P: 0.4691
Episode: 466 Total reward: 27.0 Training loss: 160.6956 Explore P: 0.4679
Episode: 467 Total reward: 8.0 Training loss: 136.2783 Explore P: 0.4675
Episode: 468 Total reward: 23.0 Training loss: 81.8569 Explore P: 0.4665
Episode: 469 Total reward: 13.0 Training loss: 60.8198 Explore P: 0.4659
Episode: 470 Total reward: 11.0 Training loss: 2.8282 Explore P: 0.4654
Episode: 471 Total reward: 12.0 Training loss: 25.5829 Explore P: 0.4648
Episode: 472 Total reward: 18.0 Training loss: 58.5623 Explore P: 0.4640
Episode: 473 Total reward: 13.0 Training loss: 23.7653 Explore P: 0.4634
Episode: 474 Total reward: 13.0 Training loss: 1.2669 Explore P: 0.4628
Episode: 475 Total reward: 13.0 Training loss: 22.5147 Explore P: 0.4622
Episode: 476 Total reward: 11.0 Training loss: 86.0896 Explore P: 0.4617
Episode: 477 Total reward: 21.0 Training loss: 26.7197 Explore P: 0.4608
Episode: 478 Total reward: 15.0 Training loss: 27.3056 Explore P: 0.4601
Episode: 479 Total reward: 10.0 Training loss: 42.7007 Explore P: 0.4597
Episode: 480 Total reward: 9.0 Training loss: 2.3646 Explore P: 0.4593
Episode: 481 Total reward: 13.0 Training loss: 103.7992 Explore P: 0.4587
Episode: 482 Total reward: 15.0 Training loss: 0.8924 Explore P: 0.4580
Episode: 483 Total reward: 12.0 Training loss: 76.0441 Explore P: 0.4575
Episode: 484 Total reward: 13.0 Training loss: 2.5648 Explore P: 0.4569
Episode: 485 Total reward: 14.0 Training loss: 1.7612 Explore P: 0.4563
Episode: 486 Total reward: 9.0 Training loss: 1.6032 Explore P: 0.4559
Episode: 487 Total reward: 11.0 Training loss: 1.5759 Explore P: 0.4554
Episode: 488 Total reward: 11.0 Training loss: 1.7155 Explore P: 0.4549
Episode: 489 Total reward: 10.0 Training loss: 54.4765 Explore P: 0.4544
Episode: 490 Total reward: 13.0 Training loss: 1.6958 Explore P: 0.4539
Episode: 491 Total reward: 22.0 Training loss: 50.9862 Explore P: 0.4529
Episode: 492 Total reward: 11.0 Training loss: 48.6795 Explore P: 0.4524
Episode: 493 Total reward: 13.0 Training loss: 70.6810 Explore P: 0.4518
Episode: 494 Total reward: 10.0 Training loss: 1.9887 Explore P: 0.4514
Episode: 495 Total reward: 12.0 Training loss: 52.0395 Explore P: 0.4509
Episode: 496 Total reward: 14.0 Training loss: 46.9323 Explore P: 0.4502
Episode: 497 Total reward: 10.0 Training loss: 1.0111 Explore P: 0.4498
Episode: 498 Total reward: 32.0 Training loss: 1.1930 Explore P: 0.4484
Episode: 499 Total reward: 16.0 Training loss: 52.3551 Explore P: 0.4477
Episode: 500 Total reward: 19.0 Training loss: 1.7930 Explore P: 0.4469
Episode: 501 Total reward: 21.0 Training loss: 0.8184 Explore P: 0.4459
Episode: 502 Total reward: 20.0 Training loss: 2.0589 Explore P: 0.4451
Episode: 503 Total reward: 48.0 Training loss: 59.6035 Explore P: 0.4430
Episode: 504 Total reward: 24.0 Training loss: 42.3264 Explore P: 0.4419
Episode: 505 Total reward: 12.0 Training loss: 0.7002 Explore P: 0.4414
Episode: 506 Total reward: 36.0 Training loss: 1.9021 Explore P: 0.4399
Episode: 507 Total reward: 51.0 Training loss: 37.0904 Explore P: 0.4377
Episode: 508 Total reward: 15.0 Training loss: 1.8662 Explore P: 0.4371
Episode: 509 Total reward: 19.0 Training loss: 59.3327 Explore P: 0.4362
Episode: 510 Total reward: 27.0 Training loss: 1.0567 Explore P: 0.4351
Episode: 511 Total reward: 26.0 Training loss: 69.3772 Explore P: 0.4340
Episode: 512 Total reward: 41.0 Training loss: 39.2655 Explore P: 0.4323
Episode: 513 Total reward: 52.0 Training loss: 36.9986 Explore P: 0.4301
Episode: 514 Total reward: 7.0 Training loss: 59.4629 Explore P: 0.4298
Episode: 515 Total reward: 25.0 Training loss: 1.6226 Explore P: 0.4287
Episode: 516 Total reward: 8.0 Training loss: 18.7216 Explore P: 0.4284
Episode: 517 Total reward: 42.0 Training loss: 110.5357 Explore P: 0.4266
Episode: 518 Total reward: 24.0 Training loss: 47.3693 Explore P: 0.4256
Episode: 519 Total reward: 48.0 Training loss: 44.3758 Explore P: 0.4236
Episode: 520 Total reward: 52.0 Training loss: 83.5137 Explore P: 0.4215
Episode: 521 Total reward: 45.0 Training loss: 46.8493 Explore P: 0.4197
Episode: 522 Total reward: 22.0 Training loss: 1.6227 Explore P: 0.4188
Episode: 523 Total reward: 32.0 Training loss: 18.9334 Explore P: 0.4174
Episode: 524 Total reward: 16.0 Training loss: 50.9475 Explore P: 0.4168
Episode: 525 Total reward: 18.0 Training loss: 66.1837 Explore P: 0.4161
Episode: 526 Total reward: 35.0 Training loss: 26.0606 Explore P: 0.4146
Episode: 527 Total reward: 57.0 Training loss: 18.9686 Explore P: 0.4123
Episode: 528 Total reward: 26.0 Training loss: 52.6546 Explore P: 0.4113
Episode: 529 Total reward: 60.0 Training loss: 23.8468 Explore P: 0.4089
Episode: 530 Total reward: 17.0 Training loss: 1.3976 Explore P: 0.4082
Episode: 531 Total reward: 23.0 Training loss: 0.8544 Explore P: 0.4073
Episode: 532 Total reward: 19.0 Training loss: 67.2248 Explore P: 0.4066
Episode: 533 Total reward: 36.0 Training loss: 69.9822 Explore P: 0.4051
Episode: 534 Total reward: 27.0 Training loss: 37.5253 Explore P: 0.4041
Episode: 535 Total reward: 19.0 Training loss: 25.3770 Explore P: 0.4033
Episode: 536 Total reward: 32.0 Training loss: 118.2690 Explore P: 0.4021
Episode: 537 Total reward: 23.0 Training loss: 22.3955 Explore P: 0.4012
Episode: 538 Total reward: 21.0 Training loss: 1.3389 Explore P: 0.4003
Episode: 539 Total reward: 23.0 Training loss: 59.0277 Explore P: 0.3994
Episode: 540 Total reward: 17.0 Training loss: 63.3389 Explore P: 0.3988
Episode: 541 Total reward: 19.0 Training loss: 1.0334 Explore P: 0.3980
Episode: 542 Total reward: 25.0 Training loss: 22.7347 Explore P: 0.3971
Episode: 543 Total reward: 15.0 Training loss: 42.2014 Explore P: 0.3965
Episode: 544 Total reward: 17.0 Training loss: 43.8780 Explore P: 0.3958
Episode: 545 Total reward: 20.0 Training loss: 43.2003 Explore P: 0.3951
Episode: 546 Total reward: 17.0 Training loss: 42.4390 Explore P: 0.3944
Episode: 547 Total reward: 11.0 Training loss: 1.3647 Explore P: 0.3940
Episode: 548 Total reward: 24.0 Training loss: 60.2016 Explore P: 0.3931
Episode: 549 Total reward: 19.0 Training loss: 1.4988 Explore P: 0.3923
Episode: 550 Total reward: 53.0 Training loss: 1.0123 Explore P: 0.3903
Episode: 551 Total reward: 25.0 Training loss: 1.1477 Explore P: 0.3894
Episode: 552 Total reward: 19.0 Training loss: 1.7442 Explore P: 0.3886
Episode: 553 Total reward: 22.0 Training loss: 53.1502 Explore P: 0.3878
Episode: 554 Total reward: 18.0 Training loss: 1.4440 Explore P: 0.3871
Episode: 555 Total reward: 22.0 Training loss: 17.9785 Explore P: 0.3863
Episode: 556 Total reward: 26.0 Training loss: 17.8806 Explore P: 0.3853
Episode: 557 Total reward: 34.0 Training loss: 18.6301 Explore P: 0.3841
Episode: 558 Total reward: 21.0 Training loss: 18.0181 Explore P: 0.3833
Episode: 559 Total reward: 24.0 Training loss: 15.2906 Explore P: 0.3824
Episode: 560 Total reward: 24.0 Training loss: 23.8079 Explore P: 0.3815
Episode: 561 Total reward: 29.0 Training loss: 13.7452 Explore P: 0.3804
Episode: 562 Total reward: 19.0 Training loss: 35.6049 Explore P: 0.3797
Episode: 563 Total reward: 15.0 Training loss: 50.3544 Explore P: 0.3792
Episode: 564 Total reward: 19.0 Training loss: 38.8341 Explore P: 0.3784
Episode: 565 Total reward: 26.0 Training loss: 1.0228 Explore P: 0.3775
Episode: 566 Total reward: 31.0 Training loss: 55.9333 Explore P: 0.3764
Episode: 567 Total reward: 17.0 Training loss: 1.1589 Explore P: 0.3757
Episode: 568 Total reward: 20.0 Training loss: 17.3674 Explore P: 0.3750
Episode: 569 Total reward: 18.0 Training loss: 25.7730 Explore P: 0.3743
Episode: 570 Total reward: 13.0 Training loss: 1.7464 Explore P: 0.3739
Episode: 571 Total reward: 22.0 Training loss: 1.3465 Explore P: 0.3731
Episode: 572 Total reward: 14.0 Training loss: 18.8442 Explore P: 0.3726
Episode: 573 Total reward: 41.0 Training loss: 1.2352 Explore P: 0.3711
Episode: 574 Total reward: 17.0 Training loss: 1.2769 Explore P: 0.3705
Episode: 575 Total reward: 22.0 Training loss: 72.4087 Explore P: 0.3697
Episode: 576 Total reward: 16.0 Training loss: 1.2576 Explore P: 0.3691
Episode: 577 Total reward: 20.0 Training loss: 50.2282 Explore P: 0.3684
Episode: 578 Total reward: 26.0 Training loss: 1.1699 Explore P: 0.3675
Episode: 579 Total reward: 18.0 Training loss: 2.1392 Explore P: 0.3668
Episode: 580 Total reward: 24.0 Training loss: 1.2998 Explore P: 0.3660
Episode: 581 Total reward: 17.0 Training loss: 32.0099 Explore P: 0.3654
Episode: 582 Total reward: 15.0 Training loss: 47.9834 Explore P: 0.3648
Episode: 583 Total reward: 25.0 Training loss: 51.8507 Explore P: 0.3639
Episode: 584 Total reward: 24.0 Training loss: 31.2073 Explore P: 0.3631
Episode: 585 Total reward: 14.0 Training loss: 46.4814 Explore P: 0.3626
Episode: 586 Total reward: 20.0 Training loss: 1.5862 Explore P: 0.3619
Episode: 587 Total reward: 16.0 Training loss: 15.7063 Explore P: 0.3613
Episode: 588 Total reward: 15.0 Training loss: 15.8918 Explore P: 0.3608
Episode: 589 Total reward: 18.0 Training loss: 1.6494 Explore P: 0.3602
Episode: 590 Total reward: 17.0 Training loss: 22.3143 Explore P: 0.3596
Episode: 591 Total reward: 17.0 Training loss: 14.5927 Explore P: 0.3590
Episode: 592 Total reward: 29.0 Training loss: 1.7442 Explore P: 0.3580
Episode: 593 Total reward: 14.0 Training loss: 26.9850 Explore P: 0.3575
Episode: 594 Total reward: 16.0 Training loss: 28.2197 Explore P: 0.3569
Episode: 595 Total reward: 19.0 Training loss: 1.6336 Explore P: 0.3563
Episode: 596 Total reward: 15.0 Training loss: 17.6036 Explore P: 0.3557
Episode: 597 Total reward: 43.0 Training loss: 26.3213 Explore P: 0.3543
Episode: 598 Total reward: 21.0 Training loss: 2.1855 Explore P: 0.3535
Episode: 599 Total reward: 19.0 Training loss: 1.9711 Explore P: 0.3529
Episode: 600 Total reward: 15.0 Training loss: 14.6893 Explore P: 0.3524
Episode: 601 Total reward: 10.0 Training loss: 14.5980 Explore P: 0.3520
Episode: 602 Total reward: 12.0 Training loss: 35.2775 Explore P: 0.3516
Episode: 603 Total reward: 20.0 Training loss: 28.7336 Explore P: 0.3509
Episode: 604 Total reward: 16.0 Training loss: 35.9290 Explore P: 0.3504
Episode: 605 Total reward: 22.0 Training loss: 39.6673 Explore P: 0.3496
Episode: 606 Total reward: 21.0 Training loss: 1.1285 Explore P: 0.3489
Episode: 607 Total reward: 20.0 Training loss: 1.8414 Explore P: 0.3483
Episode: 608 Total reward: 18.0 Training loss: 17.5987 Explore P: 0.3476
Episode: 609 Total reward: 16.0 Training loss: 32.9972 Explore P: 0.3471
Episode: 610 Total reward: 25.0 Training loss: 16.6002 Explore P: 0.3463
Episode: 611 Total reward: 18.0 Training loss: 14.0257 Explore P: 0.3457
Episode: 612 Total reward: 18.0 Training loss: 12.4284 Explore P: 0.3451
Episode: 613 Total reward: 21.0 Training loss: 1.4310 Explore P: 0.3444
Episode: 614 Total reward: 28.0 Training loss: 29.7930 Explore P: 0.3434
Episode: 615 Total reward: 21.0 Training loss: 2.1540 Explore P: 0.3427
Episode: 616 Total reward: 19.0 Training loss: 11.8957 Explore P: 0.3421
Episode: 617 Total reward: 12.0 Training loss: 13.0865 Explore P: 0.3417
Episode: 618 Total reward: 17.0 Training loss: 16.3770 Explore P: 0.3411
Episode: 619 Total reward: 20.0 Training loss: 17.7443 Explore P: 0.3405
Episode: 620 Total reward: 25.0 Training loss: 22.0196 Explore P: 0.3396
Episode: 621 Total reward: 24.0 Training loss: 26.7169 Explore P: 0.3389
Episode: 622 Total reward: 21.0 Training loss: 17.0872 Explore P: 0.3382
Episode: 623 Total reward: 18.0 Training loss: 18.7493 Explore P: 0.3376
Episode: 624 Total reward: 19.0 Training loss: 19.5283 Explore P: 0.3369
Episode: 625 Total reward: 21.0 Training loss: 0.9885 Explore P: 0.3363
Episode: 626 Total reward: 29.0 Training loss: 31.4809 Explore P: 0.3353
Episode: 627 Total reward: 20.0 Training loss: 62.2083 Explore P: 0.3347
Episode: 628 Total reward: 27.0 Training loss: 14.8126 Explore P: 0.3338
Episode: 629 Total reward: 27.0 Training loss: 23.9413 Explore P: 0.3329
Episode: 630 Total reward: 18.0 Training loss: 45.5570 Explore P: 0.3323
Episode: 631 Total reward: 21.0 Training loss: 1.8272 Explore P: 0.3317
Episode: 632 Total reward: 16.0 Training loss: 12.8653 Explore P: 0.3311
Episode: 633 Total reward: 19.0 Training loss: 12.5386 Explore P: 0.3305
Episode: 634 Total reward: 22.0 Training loss: 1.7979 Explore P: 0.3298
Episode: 635 Total reward: 23.0 Training loss: 62.1245 Explore P: 0.3291
Episode: 636 Total reward: 17.0 Training loss: 29.7350 Explore P: 0.3286
Episode: 637 Total reward: 13.0 Training loss: 1.7507 Explore P: 0.3281
Episode: 638 Total reward: 26.0 Training loss: 2.1978 Explore P: 0.3273
Episode: 639 Total reward: 24.0 Training loss: 1.9452 Explore P: 0.3266
Episode: 640 Total reward: 23.0 Training loss: 35.8989 Explore P: 0.3258
Episode: 641 Total reward: 16.0 Training loss: 37.0134 Explore P: 0.3253
Episode: 642 Total reward: 32.0 Training loss: 13.5314 Explore P: 0.3243
Episode: 643 Total reward: 18.0 Training loss: 35.4613 Explore P: 0.3238
Episode: 644 Total reward: 24.0 Training loss: 13.6842 Explore P: 0.3230
Episode: 645 Total reward: 20.0 Training loss: 14.3947 Explore P: 0.3224
Episode: 646 Total reward: 19.0 Training loss: 34.2863 Explore P: 0.3218
Episode: 647 Total reward: 24.0 Training loss: 30.1938 Explore P: 0.3210
Episode: 648 Total reward: 17.0 Training loss: 17.2096 Explore P: 0.3205
Episode: 649 Total reward: 27.0 Training loss: 15.8344 Explore P: 0.3197
Episode: 650 Total reward: 15.0 Training loss: 1.5648 Explore P: 0.3192
Episode: 651 Total reward: 26.0 Training loss: 36.3581 Explore P: 0.3184
Episode: 652 Total reward: 17.0 Training loss: 18.5584 Explore P: 0.3179
Episode: 653 Total reward: 25.0 Training loss: 1.1819 Explore P: 0.3171
Episode: 654 Total reward: 17.0 Training loss: 13.4275 Explore P: 0.3166
Episode: 655 Total reward: 22.0 Training loss: 42.1334 Explore P: 0.3159
Episode: 656 Total reward: 24.0 Training loss: 33.1622 Explore P: 0.3152
Episode: 657 Total reward: 20.0 Training loss: 1.8370 Explore P: 0.3146
Episode: 658 Total reward: 26.0 Training loss: 14.5950 Explore P: 0.3138
Episode: 659 Total reward: 27.0 Training loss: 12.8420 Explore P: 0.3130
Episode: 660 Total reward: 26.0 Training loss: 21.8282 Explore P: 0.3122
Episode: 661 Total reward: 17.0 Training loss: 16.9570 Explore P: 0.3117
Episode: 662 Total reward: 20.0 Training loss: 37.0740 Explore P: 0.3111
Episode: 663 Total reward: 33.0 Training loss: 33.5082 Explore P: 0.3101
Episode: 664 Total reward: 22.0 Training loss: 16.5312 Explore P: 0.3094
Episode: 665 Total reward: 19.0 Training loss: 13.6077 Explore P: 0.3088
Episode: 666 Total reward: 36.0 Training loss: 44.5663 Explore P: 0.3078
Episode: 667 Total reward: 18.0 Training loss: 1.5546 Explore P: 0.3072
Episode: 668 Total reward: 26.0 Training loss: 2.2223 Explore P: 0.3065
Episode: 669 Total reward: 24.0 Training loss: 33.6925 Explore P: 0.3057
Episode: 670 Total reward: 18.0 Training loss: 2.3399 Explore P: 0.3052
Episode: 671 Total reward: 26.0 Training loss: 26.6113 Explore P: 0.3044
Episode: 672 Total reward: 27.0 Training loss: 22.8833 Explore P: 0.3037
Episode: 673 Total reward: 21.0 Training loss: 1.7737 Explore P: 0.3030
Episode: 674 Total reward: 27.0 Training loss: 15.3553 Explore P: 0.3022
Episode: 675 Total reward: 39.0 Training loss: 37.5223 Explore P: 0.3011
Episode: 676 Total reward: 18.0 Training loss: 38.9573 Explore P: 0.3006
Episode: 677 Total reward: 33.0 Training loss: 48.8197 Explore P: 0.2996
Episode: 678 Total reward: 24.0 Training loss: 14.2282 Explore P: 0.2989
Episode: 679 Total reward: 31.0 Training loss: 19.4401 Explore P: 0.2980
Episode: 680 Total reward: 27.0 Training loss: 12.5953 Explore P: 0.2973
Episode: 681 Total reward: 18.0 Training loss: 30.5212 Explore P: 0.2967
Episode: 682 Total reward: 37.0 Training loss: 12.5600 Explore P: 0.2957
Episode: 683 Total reward: 29.0 Training loss: 12.0794 Explore P: 0.2949
Episode: 684 Total reward: 41.0 Training loss: 37.9058 Explore P: 0.2937
Episode: 685 Total reward: 29.0 Training loss: 13.0553 Explore P: 0.2929
Episode: 686 Total reward: 25.0 Training loss: 10.4583 Explore P: 0.2922
Episode: 687 Total reward: 32.0 Training loss: 1.7663 Explore P: 0.2913
Episode: 688 Total reward: 36.0 Training loss: 11.4921 Explore P: 0.2903
Episode: 689 Total reward: 26.0 Training loss: 24.0411 Explore P: 0.2895
Episode: 690 Total reward: 23.0 Training loss: 1.8664 Explore P: 0.2889
Episode: 691 Total reward: 15.0 Training loss: 19.5617 Explore P: 0.2885
Episode: 692 Total reward: 20.0 Training loss: 38.2688 Explore P: 0.2879
Episode: 693 Total reward: 29.0 Training loss: 28.7784 Explore P: 0.2871
Episode: 694 Total reward: 27.0 Training loss: 1.8644 Explore P: 0.2864
Episode: 695 Total reward: 18.0 Training loss: 18.3993 Explore P: 0.2859
Episode: 696 Total reward: 21.0 Training loss: 10.5064 Explore P: 0.2853
Episode: 697 Total reward: 16.0 Training loss: 1.9630 Explore P: 0.2848
Episode: 698 Total reward: 21.0 Training loss: 1.8284 Explore P: 0.2843
Episode: 699 Total reward: 20.0 Training loss: 27.3245 Explore P: 0.2837
Episode: 700 Total reward: 29.0 Training loss: 2.2797 Explore P: 0.2829
Episode: 701 Total reward: 16.0 Training loss: 26.9969 Explore P: 0.2825
Episode: 702 Total reward: 25.0 Training loss: 19.0728 Explore P: 0.2818
Episode: 703 Total reward: 16.0 Training loss: 20.8745 Explore P: 0.2814
Episode: 704 Total reward: 18.0 Training loss: 20.4065 Explore P: 0.2809
Episode: 705 Total reward: 19.0 Training loss: 21.2243 Explore P: 0.2804
Episode: 706 Total reward: 16.0 Training loss: 2.0775 Explore P: 0.2799
Episode: 707 Total reward: 16.0 Training loss: 1.7367 Explore P: 0.2795
Episode: 708 Total reward: 13.0 Training loss: 2.2948 Explore P: 0.2792
Episode: 709 Total reward: 20.0 Training loss: 14.9405 Explore P: 0.2786
Episode: 710 Total reward: 25.0 Training loss: 1.4793 Explore P: 0.2780
Episode: 711 Total reward: 16.0 Training loss: 37.2686 Explore P: 0.2775
Episode: 712 Total reward: 23.0 Training loss: 33.6244 Explore P: 0.2769
Episode: 713 Total reward: 22.0 Training loss: 32.0614 Explore P: 0.2763
Episode: 714 Total reward: 13.0 Training loss: 11.7230 Explore P: 0.2760
Episode: 715 Total reward: 38.0 Training loss: 20.4808 Explore P: 0.2750
Episode: 716 Total reward: 17.0 Training loss: 53.6520 Explore P: 0.2745
Episode: 717 Total reward: 24.0 Training loss: 24.0791 Explore P: 0.2739
Episode: 718 Total reward: 15.0 Training loss: 1.8089 Explore P: 0.2735
Episode: 719 Total reward: 23.0 Training loss: 31.6596 Explore P: 0.2729
Episode: 720 Total reward: 23.0 Training loss: 1.6994 Explore P: 0.2723
Episode: 721 Total reward: 23.0 Training loss: 22.4184 Explore P: 0.2717
Episode: 722 Total reward: 20.0 Training loss: 1.9713 Explore P: 0.2712
Episode: 723 Total reward: 18.0 Training loss: 43.3923 Explore P: 0.2707
Episode: 724 Total reward: 35.0 Training loss: 65.1941 Explore P: 0.2698
Episode: 725 Total reward: 28.0 Training loss: 50.1570 Explore P: 0.2690
Episode: 726 Total reward: 28.0 Training loss: 32.8347 Explore P: 0.2683
Episode: 727 Total reward: 29.0 Training loss: 20.7505 Explore P: 0.2676
Episode: 728 Total reward: 34.0 Training loss: 25.2606 Explore P: 0.2667
Episode: 729 Total reward: 27.0 Training loss: 35.2319 Explore P: 0.2660
Episode: 730 Total reward: 18.0 Training loss: 1.6823 Explore P: 0.2655
Episode: 731 Total reward: 15.0 Training loss: 22.2978 Explore P: 0.2652
Episode: 732 Total reward: 19.0 Training loss: 16.7197 Explore P: 0.2647
Episode: 733 Total reward: 30.0 Training loss: 35.8051 Explore P: 0.2639
Episode: 734 Total reward: 31.0 Training loss: 1.7307 Explore P: 0.2631
Episode: 735 Total reward: 28.0 Training loss: 1.4637 Explore P: 0.2624
Episode: 736 Total reward: 24.0 Training loss: 12.3057 Explore P: 0.2618
Episode: 737 Total reward: 23.0 Training loss: 16.2029 Explore P: 0.2612
Episode: 738 Total reward: 22.0 Training loss: 51.9970 Explore P: 0.2607
Episode: 739 Total reward: 28.0 Training loss: 57.8088 Explore P: 0.2600
Episode: 740 Total reward: 28.0 Training loss: 33.2181 Explore P: 0.2593
Episode: 741 Total reward: 34.0 Training loss: 30.4247 Explore P: 0.2584
Episode: 742 Total reward: 20.0 Training loss: 44.4929 Explore P: 0.2579
Episode: 743 Total reward: 31.0 Training loss: 23.2682 Explore P: 0.2572
Episode: 744 Total reward: 32.0 Training loss: 0.9635 Explore P: 0.2564
Episode: 745 Total reward: 24.0 Training loss: 39.4390 Explore P: 0.2558
Episode: 746 Total reward: 31.0 Training loss: 14.4840 Explore P: 0.2550
Episode: 747 Total reward: 29.0 Training loss: 50.7362 Explore P: 0.2543
Episode: 748 Total reward: 21.0 Training loss: 21.2310 Explore P: 0.2538
Episode: 749 Total reward: 22.0 Training loss: 15.8453 Explore P: 0.2533
Episode: 750 Total reward: 20.0 Training loss: 20.7925 Explore P: 0.2528
Episode: 751 Total reward: 18.0 Training loss: 41.0768 Explore P: 0.2524
Episode: 752 Total reward: 36.0 Training loss: 36.4878 Explore P: 0.2515
Episode: 753 Total reward: 23.0 Training loss: 30.9428 Explore P: 0.2509
Episode: 754 Total reward: 41.0 Training loss: 23.6438 Explore P: 0.2499
Episode: 755 Total reward: 32.0 Training loss: 34.8364 Explore P: 0.2492
Episode: 756 Total reward: 27.0 Training loss: 2.6238 Explore P: 0.2485
Episode: 757 Total reward: 38.0 Training loss: 3.4941 Explore P: 0.2476
Episode: 758 Total reward: 20.0 Training loss: 2.1313 Explore P: 0.2472
Episode: 759 Total reward: 33.0 Training loss: 2.1583 Explore P: 0.2464
Episode: 760 Total reward: 47.0 Training loss: 21.4127 Explore P: 0.2453
Episode: 761 Total reward: 36.0 Training loss: 35.4438 Explore P: 0.2444
Episode: 762 Total reward: 35.0 Training loss: 2.5081 Explore P: 0.2436
Episode: 763 Total reward: 44.0 Training loss: 22.4208 Explore P: 0.2426
Episode: 764 Total reward: 47.0 Training loss: 1.3513 Explore P: 0.2415
Episode: 765 Total reward: 42.0 Training loss: 2.5675 Explore P: 0.2405
Episode: 766 Total reward: 27.0 Training loss: 1.3236 Explore P: 0.2399
Episode: 767 Total reward: 26.0 Training loss: 37.6401 Explore P: 0.2393
Episode: 768 Total reward: 21.0 Training loss: 2.1596 Explore P: 0.2388
Episode: 769 Total reward: 20.0 Training loss: 1.1591 Explore P: 0.2384
Episode: 770 Total reward: 18.0 Training loss: 12.8392 Explore P: 0.2379
Episode: 771 Total reward: 21.0 Training loss: 39.4694 Explore P: 0.2375
Episode: 772 Total reward: 17.0 Training loss: 29.8259 Explore P: 0.2371
Episode: 773 Total reward: 21.0 Training loss: 2.8776 Explore P: 0.2366
Episode: 774 Total reward: 17.0 Training loss: 23.5343 Explore P: 0.2362
Episode: 775 Total reward: 15.0 Training loss: 36.7789 Explore P: 0.2359
Episode: 776 Total reward: 20.0 Training loss: 33.5064 Explore P: 0.2354
Episode: 777 Total reward: 17.0 Training loss: 45.5135 Explore P: 0.2350
Episode: 778 Total reward: 25.0 Training loss: 1.4325 Explore P: 0.2345
Episode: 779 Total reward: 28.0 Training loss: 40.1476 Explore P: 0.2339
Episode: 780 Total reward: 18.0 Training loss: 22.3480 Explore P: 0.2335
Episode: 781 Total reward: 19.0 Training loss: 1.3658 Explore P: 0.2330
Episode: 782 Total reward: 23.0 Training loss: 1.9773 Explore P: 0.2325
Episode: 783 Total reward: 25.0 Training loss: 25.0460 Explore P: 0.2320
Episode: 784 Total reward: 24.0 Training loss: 2.2842 Explore P: 0.2314
Episode: 785 Total reward: 19.0 Training loss: 1.4259 Explore P: 0.2310
Episode: 786 Total reward: 23.0 Training loss: 22.5368 Explore P: 0.2305
Episode: 787 Total reward: 18.0 Training loss: 1.5761 Explore P: 0.2301
Episode: 788 Total reward: 19.0 Training loss: 2.1508 Explore P: 0.2297
Episode: 789 Total reward: 24.0 Training loss: 0.6809 Explore P: 0.2292
Episode: 790 Total reward: 21.0 Training loss: 22.6215 Explore P: 0.2287
Episode: 791 Total reward: 21.0 Training loss: 2.4681 Explore P: 0.2282
Episode: 792 Total reward: 43.0 Training loss: 3.2375 Explore P: 0.2273
Episode: 793 Total reward: 27.0 Training loss: 19.1006 Explore P: 0.2267
Episode: 794 Total reward: 50.0 Training loss: 40.1865 Explore P: 0.2256
Episode: 795 Total reward: 30.0 Training loss: 1.1850 Explore P: 0.2250
Episode: 796 Total reward: 29.0 Training loss: 19.5229 Explore P: 0.2244
Episode: 797 Total reward: 35.0 Training loss: 46.5944 Explore P: 0.2236
Episode: 798 Total reward: 26.0 Training loss: 61.9751 Explore P: 0.2231
Episode: 799 Total reward: 39.0 Training loss: 1.3313 Explore P: 0.2222
Episode: 800 Total reward: 31.0 Training loss: 0.4808 Explore P: 0.2216
Episode: 801 Total reward: 29.0 Training loss: 26.9866 Explore P: 0.2210
Episode: 802 Total reward: 28.0 Training loss: 44.2159 Explore P: 0.2204
Episode: 803 Total reward: 33.0 Training loss: 2.5670 Explore P: 0.2197
Episode: 804 Total reward: 47.0 Training loss: 28.8208 Explore P: 0.2187
Episode: 805 Total reward: 34.0 Training loss: 21.1968 Explore P: 0.2180
Episode: 806 Total reward: 49.0 Training loss: 1.1453 Explore P: 0.2170
Episode: 807 Total reward: 27.0 Training loss: 1.0948 Explore P: 0.2164
Episode: 808 Total reward: 26.0 Training loss: 21.9196 Explore P: 0.2159
Episode: 809 Total reward: 25.0 Training loss: 0.6520 Explore P: 0.2154
Episode: 810 Total reward: 35.0 Training loss: 17.9688 Explore P: 0.2147
Episode: 811 Total reward: 26.0 Training loss: 18.2446 Explore P: 0.2141
Episode: 812 Total reward: 49.0 Training loss: 12.5480 Explore P: 0.2131
Episode: 813 Total reward: 25.0 Training loss: 56.8996 Explore P: 0.2126
Episode: 814 Total reward: 22.0 Training loss: 24.0264 Explore P: 0.2122
Episode: 815 Total reward: 36.0 Training loss: 1.7117 Explore P: 0.2114
Episode: 816 Total reward: 29.0 Training loss: 0.9000 Explore P: 0.2109
Episode: 817 Total reward: 17.0 Training loss: 26.5903 Explore P: 0.2105
Episode: 818 Total reward: 34.0 Training loss: 12.4661 Explore P: 0.2098
Episode: 819 Total reward: 29.0 Training loss: 1.6859 Explore P: 0.2093
Episode: 820 Total reward: 31.0 Training loss: 21.5388 Explore P: 0.2086
Episode: 821 Total reward: 34.0 Training loss: 26.2270 Explore P: 0.2080
Episode: 822 Total reward: 49.0 Training loss: 31.2804 Explore P: 0.2070
Episode: 823 Total reward: 84.0 Training loss: 0.9057 Explore P: 0.2054
Episode: 824 Total reward: 31.0 Training loss: 5.1741 Explore P: 0.2047
Episode: 825 Total reward: 33.0 Training loss: 36.1686 Explore P: 0.2041
Episode: 826 Total reward: 45.0 Training loss: 19.7224 Explore P: 0.2032
Episode: 827 Total reward: 52.0 Training loss: 14.4222 Explore P: 0.2022
Episode: 828 Total reward: 84.0 Training loss: 0.7655 Explore P: 0.2006
Episode: 829 Total reward: 104.0 Training loss: 22.4926 Explore P: 0.1987
Episode: 830 Total reward: 82.0 Training loss: 13.8936 Explore P: 0.1971
Episode: 831 Total reward: 97.0 Training loss: 1.2262 Explore P: 0.1953
Episode: 832 Total reward: 150.0 Training loss: 14.1760 Explore P: 0.1925
Episode: 833 Total reward: 111.0 Training loss: 18.4774 Explore P: 0.1905
Episode: 834 Total reward: 57.0 Training loss: 0.7088 Explore P: 0.1895
Episode: 836 Total reward: 2.0 Training loss: 32.1611 Explore P: 0.1859
Episode: 837 Total reward: 183.0 Training loss: 16.2897 Explore P: 0.1827
Episode: 838 Total reward: 199.0 Training loss: 0.7189 Explore P: 0.1793
Episode: 839 Total reward: 125.0 Training loss: 1.0318 Explore P: 0.1772
Episode: 840 Total reward: 196.0 Training loss: 14.6790 Explore P: 0.1740
Episode: 841 Total reward: 170.0 Training loss: 12.0185 Explore P: 0.1712
Episode: 842 Total reward: 138.0 Training loss: 0.4832 Explore P: 0.1690
Episode: 843 Total reward: 139.0 Training loss: 29.1615 Explore P: 0.1668
Episode: 844 Total reward: 142.0 Training loss: 0.6331 Explore P: 0.1646
Episode: 845 Total reward: 138.0 Training loss: 14.6433 Explore P: 0.1625
Episode: 846 Total reward: 18.0 Training loss: 0.4949 Explore P: 0.1622
Episode: 847 Total reward: 97.0 Training loss: 0.8922 Explore P: 0.1607
Episode: 848 Total reward: 56.0 Training loss: 15.0335 Explore P: 0.1599
Episode: 849 Total reward: 41.0 Training loss: 0.5737 Explore P: 0.1593
Episode: 850 Total reward: 32.0 Training loss: 0.9207 Explore P: 0.1588
Episode: 851 Total reward: 23.0 Training loss: 0.8286 Explore P: 0.1585
Episode: 852 Total reward: 31.0 Training loss: 1.1070 Explore P: 0.1580
Episode: 853 Total reward: 29.0 Training loss: 18.8364 Explore P: 0.1576
Episode: 854 Total reward: 38.0 Training loss: 15.2359 Explore P: 0.1570
Episode: 855 Total reward: 21.0 Training loss: 23.7219 Explore P: 0.1567
Episode: 856 Total reward: 41.0 Training loss: 16.5359 Explore P: 0.1561
Episode: 857 Total reward: 36.0 Training loss: 16.4368 Explore P: 0.1556
Episode: 858 Total reward: 26.0 Training loss: 1.1428 Explore P: 0.1552
Episode: 859 Total reward: 19.0 Training loss: 1.0721 Explore P: 0.1549
Episode: 860 Total reward: 18.0 Training loss: 21.4183 Explore P: 0.1547
Episode: 861 Total reward: 19.0 Training loss: 16.9860 Explore P: 0.1544
Episode: 862 Total reward: 16.0 Training loss: 20.0951 Explore P: 0.1542
Episode: 863 Total reward: 23.0 Training loss: 46.4651 Explore P: 0.1538
Episode: 864 Total reward: 29.0 Training loss: 38.0354 Explore P: 0.1534
Episode: 865 Total reward: 22.0 Training loss: 21.6426 Explore P: 0.1531
Episode: 866 Total reward: 16.0 Training loss: 1.3308 Explore P: 0.1529
Episode: 867 Total reward: 36.0 Training loss: 14.6124 Explore P: 0.1524
Episode: 868 Total reward: 38.0 Training loss: 2.2404 Explore P: 0.1518
Episode: 869 Total reward: 25.0 Training loss: 56.1150 Explore P: 0.1515
Episode: 870 Total reward: 20.0 Training loss: 80.7461 Explore P: 0.1512
Episode: 871 Total reward: 18.0 Training loss: 1.0519 Explore P: 0.1509
Episode: 872 Total reward: 18.0 Training loss: 1.3266 Explore P: 0.1507
Episode: 873 Total reward: 30.0 Training loss: 12.4855 Explore P: 0.1502
Episode: 874 Total reward: 27.0 Training loss: 28.9670 Explore P: 0.1499
Episode: 875 Total reward: 18.0 Training loss: 1.0440 Explore P: 0.1496
Episode: 876 Total reward: 19.0 Training loss: 11.4867 Explore P: 0.1494
Episode: 877 Total reward: 25.0 Training loss: 1.4191 Explore P: 0.1490
Episode: 878 Total reward: 13.0 Training loss: 1.8280 Explore P: 0.1488
Episode: 879 Total reward: 19.0 Training loss: 1.6492 Explore P: 0.1486
Episode: 880 Total reward: 18.0 Training loss: 1.6710 Explore P: 0.1483
Episode: 881 Total reward: 21.0 Training loss: 2.3961 Explore P: 0.1480
Episode: 882 Total reward: 19.0 Training loss: 18.9070 Explore P: 0.1478
Episode: 883 Total reward: 13.0 Training loss: 2.6808 Explore P: 0.1476
Episode: 884 Total reward: 13.0 Training loss: 59.1691 Explore P: 0.1474
Episode: 885 Total reward: 19.0 Training loss: 1.7372 Explore P: 0.1471
Episode: 886 Total reward: 17.0 Training loss: 36.3027 Explore P: 0.1469
Episode: 887 Total reward: 19.0 Training loss: 1.1371 Explore P: 0.1466
Episode: 888 Total reward: 24.0 Training loss: 3.7757 Explore P: 0.1463
Episode: 889 Total reward: 23.0 Training loss: 91.8193 Explore P: 0.1460
Episode: 890 Total reward: 17.0 Training loss: 19.6242 Explore P: 0.1458
Episode: 891 Total reward: 21.0 Training loss: 24.0443 Explore P: 0.1455
Episode: 892 Total reward: 15.0 Training loss: 24.5820 Explore P: 0.1453
Episode: 893 Total reward: 15.0 Training loss: 2.7756 Explore P: 0.1451
Episode: 894 Total reward: 17.0 Training loss: 2.6152 Explore P: 0.1449
Episode: 895 Total reward: 26.0 Training loss: 18.4040 Explore P: 0.1445
Episode: 896 Total reward: 22.0 Training loss: 1.6110 Explore P: 0.1442
Episode: 897 Total reward: 16.0 Training loss: 1.7658 Explore P: 0.1440
Episode: 898 Total reward: 22.0 Training loss: 1.2601 Explore P: 0.1437
Episode: 899 Total reward: 34.0 Training loss: 22.6562 Explore P: 0.1432
Episode: 900 Total reward: 30.0 Training loss: 21.8743 Explore P: 0.1428
Episode: 901 Total reward: 33.0 Training loss: 1.0008 Explore P: 0.1424
Episode: 902 Total reward: 18.0 Training loss: 1.0024 Explore P: 0.1422
Episode: 903 Total reward: 19.0 Training loss: 2.0848 Explore P: 0.1419
Episode: 904 Total reward: 24.0 Training loss: 2.9288 Explore P: 0.1416
Episode: 905 Total reward: 21.0 Training loss: 1.5359 Explore P: 0.1413
Episode: 906 Total reward: 30.0 Training loss: 12.0244 Explore P: 0.1409
Episode: 907 Total reward: 20.0 Training loss: 159.1610 Explore P: 0.1407
Episode: 908 Total reward: 26.0 Training loss: 2.1205 Explore P: 0.1403
Episode: 909 Total reward: 17.0 Training loss: 15.5954 Explore P: 0.1401
Episode: 910 Total reward: 16.0 Training loss: 22.4336 Explore P: 0.1399
Episode: 911 Total reward: 19.0 Training loss: 1.7750 Explore P: 0.1397
Episode: 912 Total reward: 20.0 Training loss: 131.3284 Explore P: 0.1394
Episode: 913 Total reward: 24.0 Training loss: 1.8948 Explore P: 0.1391
Episode: 914 Total reward: 29.0 Training loss: 2.0523 Explore P: 0.1387
Episode: 915 Total reward: 24.0 Training loss: 2.1779 Explore P: 0.1384
Episode: 916 Total reward: 31.0 Training loss: 3.0125 Explore P: 0.1380
Episode: 917 Total reward: 23.0 Training loss: 3.5614 Explore P: 0.1377
Episode: 918 Total reward: 28.0 Training loss: 2.2340 Explore P: 0.1374
Episode: 919 Total reward: 37.0 Training loss: 1.1904 Explore P: 0.1369
Episode: 920 Total reward: 40.0 Training loss: 1.2616 Explore P: 0.1364
Episode: 921 Total reward: 43.0 Training loss: 0.8278 Explore P: 0.1358
Episode: 922 Total reward: 44.0 Training loss: 14.9794 Explore P: 0.1353
Episode: 923 Total reward: 34.0 Training loss: 3.7697 Explore P: 0.1349
Episode: 924 Total reward: 31.0 Training loss: 2.6347 Explore P: 0.1345
Episode: 925 Total reward: 33.0 Training loss: 15.3097 Explore P: 0.1341
Episode: 926 Total reward: 50.0 Training loss: 3.6046 Explore P: 0.1334
Episode: 927 Total reward: 38.0 Training loss: 16.8126 Explore P: 0.1330
Episode: 928 Total reward: 37.0 Training loss: 0.9605 Explore P: 0.1325
Episode: 929 Total reward: 79.0 Training loss: 2.9432 Explore P: 0.1316
Episode: 931 Total reward: 16.0 Training loss: 1.1470 Explore P: 0.1290
Episode: 932 Total reward: 83.0 Training loss: 2.0115 Explore P: 0.1280
Episode: 933 Total reward: 167.0 Training loss: 21.6101 Explore P: 0.1260
Episode: 934 Total reward: 49.0 Training loss: 1.2986 Explore P: 0.1255
Episode: 935 Total reward: 74.0 Training loss: 1.3522 Explore P: 0.1246
Episode: 936 Total reward: 106.0 Training loss: 2.1100 Explore P: 0.1234
Episode: 937 Total reward: 170.0 Training loss: 102.9330 Explore P: 0.1215
Episode: 939 Total reward: 88.0 Training loss: 1.6688 Explore P: 0.1183
Episode: 940 Total reward: 132.0 Training loss: 1.3766 Explore P: 0.1169
Episode: 941 Total reward: 121.0 Training loss: 4.0064 Explore P: 0.1156
Episode: 942 Total reward: 95.0 Training loss: 2.5706 Explore P: 0.1146
Episode: 944 Total reward: 98.0 Training loss: 19.4265 Explore P: 0.1115
Episode: 945 Total reward: 163.0 Training loss: 2.0109 Explore P: 0.1099
Episode: 946 Total reward: 190.0 Training loss: 0.8350 Explore P: 0.1080
Episode: 947 Total reward: 170.0 Training loss: 1.6171 Explore P: 0.1064
Episode: 948 Total reward: 150.0 Training loss: 3.0087 Explore P: 0.1049
Episode: 949 Total reward: 146.0 Training loss: 45.3638 Explore P: 0.1036
Episode: 951 Total reward: 80.0 Training loss: 24.4540 Explore P: 0.1010
Episode: 952 Total reward: 168.0 Training loss: 31.4414 Explore P: 0.0995
Episode: 953 Total reward: 169.0 Training loss: 0.9907 Explore P: 0.0980
Episode: 954 Total reward: 174.0 Training loss: 133.5918 Explore P: 0.0964
Episode: 956 Total reward: 86.0 Training loss: 126.7740 Explore P: 0.0940
Episode: 957 Total reward: 170.0 Training loss: 32.8894 Explore P: 0.0926
Episode: 958 Total reward: 152.0 Training loss: 1.0307 Explore P: 0.0913
Episode: 959 Total reward: 159.0 Training loss: 1.7562 Explore P: 0.0901
Episode: 960 Total reward: 170.0 Training loss: 1.8498 Explore P: 0.0887
Episode: 961 Total reward: 162.0 Training loss: 1.5826 Explore P: 0.0874
Episode: 962 Total reward: 144.0 Training loss: 268.1501 Explore P: 0.0863
Episode: 963 Total reward: 169.0 Training loss: 0.7906 Explore P: 0.0851
Episode: 964 Total reward: 134.0 Training loss: 1.0930 Explore P: 0.0841
Episode: 965 Total reward: 154.0 Training loss: 1.9819 Explore P: 0.0829
Episode: 966 Total reward: 198.0 Training loss: 0.9171 Explore P: 0.0815
Episode: 967 Total reward: 153.0 Training loss: 3.4019 Explore P: 0.0804
Episode: 968 Total reward: 138.0 Training loss: 2.9622 Explore P: 0.0794
Episode: 969 Total reward: 163.0 Training loss: 2.6915 Explore P: 0.0783
Episode: 970 Total reward: 131.0 Training loss: 1.5915 Explore P: 0.0774
Episode: 971 Total reward: 147.0 Training loss: 1.9916 Explore P: 0.0765
Episode: 972 Total reward: 163.0 Training loss: 165.3958 Explore P: 0.0754
Episode: 973 Total reward: 125.0 Training loss: 181.0614 Explore P: 0.0746
Episode: 974 Total reward: 125.0 Training loss: 1.3310 Explore P: 0.0738
Episode: 975 Total reward: 137.0 Training loss: 1.1295 Explore P: 0.0729
Episode: 976 Total reward: 125.0 Training loss: 1.7053 Explore P: 0.0721
Episode: 977 Total reward: 147.0 Training loss: 2.8502 Explore P: 0.0712
Episode: 978 Total reward: 130.0 Training loss: 2.0895 Explore P: 0.0704
Episode: 979 Total reward: 118.0 Training loss: 2.9704 Explore P: 0.0697
Episode: 980 Total reward: 115.0 Training loss: 2.1063 Explore P: 0.0690
Episode: 981 Total reward: 126.0 Training loss: 2.1950 Explore P: 0.0683
Episode: 982 Total reward: 121.0 Training loss: 2.9289 Explore P: 0.0676
Episode: 983 Total reward: 129.0 Training loss: 2.5116 Explore P: 0.0668
Episode: 984 Total reward: 141.0 Training loss: 217.4426 Explore P: 0.0661
Episode: 985 Total reward: 113.0 Training loss: 202.0555 Explore P: 0.0654
Episode: 986 Total reward: 103.0 Training loss: 5.0593 Explore P: 0.0649
Episode: 987 Total reward: 34.0 Training loss: 3.7027 Explore P: 0.0647
Episode: 988 Total reward: 108.0 Training loss: 1.5037 Explore P: 0.0641
Episode: 989 Total reward: 105.0 Training loss: 3.0609 Explore P: 0.0635
Episode: 990 Total reward: 31.0 Training loss: 214.1037 Explore P: 0.0634
Episode: 991 Total reward: 102.0 Training loss: 245.1903 Explore P: 0.0628
Episode: 992 Total reward: 110.0 Training loss: 0.9902 Explore P: 0.0622
Episode: 993 Total reward: 105.0 Training loss: 2.2175 Explore P: 0.0617
Episode: 994 Total reward: 111.0 Training loss: 2.7350 Explore P: 0.0611
Episode: 995 Total reward: 103.0 Training loss: 2.0451 Explore P: 0.0606
Episode: 996 Total reward: 103.0 Training loss: 1.9076 Explore P: 0.0601
Episode: 997 Total reward: 33.0 Training loss: 0.9464 Explore P: 0.0599
Episode: 998 Total reward: 98.0 Training loss: 1.3951 Explore P: 0.0594
Episode: 999 Total reward: 98.0 Training loss: 264.1575 Explore P: 0.0589

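Visualizing the training results

Below we plot the total reward for each episode. The rolling average is shown in blue, and the raw per-episode totals are shown in grey.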

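%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Prefix sums with a leading 0, so cumsum[i] is the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two prefix sums N apart is the sum of one window of length N
    return (cumsum[N:] - cumsum[:-N]) / N

# rewards_list holds (episode, total_reward) pairs collected during training
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
# The smoothed series is N - 1 points shorter, so align it with the last episodes
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
Text(0,0.5,'Total Reward')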
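
To see why the cumulative-sum trick in running_mean produces a moving average, here is a small standalone check. The reward values are made up for illustration and are not taken from the training run above. The result is compared against np.convolve, which computes the same windowed mean more directly.

```python
import numpy as np

def running_mean(x, N):
    # Prefix sums with a leading 0: cumsum[i] = x[0] + ... + x[i-1]
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Subtracting prefix sums N apart leaves the sum of each length-N window
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical per-episode rewards, chosen only to keep the arithmetic easy
rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = running_mean(rewards, 3)
print(smoothed)  # [20. 30. 40.]

# Cross-check with an explicit sliding-window average
reference = np.convolve(rewards, np.ones(3) / 3, mode='valid')
assert np.allclose(smoothed, reference)
```

The output has len(x) - N + 1 entries, which is why the plotting code slices eps with eps[-len(smoothed_rews):] before drawing the smoothed curve.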

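Playing arcade games

Cart-Pole is a very simple game. However, the same model can be used to train an agent to play far more complex games, such as Pong or Space Invaders. Instead of the four-value state used in this game, you would need convolutional layers to extract the state from screen images.

As a challenge, I'll leave it to you to train an agent to play an arcade game using deep Q-learning. For guidance, see the original paper: https://s3.cn-north-1.amazonaws.com.cn/static-documents/nd101/DLND+documents/nature14236.pdf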
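
As a starting point for the challenge, the sketch below shows one way screen images might be turned into a state tensor that convolutional layers can consume. It is a simplified illustration and not the paper's exact pipeline: the helper names preprocess_frame and FrameStack are hypothetical, the 210x160 RGB frame size is an assumption about the emulator output, and grayscale conversion by channel averaging plus stride-based downsampling stands in for more careful image processing. Stacking several recent frames gives the network a sense of motion, which a single still image cannot provide.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame, factor=2):
    """Convert an RGB frame (H, W, 3) to a downsampled grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2) / 255.0  # average the colour channels
    return gray[::factor, ::factor]                       # keep every factor-th pixel

class FrameStack:
    """Keep the most recent frames and expose them as one (H, W, num_frames) state."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(frame)
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Random pixels stand in for a real game screen (assumed 210x160 RGB)
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(num_frames=4)
state = stack.reset(preprocess_frame(raw))
print(state.shape)  # (105, 80, 4)
```

A state of this shape would replace the four-value Cart-Pole state as the network input, with convolutional layers placed before the fully connected layers that output one Q-value per action.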

posted on 2019-03-11 20:47  paulonetwo
