【Python】DQN on CartPole-v1

DQN (Deep Q-Network) is a reinforcement learning method that extends Q-Learning.

By introducing deep neural networks, experience replay, and a target network, it allows the Q-Learning algorithm to be applied to high-dimensional, continuous state spaces, overcoming the limitations of classic tabular Q-Learning in such settings.

Q-Learning itself was covered in an earlier article.

Key points of the algorithm:

1. Estimating the state-action value function with a deep network: Following the idea of Q-Learning, DQN estimates a Q-function Q(s, a), the expected return obtained by taking action a in state s; instead of a table, a deep neural network is used to approximate this function.

2. Experience replay: To break the correlation between consecutive samples and improve sample efficiency, DQN introduces a replay buffer. As the agent interacts with the environment, the experience (s, a, r, s') from each time step is stored in the buffer, and each network update trains on a mini-batch sampled at random from it (see the buffer sketch after this list).

3. Target network: DQN uses two neural networks: an online network (used to select actions) and a target network (used to compute target Q-values). The target network's parameters are copied from the online network only every so often, which stabilizes training. The target Q-value is computed as y = r + γ·max_a' Q_target(s', a'), where r is the reward, γ is the discount factor, and Q_target is the target network (see the sketch after this list).
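
As a standalone illustration of point 2, here is a minimal replay-buffer sketch using collections.deque (the full program below uses a plain Python list with pop(0), which behaves the same way); the ReplayBuffer class and its method names are just for illustration:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # a deque with maxlen silently drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        return zip(*batch)      # (states, actions, rewards, next_states)

    def __len__(self):
        return len(self.buffer)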
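
And for point 3, a minimal, self-contained sketch of the target computation with dummy data; here nn.Linear layers stand in for the real networks and the batch tensors are random, purely to show the shapes involved:

import torch
import torch.nn as nn

gamma = 0.9
q_net = nn.Linear(4, 2)           # stand-in for the online network
target_q_net = nn.Linear(4, 2)    # stand-in for the target network

s0 = torch.randn(64, 4)                 # states
a0 = torch.randint(0, 2, (64, 1))       # actions taken
r1 = torch.randn(64, 1)                 # rewards
s1 = torch.randn(64, 4)                 # next states

# y = r + gamma * max_a' Q_target(s', a'); no gradient flows into the target net
with torch.no_grad():
    q_target = r1 + gamma * target_q_net(s1).max(dim=1, keepdim=True)[0]

q_value = q_net(s0).gather(1, a0)       # Q(s, a) for the actions actually taken
loss = nn.MSELoss()(q_value, q_target)

# every so often, copy the online weights into the target network
target_q_net.load_state_dict(q_net.state_dict())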

The full code is as follows:

import gym
import random
import warnings

import torch
import torch.nn as nn
import torch.optim as optim

# Note: this script uses the old Gym API (gym < 0.26), where reset() returns
# only the observation and step() returns (obs, reward, done, info).
warnings.filterwarnings("ignore")

# Simple 3-layer MLP: maps a state vector to one Q-value per action.
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = torch.relu(self.linear2(x))
        x = self.linear3(x)
        return x

if __name__ == '__main__':

    negative_reward = -10.0     # base penalty used in the shaped reward
    positive_reward = 10.0      # (unused here)
    x_bound = 1.0               # cart position range considered "good"
    gamma = 0.9                 # discount factor
    batch_size = 64
    capacity = 1000             # replay buffer capacity
    buffer = []                 # replay buffer; oldest transitions dropped when full
    env = gym.make('CartPole-v1')

    state_space_num = env.observation_space.shape[0]   # 4 state variables: x, x_dot, theta, theta_dot
    action_space_dim = env.action_space.n               # 2 discrete actions

    q_net = Net(state_space_num, 256, action_space_dim)         # online network
    target_q_net = Net(state_space_num, 256, action_space_dim)  # target network

    optimizer = optim.Adam(q_net.parameters(), lr=5e-4)

    for i in range(3000):
        state = env.reset()

        step = 0
        while True:
            # env.render()
            step += 1

            # epsilon-greedy action selection; epsilon decays with the episode index
            epsi = 1.0 / (i + 1)
            if random.random() < epsi:
                action = random.randrange(action_space_dim)
            else:
                state_tensor = torch.tensor(state, dtype=torch.float).view(1, -1)
                action = torch.argmax(q_net(state_tensor)).item()

            next_state, reward, done, _ = env.step(action)

            # Reward shaping: replace the default +1 with a reward that is higher
            # the closer the cart is to the center and the pole is to vertical.
            x, x_dot, theta, theta_dot = state
            if abs(x) > x_bound:
                r_x = 0.5 * negative_reward
            else:
                r_x = negative_reward * abs(x) / x_bound + 0.5 * (-negative_reward)
            if abs(theta) > env.theta_threshold_radians:
                r_theta = 0.5 * negative_reward
            else:
                r_theta = negative_reward * abs(theta) / env.theta_threshold_radians + 0.5 * (-negative_reward)
            reward = r_x + r_theta
            # extra penalty when the episode terminates early (pole fell or cart left the track)
            if done and step < 499:
                reward += negative_reward

            # store the transition, discarding the oldest one when the buffer is full
            if len(buffer) == capacity:
                buffer.pop(0)
            buffer.append((state, action, reward, next_state))

            state = next_state

            # start learning only once enough transitions have been collected
            if len(buffer) < batch_size:
                continue

            # sample a random mini-batch of transitions from the replay buffer
            samples = random.sample(buffer, batch_size)
            s0, a0, r1, s1 = zip(*samples)

            s0 = torch.tensor(s0, dtype=torch.float)                        # states
            a0 = torch.tensor(a0, dtype=torch.long).view(batch_size, 1)     # actions
            r1 = torch.tensor(r1, dtype=torch.float).view(batch_size, 1)    # rewards
            s1 = torch.tensor(s1, dtype=torch.float)                        # next states

            # Q(s, a) for the actions actually taken (online network)
            q_value = q_net(s0).gather(1, a0)
            # target: y = r + gamma * max_a' Q_target(s', a'); detach() keeps
            # gradients from flowing into the target network
            q_target = r1 + gamma * torch.max(target_q_net(s1).detach(), dim=1)[0].view(batch_size, -1)

            loss_fn = nn.MSELoss()
            loss = loss_fn(q_value, q_target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # sync the target network with the online network
            # (here: on every learning step of every 10th episode)
            if i % 10 == 0:
                target_q_net.load_state_dict(q_net.state_dict())

            if done:
                print(i, step)    # episode index and number of steps survived
                break

    env.close()

After roughly 100-plus episodes, training typically stabilizes at the full 500 steps per episode.
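
To check this yourself, you can roll out the trained q_net greedily (no exploration) for a few episodes; a minimal sketch, assuming the training script above has just run in the same session and the same old Gym API (it re-creates the environment because env.close() was already called):

eval_env = gym.make('CartPole-v1')
for episode in range(5):
    state = eval_env.reset()
    steps = 0
    done = False
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float).view(1, -1)
        action = torch.argmax(q_net(state_tensor)).item()   # greedy action, no epsilon
        state, reward, done, _ = eval_env.step(action)
        steps += 1
    print('evaluation episode', episode, 'survived', steps, 'steps')
eval_env.close()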
