Reinforcement Learning Example: Q-Learning Code (Exploring for a Target in a One-Dimensional Scenario)

Preface

1 Q-Learning Algorithm Implementation

First, you need to know the Q table and its update rule:

  • The Q table: it is indexed by state and action, and each entry Q(s, a) is the estimated value of taking action a in state s
  • The Q-table update: Q(s1, a2) ← Q(s1, a2) + lr * diff, where diff (the gap) = reality − estimate = R + gamma * max_a Q(s2, a) − Q(s1, a2); here lr is the learning rate and gamma the reward discount factor (a worked numeric example of one update follows this list)
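
As a concrete illustration of the update, here is one hand-computed step with made-up numbers (the Q values and reward below are assumptions chosen only to show the arithmetic; lr and gamma match the script later in this post):

lr, gamma = 0.1, 0.9
q_s1_a2 = 0.2    # assumed current estimate Q(s1, a2)
reward = 0.5     # assumed reward R from the environment
max_q_s2 = 1.0   # assumed max_a Q(s2, a)

q_target = reward + gamma * max_q_s2   # "reality": R + gamma * max_a Q(s2, a) = 1.4
diff = q_target - q_s1_a2              # reality - estimate = 1.2
q_s1_a2 = q_s1_a2 + lr * diff          # new estimate: 0.2 + 0.1 * 1.2 = 0.32
print(q_s1_a2)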

Then, the algorithm works as follows (a compact runnable sketch of these three steps appears right after this list):

  • In the current state, choose an action either according to the Q table or at random
  • Take that action and receive the environment's feedback (the next state and the reward)
  • Update the Q table with that feedback
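
The three steps map directly onto the functions in the code section below. As a minimal, self-contained sketch (a toy 2-state table with a hard-coded feedback step, used only to show the shape of one iteration; none of these names come from the original script):

import numpy as np
import pandas as pd

alpha, gamma, epsilon = 0.1, 0.9, 0.9
actions = [-1, 1]
q = pd.DataFrame(np.zeros((2, len(actions))), columns=actions)

state = 0
# step 1: epsilon-greedy action selection (explore if unexplored or with probability 1 - epsilon)
if np.random.uniform() > epsilon or (q.iloc[state, :] == 0).all():
    action = np.random.choice(actions)
else:
    action = q.iloc[state, :].idxmax()
# step 2: environment feedback (hard-coded here purely for illustration)
next_state, reward = 1, 1.0
# step 3: move Q(state, action) toward R + gamma * max_a Q(next_state, a)
q_target = reward + gamma * q.loc[next_state, :].max()
q.loc[state, action] += alpha * (q_target - q.loc[state, action])
print(q)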

2 One-Dimensional Scenario

Build a one-dimensional scenario: #----T. The agent # can only move left or right, and the episode ends once it reaches the target position T. The implementation follows the reference listed at the end of this post.
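
As a quick sanity check of the layout, this snippet (using the same rendering logic as update_env in the code below) prints the world for each non-terminal agent position, with # marking the agent and T the target:

N_STATES = 6
for state in range(N_STATES - 1):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    env_list[state] = '#'
    print(''.join(env_list))

# prints:
# #----T
# -#---T
# --#--T
# ---#-T
# ----#T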

Code

import numpy as np
import pandas as pd
import time

N_STATES = 6  # length of the one-dimensional world, i.e. 6 states
ACTIONS = [-1, 1]  # two actions, -1: left, 1: right
epsilon = 0.9  # epsilon-greedy: probability of exploiting the best known action
alpha = 0.1  # learning rate
gamma = 0.9  # reward discount factor
max_episodes = 10  # maximum number of episodes
fresh_time = 0.3  # pause between moves when rendering

# q_table: one row per state, one column per action
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)


# choose action: explore randomly (with probability 1 - epsilon) or when the state is
# still unexplored; otherwise exploit the action with the largest Q value
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()  # column label, i.e. -1 or 1
    return action


def get_env_feedback(state, action):
    # take one step in the environment and return (new_state, reward)
    new_state = state + action
    reward = 0
    if action > 0:  # small shaping reward for moving toward the target
        reward += 0.5
    if action < 0:  # small penalty for moving away from it
        reward -= 0.5
    if new_state == N_STATES - 1:  # reached the target T
        reward += 1
    if new_state < 0:  # stepped off the left edge: stay at 0 and take a penalty
        new_state = 0
        reward -= 1
    return new_state, reward


def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the target
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)


def q_learning():
    for epoch in range(max_episodes):
        step = 0  # number of moves taken in this episode
        state = 0  # initial state
        update_env(state, epoch, step)
        while state != N_STATES - 1:
            cur_action = choose_action(state, q_table)
            new_state, reward = get_env_feedback(state, cur_action)
            q_pred = q_table.loc[state, cur_action]  # current estimate of Q(state, action)
            if new_state != N_STATES - 1:
                q_target = reward + gamma * q_table.loc[new_state, :].max()
            else:
                q_target = reward  # terminal state: no future reward to add
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state = new_state
            step += 1  # count the move before rendering so the final step count is accurate
            update_env(state, epoch, step)
    return q_table


q_learning()
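
If you also want to inspect the learned values, capture the return value instead of the bare call above (just a usage note, not part of the original script):

trained_q_table = q_learning()
print("\nlearned Q table:")
print(trained_q_table)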

Reference

MorvanZhou/Reinforcement-learning-with-tensorflow
