Q-Learning

https://www.geeksforgeeks.org/q-learning-in-python/

Reinforcement Learning, briefly, is a learning paradigm in which an agent learns over time to behave optimally in an environment by interacting with it continuously. During learning, the agent encounters various situations in the environment. These are called states. In each state the agent may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time the agent learns to maximize these rewards so that it behaves optimally in whatever state it is in.

Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

Q-Values or Action-Values: Q-values are defined for states and actions. $Q(S, A)$ is an estimate of how good it is to take action $A$ in state $S$. This estimate of $Q(S, A)$ is computed iteratively using the TD-Update rule described in the sections that follow.

Rewards and Episodes: Over its lifetime, an agent starts from a start state and makes a series of transitions from its current state to a next state, determined by its choice of action and by the environment it interacts with. At every step, the agent takes an action in a state, observes a reward from the environment, and then transitions to another state. If the agent ends up in one of the terminating states, no further transitions are possible, and the episode is said to be complete.

Temporal Difference or TD-Update:
The Temporal Difference or TD-Update rule can be represented as follows:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$

This update rule for estimating Q is applied at every time step of the agent's interaction with the environment. The terms used are explained below:

$S$ : Current State of the agent.

$A$ : Current Action Picked according to some policy.

$S'$ : Next State where the agent ends up.

$A'$ : Next best action to be picked using current Q-value estimation, i.e. pick the action with the maximum Q-value in the next state.

$R$ : Current reward observed from the environment in response to the current action.

$\gamma$ ($0 < \gamma \le 1$) : Discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since a Q-value is an estimate of the expected rewards from a state, the discounting rule applies here as well.

$\alpha$ : Step size (learning rate) used to update the estimate of $Q(S, A)$.
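The terms above combine into a single tabular update. The following is a minimal sketch rather than a reference implementation: the function name `td_update` and the `terminal` flag are assumptions made for illustration, and the Q-table is assumed to be a NumPy array indexed as `q_table[state, action]`.

```python
import numpy as np


def td_update(q_table, state, action, reward, next_state, alpha, gamma, terminal=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Value of the greedy next action A'; taken as 0 when S' is terminal (assumption).
    next_value = 0.0 if terminal else np.max(q_table[next_state])
    td_target = reward + gamma * next_value
    td_error = td_target - q_table[state, action]
    # Move Q(S, A) a fraction alpha of the way towards the target.
    q_table[state, action] += alpha * td_error
    return td_error


q = np.zeros((3, 2))          # 3 states, 2 actions
q[1] = [0.5, 1.0]             # pretend state 1 already has some estimates
err = td_update(q, state=0, action=1, reward=0.0, next_state=1, alpha=0.1, gamma=0.9)
print(err, q[0, 1])
```

Here the target is built from the best action in the next state, regardless of which action the behavior policy actually takes next.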

Choosing the Action to take using $\epsilon$-greedy policy:
The $\epsilon$-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It goes as follows:

With probability $(1-\epsilon)$ choose the action which has the highest Q-value.

With probability $\epsilon$ choose any action at random.
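The two rules above can be sketched in a few lines. This is only an illustration: the function name `epsilon_greedy`, the fixed `epsilon` of 0.1 and the example Q-values are assumptions, not values from the article.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from one row of Q-values."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random (may also hit the greedy one).
        return int(rng.integers(len(q_values)))
    # Exploit: the action with the highest current Q-value.
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0])   # hypothetical Q-values for one state
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)
```

Because the random branch can also return the greedy action, the greedy action is chosen with probability $1 - \epsilon + \epsilon / 4$ for four actions, slightly more than $1 - \epsilon$.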

Plain-language explanation

https://zhuanlan.zhihu.com/p/110338833

https://zhuanlan.zhihu.com/p/110410276

Frozen Lake

https://www.gymlibrary.dev/environments/toy_text/frozen_lake/

This environment is part of the Toy Text environments. Please read that page first for general information.

Action Space: Discrete(4)

Observation Space: Discrete(16)

Import: gym.make("FrozenLake-v1")

Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H) by walking over the Frozen(F) lake. The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

Action Space

The agent takes a 1-element vector for actions. The action space is (dir), where dir is the direction to move in and can be one of:

0: LEFT

1: DOWN

2: RIGHT

3: UP

Observation Space

The observation is a value representing the agent’s current position as current_row * ncols + current_col (where both the row and col start at 0; on a square map ncols equals nrows). For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map. For example, the 4x4 map has 16 possible observations.
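The position-to-observation mapping can be written out directly. In this sketch the helper names `to_state` and `to_row_col` are made up for illustration (they are not part of the Gym API), and the stride is the number of columns, which equals the number of rows on the square 4x4 map.

```python
def to_state(row, col, ncols):
    """Flatten a zero-based (row, col) grid position into a single observation index."""
    return row * ncols + col


def to_row_col(state, ncols):
    """Inverse mapping: recover (row, col) from an observation index."""
    return divmod(state, ncols)


print(to_state(3, 3, 4))    # 15, the goal cell of the 4x4 map
print(to_row_col(15, 4))    # (3, 3)
```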

Rewards

Reward schedule:

Reach goal(G): +1

Reach hole(H): 0

Reach frozen(F): 0
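Since the only non-zero reward is the +1 at the goal, the discounted return of an episode depends only on how many steps it took to get there. A small sketch, assuming a discount factor of 0.99 and a hypothetical successful episode of six steps (the helper name `discounted_return` is illustrative):

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per step of delay."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


success = [0, 0, 0, 0, 0, 1]   # five frozen tiles, then the goal
failure = [0, 0, 0]            # fell into a hole on the third step
print(discounted_return(success, 0.99))
print(discounted_return(failure, 0.99))
```

With this reward schedule, falling into a hole and wandering until the episode ends both return 0, so the agent only gets a learning signal from episodes that reach the goal.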



env definition

https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

Q-Learning based on gym

https://github.com/yahsiuhsieh/frozen-lake/blob/main/src/q_learning.py

import gym
import numpy as np

# testPolicy and plot come from the linked repository's local utils module
from utils import testPolicy, plot


def epsilonGreedyExplore(env, state, Q_table, e, episodes):
    """
    epsilon-greedy exploration strategy with a linearly decaying epsilon
    : param env: object, gym environment
    : param state: int, current state
    : param Q_table: ndarray, Q table
    : param e: int, current episode
    : param episodes: int, total number of episodes
    : return: action: int, chosen action {0,1,2,3}
    """
    # epsilon decays linearly from 1 in the first episode towards 0 in the last
    epsilon = 1 - e / episodes
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # explore: random action
    else:
        action = np.argmax(Q_table[state, :])  # exploit: greedy action
    return action


def softmaxExplore(env, state, Q_table, tau=1):
    """
    Softmax exploration strategy
    : param env: object, gym environment
    : param state: int, current state
    : param Q_table: ndarray, Q table
    : param tau: float, softmax temperature (higher values give more uniform choices)
    : return: action: int, chosen action {0,1,2,3}
    """
    num_action = env.action_space.n
    # Boltzmann weights exp(Q / tau), normalized into a probability vector
    weights = np.exp(Q_table[state, :] / tau)
    action_prob = weights / np.sum(weights)
    # sample over all available actions instead of a hard-coded [0, 1, 2, 3]
    action = np.random.choice(num_action, p=action_prob)
    return action


def Qlearning(
    env, alpha, gamma, episodes=5000, evaluate_policy=True, strategy="epsilon-greedy"
):
    """
    Q learning
    : param env: object, gym environment
    : param alpha: float, learning rate (step size)
    : param gamma: float, discount factor
    : param episodes: int, training episode, defaults to 5000
    : param evaluate_policy: bool, set to False to disable recording success rate
    : param strategy: string, different exploration strategy, 'epsilon-greedy' or 'softmax'
    : return:
        policy: ndarray, a deterministic policy
        success_rate: list, success rate recorded every 100 episodes
    """
    # get size of state and action space
    num_state = env.observation_space.n
    num_action = env.action_space.n

    # init
    success_rate = []
    policy = np.zeros(num_state, dtype=int)
    Q_table = np.random.rand(num_state, num_action)

    for i in range(episodes):
        # classic Gym API (gym < 0.26): reset() returns only the state
        state = env.reset()
        done = False
        while not done:
            # choose action, 'epsilon-greedy' or 'softmax'
            if strategy == "epsilon-greedy":
                action = epsilonGreedyExplore(env, state, Q_table, i, episodes)
            else:
                action = softmaxExplore(env, state, Q_table)

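            # classic Gym API (gym < 0.26): step() returns a 4-tuple
            new_state, reward, done, _ = env.step(action)

            # update Q table; a terminal state has no future value, so the
            # bootstrap term is dropped when the episode has ended
            Q_table[state, action] += alpha * (
                reward
                + gamma * np.max(Q_table[new_state, :]) * (not done)
                - Q_table[state, action]
            )
            state = new_state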

        # get deterministic policy from Q table (greedy action per state)
        policy = np.argmax(Q_table, axis=1)

        # get success rate of current policy every 100 episodes
        if evaluate_policy and i % 100 == 0:
            success_rate.append(testPolicy(policy))
    return policy, success_rate


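if __name__ == "__main__":
    # "FrozenLake-v0" is the id used by the linked script;
    # newer Gym releases register the environment as "FrozenLake-v1"
    env = gym.make("FrozenLake-v0")
    env.reset()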

    # test different alpha with fixed gamma(0.99)
    alphas = [0.05, 0.1, 0.25, 0.5]
    for alpha in alphas:
        _, success_rate = Qlearning(env, alpha=alpha, gamma=0.99)
        print(
            "alpha = {}, gamma = {}: {:.1f}%".format(
                alpha, 0.99, success_rate[-1] * 100
            )
        )
        plot(
            success_rate,
            "Average success rate v.s. Episode (alpha={}, gamma=0.99)".format(alpha),
        )

    # test different gamma with fixed alpha(0.05)
    gammas = [0.9, 0.95, 0.99]
    for gamma in gammas:
        _, success_rate = Qlearning(env, alpha=0.05, gamma=gamma)
        print(
            "alpha = {}, gamma = {}: {:.1f}%".format(
                0.05, gamma, success_rate[-1] * 100
            )
        )
        plot(
            success_rate,
            "Average success rate v.s. Episode (alpha=0.05, gamma={})".format(gamma),
        )
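The `tau` parameter of `softmaxExplore` above controls how strongly the sampling favors high Q-values. The sketch below is a standalone illustration, not code from the linked repository: the helper name `softmax_probs` and the example Q-values are assumptions, and subtracting the row maximum before exponentiating is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np


def softmax_probs(q_values, tau=1.0):
    """Boltzmann action probabilities for one row of Q-values."""
    q = np.asarray(q_values, dtype=float)
    # Shifting by the max keeps exp() from overflowing without changing the result.
    weights = np.exp((q - q.max()) / tau)
    return weights / weights.sum()


q_row = [0.2, 0.5, 0.1, 0.4]   # hypothetical Q-values for one state
for tau in (1.0, 0.1):
    print(tau, np.round(softmax_probs(q_row, tau), 3))
```

A large `tau` gives nearly uniform exploration, while a small `tau` approaches the greedy choice, so `tau` plays a role similar to the decaying epsilon in the epsilon-greedy strategy.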

posted @ 2022-11-14 22:38 lightsong
