Basic Concepts

State

\[s_i\quad, \quad S = \{s_i\} \]

A state and the state space (the set of all states).

Action

\[a_i \quad , \quad A = \{a_i\} \]

An action and the action space (the set of all actions).
Can be written in a tabular representation.

Policy

\[\pi \quad , \quad \pi (a_i | s_j) = c_{k} \]

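Gives the probability of taking each action in a given state.
For any single state, the probabilities over all actions sum to 1.
Can be written in a tabular representation.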
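
As a rough illustration of the tabular form, a policy can be stored as a nested table indexed by state and then by action. The sketch below is only an assumption about one convenient layout: the state and action names and the helpers `is_valid_policy` and `sample_action` are hypothetical, not part of the notes.

```python
import random

# Hypothetical tabular policy: policy[s][a] = pi(a | s).
policy = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5},
    "s2": {"a1": 0.0, "a2": 0.0, "a3": 1.0},
}

def is_valid_policy(policy, tol=1e-9):
    """Check that every row is a probability distribution over actions."""
    for probs in policy.values():
        if any(p < 0 for p in probs.values()):
            return False
        if abs(sum(probs.values()) - 1.0) > tol:
            return False
    return True

def sample_action(policy, state, rng=random):
    """Draw one action from pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `sample_action(policy, "s1")` returns either `"a2"` or `"a3"`, each with probability 0.5 under this table.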

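Deterministic policy

For a state \(s_j\), one action \(a_i\) has probability 1 and every other action has probability 0.

Stochastic policy

For at least one state, no single action has probability 1, so the action taken there is random.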
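
The distinction can be checked mechanically on a tabular policy. The following is a minimal sketch, assuming the policy is stored as a nested dictionary `policy[s][a]` whose rows already sum to 1; the function name `is_deterministic` and the two example tables are made up for illustration.

```python
def is_deterministic(policy, tol=1e-9):
    """True if, in every state, some action has probability 1.

    Assumes each row is a valid distribution, so the remaining
    actions in that state necessarily have probability 0.
    """
    return all(
        any(abs(p - 1.0) <= tol for p in probs.values())
        for probs in policy.values()
    )

# Hypothetical examples: policy[s][a] = pi(a | s).
deterministic = {
    "s1": {"a1": 1.0, "a2": 0.0},
    "s2": {"a1": 0.0, "a2": 1.0},
}
stochastic = {
    "s1": {"a1": 0.5, "a2": 0.5},  # no action has probability 1 here
    "s2": {"a1": 0.0, "a2": 1.0},
}
```

Under this check a deterministic policy is simply a special case of a stochastic one, which is why both fit in the same table.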

Reward

positive reward -> encouragement
negative reward -> punishment

\[p(r=-1|s_1, a_1) = 1 \quad \& \quad p(r \neq -1 | s_1,a_1) = 0 \]

Discount rate

\[\gamma \in [0,1) \]

Discounted return

\[\begin{align} \text{discounted return} &= r_1 + \gamma r_2 + \gamma ^2 r_3 + \gamma ^3 r_4 + \gamma ^4 r_5 + \gamma ^5 r_6 + \dots \\ \text{In the case: }& r_1 =0 , r_2=0 , r_3=0 , r_4=1 , r_5=1 , r_6=1 , \dots \\ \text{discounted return} &= \gamma ^3 (1+ \gamma + \gamma ^2 + \dots) \\ &=\gamma ^3 \frac{1}{1-\gamma}. \end{align} \]
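
The closed form can be sanity-checked numerically. This is a small sketch, assuming the reward sequence 0, 0, 0, 1, 1, 1, ... from the example, a hypothetical choice of \(\gamma = 0.9\), and a truncation at 1000 steps to stand in for the infinite sum.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r over a finite reward sequence (t starts at 0)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

gamma = 0.9  # assumed value, any gamma in [0, 1) works
# Three zero rewards followed by a long run of ones (truncated series).
rewards = [0, 0, 0] + [1] * 997
approx = discounted_return(rewards, gamma)
closed_form = gamma ** 3 / (1 - gamma)
```

For this \(\gamma\) both `approx` and `closed_form` come out at about 7.29; the truncation error is of order \(\gamma^{1000}\) and is negligible.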

Roles of the discount rate:

It keeps the sum finite: for bounded rewards and \(\gamma < 1\) the infinite series converges;
It balances far-future and near-future rewards:
- If \(\gamma\) is close to 0, the discounted return is dominated by the rewards obtained in the near future.
- If \(\gamma\) is close to 1, rewards obtained in the far future keep substantial weight, so the discounted return is dominated by them.
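
To make the balance concrete, one can measure how much of the discounted return comes from the first few steps. The sketch below assumes a constant reward of 1 at every step and a long finite horizon as a stand-in for the infinite sum; the helper name `near_future_share` is hypothetical.

```python
def near_future_share(gamma, k, horizon=5000):
    """Fraction of the discounted return contributed by the first k steps,
    assuming a constant reward of 1 per step (illustrative assumption)."""
    weights = []
    w = 1.0
    for _ in range(horizon):
        weights.append(w)  # weight of step t is gamma**t
        w *= gamma
    return sum(weights[:k]) / sum(weights)

share_small_gamma = near_future_share(0.1, k=3)
share_large_gamma = near_future_share(0.99, k=3)
```

With \(\gamma = 0.1\) the first three steps account for roughly 99.9% of the return, while with \(\gamma = 0.99\) they account for only about 3%, so almost all of the value comes from later rewards.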

Markov decision process (MDP)

Markov property: the memoryless property (the next state and reward do not depend on the earlier history)

\[\begin{aligned} p(s_{t+1}|s_t,a_t, \dots ,s_0,a_0) &= p(s_{t+1}|s_t,a_t), \\ p(r_{t+1}|s_t,a_t, \dots ,s_0,a_0) &= p(r_{t+1}|s_t,a_t). \end{aligned} \]

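In a Markov decision process, state transitions and rewards are probabilistic and depend on the action taken.
Once a policy is given, a Markov decision process becomes a Markov process.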
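
Fixing a policy averages the action out of the transition model, leaving a plain Markov chain over states with \(p_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s,a)\). The sketch below illustrates this on a made-up two-state, two-action model; the table layout and the name `induced_chain` are assumptions for illustration only.

```python
# Hypothetical model: model[s][a][s_next] = p(s_next | s, a).
model = {
    "s1": {"a1": {"s1": 1.0, "s2": 0.0}, "a2": {"s1": 0.0, "s2": 1.0}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 0.0, "s2": 1.0}},
}
# Hypothetical policy: policy[s][a] = pi(a | s).
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},
    "s2": {"a1": 0.0, "a2": 1.0},
}

def induced_chain(model, policy):
    """Return chain[s][s_next] = sum over a of pi(a | s) * p(s_next | s, a)."""
    chain = {}
    for s, actions in model.items():
        row = {s_next: 0.0 for s_next in model}
        for a, next_probs in actions.items():
            for s_next, p in next_probs.items():
                row[s_next] += policy[s][a] * p
        chain[s] = row
    return chain

chain = induced_chain(model, policy)
```

Each row of `chain` is again a probability distribution over next states, and it no longer mentions actions, which is what makes the result a Markov process rather than a decision process.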

Posted on 2023-06-10 10:59 by POLAYOR
