【RL】CH1-Basic Concepts

1.7 Markov decision processes

This section presents the concepts introduced earlier in this chapter in a more formal way, under the framework of Markov decision processes (MDPs).

An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.

Sets:

  • State space: the set of all states, denoted as $\mathcal{S}$.
  • Action space: a set of actions, denoted as $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$.
  • Reward set: a set of rewards, denoted as $\mathcal{R}(s,a)$, associated with each state-action pair $(s,a)$.

Model:

  • State transition probability: At state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s' \mid s,a)$. It holds that $\sum_{s' \in \mathcal{S}} p(s' \mid s,a) = 1$ for any $(s,a)$.
  • Reward probability: At state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r \mid s,a)$. It holds that $\sum_{r \in \mathcal{R}(s,a)} p(r \mid s,a) = 1$ for any $(s,a)$.

Policy: At state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$.
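The ingredients above can be written down directly as lookup tables. The following is a minimal sketch, assuming a made-up two-state MDP: the state names, action names, and all probabilities are hypothetical and chosen only for illustration. It stores $p(s' \mid s,a)$, $p(r \mid s,a)$, and $\pi(a \mid s)$ as dictionaries and checks the three normalization conditions.

```python
import math

# Hypothetical two-state MDP; every name and number here is illustrative.
STATES = ["s1", "s2"]                                        # state space S
ACTIONS = {"s1": ["stay", "move"], "s2": ["stay", "move"]}   # action space A(s)

# p(s' | s, a): each (s, a) maps to a distribution over next states.
P_NEXT = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}

# p(r | s, a): the keys of each inner dict form the reward set R(s, a).
P_REWARD = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 0.8, 0.0: 0.2},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {-1.0: 0.8, 0.0: 0.2},
}

# pi(a | s): each state maps to a distribution over its actions.
POLICY = {
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 0.9, "move": 0.1},
}


def is_distribution(dist):
    """Return True if all probabilities are nonnegative and sum to 1."""
    probs = list(dist.values())
    return all(p >= 0.0 for p in probs) and math.isclose(sum(probs), 1.0)


def check_mdp(states, actions, p_next, p_reward, policy):
    """Check the normalization conditions for every state and action."""
    for s in states:
        # The policy may only assign probability to actions in A(s).
        if not set(policy[s]) <= set(actions[s]):
            return False
        if not is_distribution(policy[s]):
            return False
        for a in actions[s]:
            # Next states must belong to S and their probabilities sum to 1.
            if not set(p_next[(s, a)]) <= set(states):
                return False
            if not is_distribution(p_next[(s, a)]):
                return False
            if not is_distribution(p_reward[(s, a)]):
                return False
    return True


print(check_mdp(STATES, ACTIONS, P_NEXT, P_REWARD, POLICY))
```

Plain dictionaries keep the correspondence with the notation visible; a larger problem would more likely use arrays indexed by state and action.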

Markov property: The Markov property refers to the memoryless property of a stochastic process. Mathematically, it means that

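$$
\begin{aligned}
p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) &= p(s_{t+1} \mid s_t, a_t),\\
p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) &= p(r_{t+1} \mid s_t, a_t),
\end{aligned}
\tag{1.4}
$$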

where $t$ represents the current time step and $t+1$ represents the next time step. Equation (1.4) indicates that the next state and the next reward depend only on the current state and action, and are independent of all earlier states and actions. The Markov property is important for deriving the fundamental Bellman equation of MDPs, as shown in the next chapter.
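One way to see the Markov property in code is to note what a simulator needs as input. The sketch below assumes a hypothetical two-state MDP (all names and probabilities are made up for illustration). Its `step` function receives only the current state and action, never the history, yet that is enough to sample the next state and reward.

```python
import random

# Hypothetical model and policy; all names and numbers are illustrative.
P_NEXT = {      # p(s' | s, a)
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}
P_REWARD = {    # p(r | s, a)
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 0.8, 0.0: 0.2},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {-1.0: 0.8, 0.0: 0.2},
}
POLICY = {      # pi(a | s)
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 0.9, "move": 0.1},
}


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(s, a, rng):
    """Sample (s_{t+1}, r_{t+1}) from the current (s_t, a_t) only."""
    s_next = sample(P_NEXT[(s, a)], rng)
    r = sample(P_REWARD[(s, a)], rng)
    return s_next, r


def rollout(s0, num_steps, rng):
    """Generate a trajectory of (s_t, a_t, r_{t+1}, s_{t+1}) tuples."""
    trajectory = []
    s = s0
    for _ in range(num_steps):
        a = sample(POLICY[s], rng)
        s_next, r = step(s, a, rng)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


for transition in rollout("s1", 5, random.Random(0)):
    print(transition)
```

If the process were not Markov, `step` would need the whole history $(s_0, a_0, \dots, s_t, a_t)$ as an argument instead of just `(s, a)`.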

Here, $p(s' \mid s,a)$ and $p(r \mid s,a)$ for all $(s,a)$ are called the model or dynamics. The model can be either stationary or nonstationary (in other words, time-invariant or time-variant). A stationary model does not change over time; a nonstationary model may vary over time. For instance, in the grid world example, if a forbidden area appears or disappears over time, the model is nonstationary. In this book, we only consider stationary models.
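The difference between the two kinds of model shows up in the function signature. The following sketch is purely illustrative: the cell names, the action name, the time window, and the reward of -1 for entering a forbidden area are all assumptions made for this example. A stationary reward model is a function of $(s,a)$ alone, whereas a nonstationary one also depends on the time step $t$.

```python
# Hypothetical setup: taking "right" in cell "c1" enters cell "c2".
# Entering a forbidden cell is assumed to give reward -1, otherwise 0.
FORBIDDEN_FROM, FORBIDDEN_UNTIL = 10, 20   # illustrative time window


def reward_dist_stationary(s, a):
    """p(r | s, a): cell c2 is always forbidden, so t is not an input."""
    if (s, a) == ("c1", "right"):
        return {-1.0: 1.0}
    return {0.0: 1.0}


def reward_dist_nonstationary(s, a, t):
    """p_t(r | s, a): cell c2 is forbidden only for 10 <= t < 20."""
    forbidden_now = FORBIDDEN_FROM <= t < FORBIDDEN_UNTIL
    if (s, a) == ("c1", "right") and forbidden_now:
        return {-1.0: 1.0}
    return {0.0: 1.0}


for t in (0, 10, 20):
    print(t, reward_dist_stationary("c1", "right"),
          reward_dist_nonstationary("c1", "right", t))
```

The same state-action pair yields different reward distributions at different time steps only in the nonstationary version.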
