【RL】CH1-Basic Concepts
1.7 Markov decision processes
This section presents these concepts in a more formal way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.
Sets:
- State space: the set of all states, denoted as $\mathcal{S}$.
- Action space: a set of actions, denoted as $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$.
- Reward set: a set of rewards, denoted as $\mathcal{R}(s, a)$, associated with each state-action pair $(s, a)$.
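To make these three sets concrete, here is a minimal Python sketch for a hypothetical two-state problem; the state names s1, s2 and the action names stay, move are invented for illustration and do not come from the text.

```python
# A minimal, hypothetical MDP specification with two states.
# S is the state space, A maps each state s to its action set A(s),
# and R maps each state-action pair (s, a) to its reward set R(s, a).
S = {"s1", "s2"}
A = {"s1": {"stay", "move"}, "s2": {"stay", "move"}}
R = {
    ("s1", "stay"): {0.0},
    ("s1", "move"): {-1.0, 1.0},
    ("s2", "stay"): {0.0},
    ("s2", "move"): {-1.0, 1.0},
}
```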
Model:
- State transition probability: At state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$. It holds that $\sum_{s' \in \mathcal{S}} p(s'|s,a) = 1$ for any $(s,a)$.
- Reward probability: At state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r|s,a)$. It holds that $\sum_{r \in \mathcal{R}(s,a)} p(r|s,a) = 1$ for any $(s,a)$.
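Continuing the same hypothetical example, the model can be stored as nested dictionaries keyed by state and action, and the two sum-to-one constraints can be checked numerically. All probabilities below are invented for the sketch.

```python
import math

# State transition probabilities p(s' | s, a), stored as
# P[s][a][s_next]; the numbers are invented for illustration.
P = {
    "s1": {"stay": {"s1": 1.0},
           "move": {"s1": 0.2, "s2": 0.8}},
    "s2": {"stay": {"s2": 1.0},
           "move": {"s1": 0.8, "s2": 0.2}},
}

# Reward probabilities p(r | s, a), stored as Pr[s][a][r].
Pr = {
    "s1": {"stay": {0.0: 1.0},
           "move": {-1.0: 0.5, 1.0: 0.5}},
    "s2": {"stay": {0.0: 1.0},
           "move": {-1.0: 0.5, 1.0: 0.5}},
}

# Each conditional distribution must sum to one over its support.
for table in (P, Pr):
    for s, actions in table.items():
        for a, dist in actions.items():
            assert math.isclose(sum(dist.values()), 1.0), (s, a)
```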
Policy: At state $s$, the probability of choosing action $a$ is $\pi(a|s)$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a|s) = 1$ for any $s \in \mathcal{S}$.
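A stochastic policy can likewise be tabulated and sampled from. Again, the numbers are invented, and sample_action is just a hypothetical helper.

```python
import random

# A stochastic policy pi(a | s): for each state, a distribution
# over the available actions (numbers invented for illustration).
pi = {
    "s1": {"stay": 0.3, "move": 0.7},
    "s2": {"stay": 0.6, "move": 0.4},
}

def sample_action(s):
    """Draw one action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]
```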
Markov property: The Markov property refers to the memoryless property of a stochastic process. Mathematically, it means that

$$
\begin{aligned}
p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) &= p(s_{t+1} \mid s_t, a_t), \\
p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) &= p(r_{t+1} \mid s_t, a_t),
\end{aligned} \tag{1.4}
$$

where $t$ represents the current time step and $t+1$ represents the next time step. Equation (1.4) indicates that the next state or reward depends merely on the current state and action and is independent of the previous ones. The Markov property is important for deriving the fundamental Bellman equation of MDPs, as shown in the next chapter.
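The memoryless property is visible when an MDP is simulated: each step of a rollout draws the next state and reward from distributions conditioned only on the current state-action pair, never on the earlier history. The sketch below reuses the hypothetical tables P, Pr, pi and the helper sample_action from above.

```python
import random

def sample_from(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs)[0]

def rollout(s, steps=5):
    """Simulate a trajectory. Each step conditions only on the
    current (s, a) pair, which is exactly the Markov property."""
    trajectory = []
    for _ in range(steps):
        a = sample_action(s)           # a  ~ pi(. | s)
        s_next = sample_from(P[s][a])  # s' ~ p(. | s, a)
        r = sample_from(Pr[s][a])      # r  ~ p(. | s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```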
The distributions $p(s'|s,a)$ and $p(r|s,a)$ for all $(s,a)$ are called the model or dynamics. The model can be either stationary or nonstationary (in other words, time-invariant or time-variant). A stationary model does not change over time, whereas a nonstationary model may vary over time. For instance, in the grid world example, if a forbidden area may pop up or disappear over time, the model is nonstationary. In this book, we only consider stationary models.
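One way to picture the distinction is that a nonstationary model takes the time step as an extra argument. The sketch below, which reuses the hypothetical table P from above, mimics the forbidden-area example by blocking movement at even time steps; it is purely illustrative.

```python
def p_stationary(s_next, s, a):
    """Stationary model: the same transition table at every time step."""
    return P[s][a].get(s_next, 0.0)

def p_nonstationary(s_next, s, a, t):
    """Nonstationary model: a hypothetical forbidden area blocks 'move'
    at even time steps, so the agent stays put instead."""
    if a == "move" and t % 2 == 0:
        return 1.0 if s_next == s else 0.0
    return P[s][a].get(s_next, 0.0)
```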