Game-Theoretic Algorithms: the CFR Algorithm
Extensive-Form Games and the CFR Algorithm
Algorithm | Robust sampling variant | Neural network variant | Regret | Regret matching | Strategy update | Convergence speed | Solution concept | Venue | Year |
CFR:Regret Minimization in Games with Incomplete Information (neurips.cc) | NE | NIPS | 2007 | ||||||
MCCFR:OS-MCCFR/ES-MCCFR: Monte Carlo Sampling for Regret Minimization in Extensive Games (neurips.cc) | √ | NE | NIPS | 2009 | |||||
CFR+:Heads-up limit hold’em poker is solved Science | √ | NE | Science | 2015 | |||||
CFVnets DeepStack: Expert-level artificial intelligence in heads-up no-limit poker | √ | √ | NE | Science | 2017 | ||||
VR-MCCFR:Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games Using Baselines | √ | NE | AAAI | 2019 | |||||
LCFR: Solving Imperfect-Information Games via Discounted Regret Minimization | √ | NE | AAAI | 2019 | |||||
Deep CFR: Deep Counterfactual Regret Minimization (mlr.press) | √ | NE | ICML (PMLR) | 2019 | |||||
CFR-S :Learning to Correlate in Multi-Player General-Sum Sequential Games | √ | CCE | NIPS | 2019 | |||||
CFR-Jr :Learning to Correlate in Multi-Player General-Sum Sequential Games | √ | CCE | NIPS | 2019 |
Topic | Paper | Institution | Venue | Year |
No-limit Texas hold'em | DeepStack: Expert-level artificial intelligence in heads-up no-limit poker | Alberta | Science | 2017 |
No-limit Texas hold'em | Libratus: Superhuman AI for heads-up no-limit poker: Libratus beats top professionals | CMU | Science | 2018 |
Six-player no-limit Texas hold'em | Pluribus: AI surpasses humans at six-player poker | CMU | Science | 2019 |
Main content | Paper | Code | Venue | Year |
Uses a counterfactual baseline to address the credit assignment problem | COMA: Counterfactual Multi-Agent Policy Gradients | | AAAI | 2018 |
Incorporates regret into the actor-critic gradient update, giving the model-free multi-agent reinforcement learning method regret policy gradient (RPG) | RPG: Actor-Critic Policy Optimization in Partially Observable Multiagent Environments | | NIPS | 2018 |
| AlphaHoldem: High-Performance Artificial Intelligence for Heads-Up No-Limit Poker via End-to-End Reinforcement Learning | | AAAI | 2022 |
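Learning, regret minimization, and equilibria [J]: lecture notes from CMU
Time and Space: Why Imperfect Information... | ERA (ualberta.ca): PhD thesis of a DeepStack author, 2017
Equilibrium Finding for Large Adversarial Imperfect-Information Games: Noam Brown's PhD thesis, 2020
Research on Heads-Up No-Limit Texas Hold'em Strategies Based on State Abstraction and Endgame Solving - CNKI (cnki.net): master's thesis, Harbin Institute of Technology (Shenzhen Graduate School), 2017
Research on Imperfect-Information Game Strategies Based on Real-Time Heuristic Search - CNKI (cnki.net): master's thesis, Harbin Institute of Technology (Shenzhen Graduate School), 2018
[An Introduction to Counterfactual Regret Minimization](cfr.pdf (gettysburg.edu)) (local copy)
A Brief Look at the Core Algorithm of Texas Hold'em AI: CFR - Juejin (juejin.cn)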
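Representation: the game tree
The figure shows two players playing a two-stage game.
- Root node: the starting point of the game, where a player makes a decision. How the game starts and the order of play can follow a predetermined order or be decided by rolling a die, tossing a coin, and so on.
- Non-leaf nodes: decision nodes, indicating which player makes the decision at that point.
- Leaf nodes: give each player's payoff at that point. Payoffs exist only at leaf nodes.
Dashed box: an information set. The actions available are identical at every node of the same information set.
Information set
The figure above shows a two-player single-act game. Player P1 has two available actions and player P2 has three. P1 acts first and P2 acts second. Payoffs are zero-sum: each value is P2's payoff, which is also P1's loss.
Panels (a) and (b) differ only in their information sets. In panel (a), P2's two decision nodes lie in the same dashed box, so P2 does not know which action P1 chose or its result when deciding; P2 gains no extra information. In panel (b), P2's decision nodes lie in different dashed boxes, so P2 has observed which action P1 chose. The path from the root to the current decision node is known to P2, and P2 has perfect information.
Panel (b) clearly shows that player 1 moves first and player 2 observes player 1's action. Not every game works this way, however: as in panel (a), a player cannot always observe the other player's choice (for example, when moves are simultaneous or hidden).
A perfect-information game is one in which, at every stage, each player knows all actions taken earlier in the game; that is, every information set is a singleton. A game without perfect information is an imperfect-information game.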
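The singleton criterion can be checked mechanically. The following is a minimal sketch, assuming a game is described only by its list of information sets; the node labels (`root`, `after_L`, `after_R`) are hypothetical stand-ins for the two panels discussed above.

```python
def is_perfect_information(information_sets):
    """Return True when every information set contains exactly one node."""
    return all(len(nodes) == 1 for nodes in information_sets)

# Hypothetical labels: P1 moves at the root, P2 moves after P1 chose L or R.
game_a = [{"root"}, {"after_L", "after_R"}]    # P2 cannot tell L from R
game_b = [{"root"}, {"after_L"}, {"after_R"}]  # P2 observes P1's move

print(is_perfect_information(game_a))
print(is_perfect_information(game_b))
```

Only the second description passes the check, matching the distinction between the two panels.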

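Two-player zero-sum single-act, two-player nonzero-sum single-act, multi-player nonzero-sum single-act
Finite: finite in time, with terminal nodes
1 Feedback
The game shown in Fig. 2.8 above is a two-player zero-sum feedback game; it must satisfy two conditions:
Method for finding the saddle point
Start from the final leaf nodes, that is, the last level, and solve for the saddle-point equilibrium strategy of that single-act interaction to obtain its value. Then prune that stage: cut the sub-game off and replace it in the game tree with the computed equilibrium value.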
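The pruning procedure above can be sketched as a short recursion. This is an illustration under strong assumptions, not the textbook's algorithm: every stage is taken to be a matrix game in which P1 (rows) minimises and P2 (columns) maximises, and each stage is assumed to have a saddle point in pure strategies. The nested-list encoding and the numbers are hypothetical.

```python
def pure_saddle_value(matrix):
    """Value of a matrix game if a pure-strategy saddle point exists, else None."""
    upper = min(max(row) for row in matrix)        # P1's security level
    lower = max(min(col) for col in zip(*matrix))  # P2's security level
    return upper if upper == lower else None

def backward_value(node):
    """Solve the last stage first, then replace it by its saddle-point value."""
    if isinstance(node, (int, float)):
        return node                                # terminal cost
    values = [[backward_value(child) for child in row] for row in node]
    value = pure_saddle_value(values)
    if value is None:
        raise ValueError("no pure saddle point at this stage; mixed strategies needed")
    return value

# Hypothetical two-stage game: each first-stage cell leads to a 2x2 last-stage game.
game = [
    [[[1, 3], [0, 2]], [[4, 5], [6, 7]]],
    [[[0, 1], [-1, 0]], [[3, 4], [1, 2]]],
]
print(backward_value(game))
```

When a stage has no pure saddle point, the stage game would have to be solved in mixed strategies (for example by linear programming), which this sketch does not attempt.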
2 Open-loop
An extensive-form game in which each player has only one information set at every stage is called an open-loop extensive-form game.
N-person games
Definition 3.10: extensive form tree structure
Definition 3.10 An extensive form of an $N$-person nonzero-sum finite game without chance moves is a tree structure with
- a specific vertex indicating the starting point of the game,
- $N$ cost functions, each one assigning a real number to each terminal vertex of the tree, where the $i$th cost function determines the loss to be incurred to P$i$,
- a partition of the nodes of the tree into $N$ player sets,
- a subpartition of each player set into information sets $\{\eta_j^i\}$, such that the same number of branches emanates from every node belonging to the same information set and no node follows another node in the same information set.
Figure 3.1
Two typical nonzero-sum finite games in extensive form are depicted in Fig. 3.1.
The first one represents a 3-player single-act nonzero-sum finite game in extensive form in which the information sets of the players are such that both P2 and P3 have access to the action of P1.
The second extensive form of Fig. 3.1, on the other hand, represents a 2-player multi-act nonzero-sum finite game in which P1 acts twice and P2 only once.
In both extensive forms, the set of alternatives for each player is the same at all information sets and it consists of two elements.
The outcome corresponding to each possible path is denoted by an ordered $N$-tuple of numbers $(a^1, \dots, a^N)$, where $N$ stands for the number of players and $a^i$ stands for the corresponding cost to P$i$.
Figures 3.2 and 3.3
Definition 3.14: informationally inferior
Definition 3.14 Let I and II be two single-act $N$-person games in extensive form, and further let $\Gamma_{I}^{i}$ and $\Gamma_{II}^{i}$ denote the strategy sets of P$i$ in I and II, respectively. Then, I is said to be informationally inferior to II if $\Gamma_{I}^{i} \subseteq \Gamma_{II}^{i}$ for all $i$, with strict inclusion for at least one $i$.
Proposition 3.7 Let I be an $N$-person single-act game that is informationally inferior to some other single-act $N$-person game, say II. Then
- any Nash equilibrium solution of I also constitutes a Nash equilibrium solution for II,
- if $\{\gamma^1, \dots, \gamma^N\}$ is a Nash equilibrium solution of II so that $\gamma^i \in \Gamma_{I}^{i}$ for all $i$, then it also constitutes a Nash equilibrium solution for I.
Definition 3.15: nested / ladder-nested
Definition 3.15
In an extensive form of a single-act nonzero-sum finite game with a fixed order of play, a player is said to be a precedent of another player if the former is situated closer to the vertex of the tree than the latter.
The extensive form is said to be nested if each player has access to the information acquired by all his precedents. If, furthermore, the only difference (if any) between the information available to a player (P$i$) and his closest (immediate) precedent (say P$(i-1)$) involves only the actions of P$(i-1)$, and only at those nodes corresponding to the branches of the tree emanating from singleton information sets of P$(i-1)$, and this is so for all players, the extensive form is said to be ladder-nested.
A single-act nonzero-sum finite game is said to be nested (respectively, ladder-nested) if it admits an extensive form that is nested (respectively, ladder-nested).
Footnote 17: Note that in 2-person single-act games the concepts of "nestedness" and "ladder-nestedness" coincide, and every extensive form is, by definition, ladder-nested.
Remark 3.7
The single act extensive forms of Figs. 3.1(a) and 3.2 are both ladder-nested.
If the extensive form of Fig. 3.1(a) is modified so that both nodes of P2 are included in the same information set, then it is only nested, but not ladder-nested, since P3 can differentiate between different actions of P1 but P2 cannot.
Finally, if the extensive form of Fig.3.1(a) is modified so that this time P3 has a single information set (see Fig. 3.4(a)), then the resulting extensive form becomes non-nested, since then even though P2 is a precedent of P3 he actually knows more than P3 does.
The single-act game that this extensive form describes is, however, nested since it also admits the extensive form description depicted in Fig. 3.4(b).
One advantage of dealing with ladder-nested extensive forms is that they can recursively be decomposed into simpler tree structures which are basically static in nature. This enables one to obtain a class of Nash equilibria of such games recursively, by solving static games at each step of the recursive procedure.
Before providing the details of this recursive procedure, let us introduce some terminology.
Sub-extensive form: Definition 3.16
Definition 3.16
For a given single-act dynamic game in nested extensive form (say, I), let $\eta$ denote a singleton information set of P$i$'s immediate follower (say P$j$); consider the part of the tree structure of I, which is cut off at $\eta$, has $\eta$ as its vertex and has as immediate branches only those that enter into that information set of P$j$.
Then, this tree structure is called a sub-extensive form of I. (Here, we adopt the convention that the starting vertex of the original extensive form is the singleton information set of the first-acting player.)
Decomposition 1: breaking a ladder-nested form into static sub-forms
Definition 3.17
A sub-extensive form of a nested extensive form of a single-act game is static if every player appearing in this tree structure has a single information set.
Remark 3.8 The single-act game depicted in Fig. 3.2 admits two sub-extensive forms which are as follows:
The extensive form of Fig. 3.1(a), on the other hand, admits a total of four sub-extensive forms which we do not display here. It should be noted that each sub-extensive form is itself an extensive form describing a simpler game. The first one displayed above describes a degenerate 2-player game in which P1 has only one alternative. The second one again describes a 2-player game in which the players each have two alternatives. Both of these sub-extensive forms will be called static since the first one is basically a one-player game and the second one describes a static 2-player game.
Decomposition 2: breaking a nested form into dynamic sub-forms (Definition 3.19)
Definition 3.19 A nested extensive (or sub-extensive) form of a single-act game is said to be undecomposable if it does not admit any simpler sub-extensive form. It is said to be dynamic, if at least one of the players has more than one information set.
Definition 3.21: feedback games
Definition 3.21 A multi-act $N$-person nonzero-sum game in extensive form with a fixed order of play is called an $N$-person nonzero-sum feedback game in extensive form, if
- at the time of his act, each player has perfect information concerning the current level of play, i.e., no information set contains nodes of the tree belonging to different levels of play,
- information sets of the first-acting player at every level of play are singletons, and the information sets of the other players at every level of play are such that none of them includes nodes corresponding to branches emanating from two or more different information sets of the first-acting player, i.e., each player knows the state of the game at every level of play.
If, furthermore,
- the single-act games corresponding to the information sets of the first-acting player at each level of play are of the ladder-nested (respectively, nested) type (cf. Def. 3.15), then the multi-act game is called an $N$-person nonzero-sum feedback game in ladder-nested (respectively, nested) extensive form.
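Extensive Form Games
An extensive form game is a tuple $\langle H, Z, P, p, u, \mathcal{I}, \sigma_c \rangle$.
$H$ is a set of states, including $\varnothing$, the initial state of the game. A state can also be called a history, because a game state is exactly the history of the sequence of actions taken from the initial state. I will use $\cdot$ to indicate concatenation, so $h \cdot a$ is the sequence of actions in $h$, followed by action $a$.
Every sequence $h \in H$ records what happened in one play of the game: it is obtained by listing all actions taken so far in chronological order. In game-tree terms, $h$ is the path from the root to some node of the tree. On this basis we write $h \sqsubseteq j$ when $h$ is a prefix of $j$, and $h \sqsubset j$ when $h$ is a proper prefix of $j$; in the game tree, the latter means that $j$ is a descendant of $h$.
$Z \subset H$ is the set of terminal (leaf) states, and $u_p : Z \to \mathbb{R}$ gives the payoff to player $p$ if the game ends at state $z$.
$Z$: the set of terminal states (the leaf nodes of the game tree).
$P$ is a set of all players acting in the game,
and $p : H \setminus Z \to P$ is a function which describes which player is currently acting at a non-terminal state $h$.
The actions available at a non-terminal state $h$ are defined implicitly by the set of histories $H$, so that $A(h) = \{a \mid h \cdot a = j,\ h \in H,\ j \in H\}$.
Note: $h \cdot a$ denotes the history $h$ followed by action $a$.
For a history $h$ where a stochastic event is about to happen, like a die roll or card deal, $p(h)$ is a special "chance player" $c$. The value $\sigma_c(h, a)$ gives the probability of chance event $a$ occurring if the game is in state $h$.
$\sigma_c$: the probability distribution with which the chance player (think of the dealer) takes each legal action; $\sigma_c(h, a)$ is the probability that chance event $a$ occurs when the game is in state $h$.
$\mathcal{I}$ describes what information is hidden from the players in a game, defined by a partition of all non-terminal, non-chance states.
$\mathcal{I}$ must satisfy the constraints that $\forall I \in \mathcal{I}$ and $\forall h, j \in I$, we have $p(h) = p(j)$ and $A(h) = A(j)$.
A set $I \in \mathcal{I}$ is called an information set, and for every state $h \in I$ the player $p(h)$ will only know that the current state of the game is one of the states in $I$, but not exactly which one.
A player's strategy, also known as their policy, determines how they choose actions at states where they are acting. The term strategy profile is used to refer to a tuple consisting of a strategy for each player.
I will use $\sigma_p$ to refer to a strategy for player $p$, and $\sigma$ to refer to a strategy profile.
Given a strategy profile $\sigma$ and some player $p$ strategy $\sigma_p'$, I will use the tuple $\langle \sigma_{-p}, \sigma_p' \rangle$ to refer to the strategy profile where player $p$'s strategy is $\sigma_p'$, and their opponent plays according to their strategy in $\sigma$.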
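Strategy Probabilities
$\sigma_p(I, a)$ gives the probability of player $p$ making action $a$ given they have reached information set $I \in \mathcal{I}_p$, and $\sigma(I, a) = \sigma_{p(I)}(I, a)$. I will use the vector $\sigma(I)$ to speak of the probability distribution $\sigma(I, a)$ over $A(I)$.
$\sigma_p(I, a)$: the probability that player $p$ chooses action $a$ at information set $I$.
$\sigma(I)$: a probability distribution over all legal actions at information set $I$ (combining $\sigma(I)$ over all information sets in the game gives the complete strategy).
$\pi^{\sigma}(h)$ gives the probability of reaching state $h$ if all players follow profile $\sigma$.
$\pi_p^{\sigma}(h)$ gives the probability of reaching state $h$ if chance and $p$'s opponents make the actions to reach $h$, and player $p$ acts according to $\sigma$.
We can also extend the notation by flipping which players are following $\sigma$ and use $\pi_{-p}^{\sigma}(h)$ to refer to the probability of reaching state $h$ if player $p$ makes the actions to reach $h$, and chance and $p$'s opponents act according to $\sigma$.
All of these probabilities can also be extended to consider subsequences of actions. We can speak of $\pi_p^{\sigma}(z \mid h)$ as the probability of player $p$ making the actions needed to move from state $h$ to state $z$.
1. $ \pi^{\sigma}(h)=\pi_{p}^{\sigma}(h)\, \pi_{-p}^{\sigma}(h) $: the probability of reaching state $h$ when all players follow $\sigma$ equals the product of player $p$'s contribution and the contribution of chance and $p$'s opponents;
2. $ \pi_{p}^{\sigma}(I, a)=\pi_{p}^{\sigma}(I)\, \sigma(I, a) $: a product of two probabilities,
where $\pi_p^{\sigma}(I)$ on the right is the probability that player $p$, starting from the initial state and following $\sigma$, reaches information set $I$, and $\sigma(I, a)$ is the probability of taking action $a$ at information set $I$;
3. $ \pi^{\sigma}(z)=\pi^{\sigma}(h)\, \pi^{\sigma}(z \mid h) $: this follows by the same reasoning as equation 2. The probability of reaching state $z$ from the initial state under $\sigma$ equals the probability of reaching state $h$ (a prefix of $z$) from the initial state under $\sigma$, multiplied by the probability of moving from $h$ to $z$ under $\sigma$.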
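The factorisation in identity 1 can be checked numerically. The sketch below assumes a history is stored as a list of `(actor, information_set, action)` triples and that the chance player is labelled `"c"`; the information-set names and probabilities are made up for illustration.

```python
def reach(history, sigma, include):
    """Multiply the action probabilities of the actors selected by `include`."""
    prob = 1.0
    for actor, info_set, action in history:
        if include(actor):
            prob *= sigma[(info_set, action)]
    return prob

# Hypothetical strategy profile (chance included) on a tiny betting game.
sigma = {
    ("deal", "K"): 0.5, ("deal", "Q"): 0.5,
    ("I1", "bet"): 0.7, ("I1", "check"): 0.3,
    ("I2", "call"): 0.4, ("I2", "fold"): 0.6,
}
h = [("c", "deal", "K"), (1, "I1", "bet"), (2, "I2", "call")]

pi_all = reach(h, sigma, lambda actor: True)           # everyone follows sigma
pi_p1 = reach(h, sigma, lambda actor: actor == 1)      # player 1's contribution
pi_minus_p1 = reach(h, sigma, lambda actor: actor != 1)  # chance and opponents

print(pi_all, pi_p1 * pi_minus_p1)
```

The two printed numbers agree up to floating-point rounding, which is exactly what identity 1 states.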
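Strategy Values
The expected value for player $p$ of state $h$ under profile $\sigma$ sums over the terminal states $z$ that can be reached from $h$:
$ u_{p}^{\sigma}(h)=\sum_{z \in Z,\ h \sqsubset z} \pi^{\sigma}(z)\, u_{p}(z) $
In this expression, $u_p(z)$ is, by the earlier definition, the payoff player $p$ receives on reaching terminal state (leaf node) $z$.
The value is the payoff player $p$ obtains along the paths that run from the start of the game through the intermediate state $h$ and then, following the strategy, to a terminal state.
The first factor on the right-hand side can be split using probability identity 1, giving
$ u_{p}^{\sigma}(h)=\sum_{z \in Z,\ h \sqsubset z} \pi_{p}^{\sigma}(z)\, \pi_{-p}^{\sigma}(z)\, u_{p}(z) $
By this definition, the value of the whole game is the value at the root of the game tree: $ u_{p}^{\sigma}=u_{p}^{\sigma}(\varnothing) $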
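The counterfactual value
Counterfactual value differs from the standard expected value in that it does not consider the player's own probability of reaching the state (thus the use of counterfactual in the name: instead of following $\sigma$, what if the player had instead played to reach here?)
The counterfactual value for player $p$ of state $h$ is
$ v_{p}^{\sigma}(h)=\sum_{z \in Z,\ h \sqsubset z} \pi_{-p}^{\sigma}(h)\, \pi^{\sigma}(z \mid h)\, u_{p}(z) $
The first factor on the right, $\pi_{-p}^{\sigma}(h)$, is the probability that the other players (and chance), following the strategy, get from the start to the intermediate node $h$.
The second factor, $\pi^{\sigma}(z \mid h)$, is the probability that play passes through the intermediate node $h$ and then follows the strategy to the terminal node $z$.
The third factor, $u_p(z)$, is player $p$'s payoff at the terminal node; the sum runs over all paths that pass through $h$ and end at a terminal node.
To travel from the start of the game to a terminal node under the strategy, player $p$ would also have to follow the strategy from the start to the intermediate state $h$, but that factor, $\pi_p^{\sigma}(h)$, is deliberately missing from the expression above.
The counterfactual value therefore represents the expected payoff through $h$ if player $p$ had played to reach $h$ with probability 1 instead of following $\sigma$ on the way there.
A larger counterfactual value means that steering play toward $h$ would have been better for player $p$.
Likewise, extending the concept to information sets, the counterfactual value for player $p$ of an information set $I$ is
$ v_{p}^{\sigma}(I)=\sum_{h \in I} v_{p}^{\sigma}(h) $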
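A small numeric sketch may make the difference from the ordinary expected value concrete. It assumes the reach probabilities at $h$ and the continuation probabilities $\pi^{\sigma}(z \mid h)$ are already known; all numbers and function names are hypothetical.

```python
def counterfactual_value(pi_minus_p_h, continuations):
    """v_p(h): opponents' and chance's reach times the expected payoff below h."""
    return pi_minus_p_h * sum(prob * payoff for prob, payoff in continuations)

def expected_value(pi_h, continuations):
    """u_p(h): full reach probability times the expected payoff below h."""
    return pi_h * sum(prob * payoff for prob, payoff in continuations)

pi_p_h = 0.7         # player p's own probability of playing to h
pi_minus_p_h = 0.5   # probability that chance and opponents play to h
continuations = [(0.4, 2.0), (0.6, -1.0)]  # (pi(z | h), u_p(z)) per terminal z

v = counterfactual_value(pi_minus_p_h, continuations)
u = expected_value(pi_p_h * pi_minus_p_h, continuations)
print(v, u, pi_p_h * v)
```

Multiplying the counterfactual value by player $p$'s own reach probability recovers the expected value, i.e. $u_p^{\sigma}(h) = \pi_p^{\sigma}(h)\, v_p^{\sigma}(h)$.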
Best Response
Strategy Space
Behaviour Strategies
A behaviour strategy $\sigma_p$ directly specifies the probabilities $\sigma_p(I, a)$. When playing a game according to a behaviour strategy, at every information set $I$ the player samples from the distribution $\sigma_p(I)$ to get an action.
Similar to sequence form strategies, the size of a behaviour strategy is $\sum_{I \in \mathcal{I}_p} |A(I)|$, one probability per information set and action pair.
The first downside of behaviour strategies is that the expected value is not linear in the strategy probabilities, or even a convex-concave function.
The second problem is that computing the average of a behaviour strategy is not a matter of computing the average of the specified action probabilities: it is not entirely reasonable to give full weight to a strategy's action probabilities for an information set that is reached infrequently. Instead, to compute the average behaviour strategy we must weight $\sigma_p(I)$ by $\pi_p^{\sigma}(I)$, which is equivalent to averaging the equivalent sequence form strategies.
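The weighting can be illustrated at a single information set. This sketch assumes two strategies whose action distributions at $I$ and whose own reach probabilities $\pi_p^{\sigma}(I)$ are given; the numbers are invented, and the total reach is assumed to be positive.

```python
def average_behaviour(distributions, reaches):
    """Average action distributions at one information set, weighted by reach."""
    total = sum(reaches)  # assumed > 0 in this sketch
    n_actions = len(distributions[0])
    return [
        sum(r * dist[a] for dist, r in zip(distributions, reaches)) / total
        for a in range(n_actions)
    ]

# Strategy 1 rarely reaches I and always plays action 0 there;
# strategy 2 usually reaches I and always plays action 1 there.
distributions = [[1.0, 0.0], [0.0, 1.0]]
reaches = [0.1, 0.9]

naive = [sum(d[a] for d in distributions) / 2 for a in range(2)]
weighted = average_behaviour(distributions, reaches)
print(naive, weighted)
```

The naive average treats both strategies equally at $I$, while the reach-weighted average leans toward the strategy that actually arrives there.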
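Regret
Much of my work makes use of the mathematical concept of regret. In ordinary use, regret is a feeling that we would rather have done something else. The mathematical notion of regret tries to capture the colloquial sense of regret, and is a hindsight measurement of how much better we could have done compared to how we did do.
Assume we have a fixed set of actions $A$, a sequence of probability distributions $\sigma^t$ over $A$, and a sequence of value functions $v^t : A \to \mathbb{R}$.
Note: While $A$ is fixed, both $\sigma^t$ and $v^t$ are free to vary over time.
Note that $v^t$ here is a value over the action set. It differs both from the value defined at a state and from the counterfactual value defined at a state; it is closer to an immediate reward. The source's notation is somewhat confusing on this point, and reading $v$ as $r$ makes it easier to follow.
Given some set $S$ of alternative strategies, where $s \in S$ maps time $t \in [1 \ldots T]$ to a probability distribution $s(t)$ over $A$, our regret at time $T$ with respect to $S$ is
$ R_{S}^{T}=\max_{s \in S} \sum_{t=1}^{T}\left(s(t) \cdot v^{t}-\sigma^{t} \cdot v^{t}\right) $
The regret can then be written in terms of the choice regrets $R^{T}(s)=\sum_{t=1}^{T}\left(s(t) \cdot v^{t}-\sigma^{t} \cdot v^{t}\right)$ as
$ R_{S}^{T}=\max_{s \in S} R^{T}(s) $
What does $\sigma^t \cdot v^t$ mean? At the current time $t$, it is the probability distribution over the action set (the choice of actions, that is, the strategy; for rock-paper-scissors, $(1/3, 1/3, 1/3)$ and $(1/2, 1/2, 0)$ are two different strategies) multiplied by the rewards of the actions, which gives the expected reward of the strategy played at that time. Substituting another strategy (another distribution over the action set) gives a new expected reward. The gap between the two comes from the choice of strategy (in gambling, going all in versus spreading the bets), and that gap is the regret.
In order for regret to be meaningful, we need bounded values, so that there is some $L$ such that $ \left|v^{t}(a)-v^{t}(b)\right| \leq L $ $\forall t$, $\forall a, b \in A$.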
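External Regret
Using different sets $S$ gives different regret measures.
The most common, external regret, uses $S$ such that $\forall s \in S$, $s(t)$ is a probability distribution placing all weight on a single action $a \in A$, and $s(t) = s(t')$ $\forall t, t'$. That is, external regret considers how we would have done if we had always played a single, fixed action instead.
Regret minimisation algorithms use this particular kind of regret, external regret: the regret measured against replacing the current strategy with one specific action.
Interpretation: a strategy is a probability distribution over the action set. Suppose the original rock-paper-scissors strategy chooses actions with probabilities such as $(1/3, 1/3, 1/3)$.
The strategy could be changed in many ways, for example to $(1/2, 1/2, 0)$. External regret considers only changing it to a fixed action choice, that is, to a distribution such as $(1, 0, 0)$.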
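The rock-paper-scissors reading of external regret can be sketched in a few lines. The scenario is hypothetical: the opponent shows rock three rounds in a row, so the per-round values of (rock, paper, scissors) are $(0, 1, -1)$, while we play uniformly at random.

```python
def external_regret(strategies, values):
    """Best fixed action's total value minus the expected value actually earned."""
    n_actions = len(values[0])
    earned = sum(
        sum(p * v for p, v in zip(strategy, value))
        for strategy, value in zip(strategies, values)
    )
    best_fixed = max(sum(value[a] for value in values) for a in range(n_actions))
    return best_fixed - earned

uniform = [1 / 3, 1 / 3, 1 / 3]
strategies = [uniform, uniform, uniform]                  # what we played
values = [[0.0, 1.0, -1.0]] * 3                           # opponent always played rock

regret = external_regret(strategies, values)
print(regret, regret / len(values))  # total and average external regret
```

Always playing paper would have earned 1 per round, while the uniform strategy earns 0 in expectation, so the regret grows linearly here because the strategy never adapts.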
Online Learning and Regret Minimisation
Let us say we have a repeated, online, decision making problem with a fixed set of actions. At each time step, we must choose an action without knowing the values in advance. After making an action we receive some value (also known as reward), and also get to observe the values for all other actions. The values are arbitrary, not fixed or stochastic, and might be chosen in an adversarial fashion.
The adversary has the power to look at our decision-making rule for the current time step before we make the action, but if we have a randomised policy the adversary does not have the power to determine the private randomness used to sample from the distribution.
This setting is often called expert selection, where the actions can be thought of as experts, and we are trying to minimise loss rather than maximise value.
Given such a problem, with arbitrary adversarial values, regret is a natural measure of performance. Looking at our accumulated value by itself has little meaning, because it might be low, but could still be almost as high as it would have been with any other policy for selecting actions. So, we would like to have an algorithm for selecting actions that guarantees we have low regret.
Because of the bounds on values, we can never do worse than $TL$ regret. Cesa-Bianchi et al. give a lower bound: for any $\epsilon > 0$ and sufficiently large $T$ and $|A|$, any algorithm has at least $(1-\epsilon)\, L \sqrt{\frac{T}{2} \ln |A|}$ external regret in the worst case.
The average regret
We are often interested in the average behaviour, or the behaviour in the limit, and so we might consider dividing the total regret by $T$ to get the average regret.
If our total regret is sub-linear, average regret will approach $0$ as $T \to \infty$, and our average value (or reward) approaches the best-case value. Despite the arbitrary, potentially adversarial selection of values at each time step, there are multiple algorithms which guarantee sub-linear external regret.
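Regret-Matching Algorithm
Blackwell introduced the regret-matching algorithm, which can be used to solve the online learning problem discussed in Section 2.7.1.
Given the regrets $R^{T}(a)$ for each action, the regret-matching algorithm specifies a strategy
$ \sigma^{T+1}(a)=\frac{R^{T}(a)^{+}}{\sum_{b \in A} R^{T}(b)^{+}} $, where $x^{+}=\max(x, 0)$
That is, the current strategy is the normalised positive regrets.
In other words, it is the share of one particular action among all actions: the accumulated regret of the action is normalised into the probability of choosing that action now.
Rather than computing $R^{T}(a)$ from scratch, an implementation of regret-matching will usually maintain a single regret value for each action, and incrementally update the regrets after each observation. We can rewrite the action regrets as
$ R^{T}(a)=R^{T-1}(a)+\Delta R^{T}(a) $, where $ \Delta R^{t}(a)=v^{t}(a)-\sum_{b \in A} \sigma^{t}(b)\, v^{t}(b) $
Using $\Delta R^{t}$, which depends only on the newly observed values $v^{t}$, we can update the previous regrets $R^{t-1}$ to get $R^{t}$.
Computing the regret directly from the full sum would mean storing every past strategy and value, which naturally consumes a lot of memory.
An implementation therefore keeps one regret value per action and uses the iterative form: the newly observed $\Delta R^{t}$ added to the previous regret gives the action's current regret.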
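The two ingredients, normalising positive regrets and updating them incrementally, fit in a few lines. This is a minimal sketch; falling back to the uniform distribution when no regret is positive is a common convention assumed here, and the rock-paper-scissors values are illustrative.

```python
def regret_matching(regrets):
    """Strategy proportional to positive regrets (uniform if none is positive)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)  # assumed fallback convention

def update_regrets(regrets, strategy, values):
    """Add this round's regret: each action's value minus the expected value."""
    expected = sum(p * v for p, v in zip(strategy, values))
    return [r + v - expected for r, v in zip(regrets, values)]

regrets = [0.0, 0.0, 0.0]               # rock, paper, scissors
strategy = regret_matching(regrets)     # uniform at the start
values = [0.0, 1.0, -1.0]               # observed values when the opponent plays rock
regrets = update_regrets(regrets, strategy, values)
next_strategy = regret_matching(regrets)
print(regrets, next_strategy)
```

Only the running regret vector is stored between rounds; past strategies and values are never revisited.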
The CFR Algorithm
CFR is a self-play algorithm using regret minimisation. Zinkevich et al. introduced the idea of counterfactual value, a new utility function on states and information sets, and use this value to independently minimise regret at every information set. By using many small regret minimisation problems, CFR overcomes the prohibitive memory cost of directly using a regret minimisation algorithm over the space of pure strategies.
Counterfactual values, defined in Section 2, can be combined with any standard regret concept to get a counterfactual regret at an information set.
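CFR uses external regret, so that the counterfactual regret at time $T$ of an action $a$ at a player $p$ information set $I$ is
$ R^{T}(I, a)=\sum_{t=1}^{T}\left(v_{p}^{\sigma^{t}}(I \cdot a)-\sum_{b \in A(I)} \sigma^{t}(I, b)\, v_{p}^{\sigma^{t}}(I \cdot b)\right) $, where $v_{p}^{\sigma}(I \cdot a)=\sum_{h \in I} v_{p}^{\sigma}(h \cdot a)$
Using the counterfactual value defined earlier on information sets, together with the relation between the counterfactual value and the strategy value, $ u_{p}^{\sigma}(h)=\pi_{p}^{\sigma}(h)\, v_{p}^{\sigma}(h) $, we obtain this definition of regret based on counterfactual values.
Its meaning: still an open question for me.
Given the sequence of past strategies and values for $I$, any regret-minimising algorithm can be used to generate a new strategy $\sigma_p^{t}(I)$ over the actions $A(I)$, with a sub-linear bound on total external regret $R^{T}(I)=\max_{a \in A(I)} R^{T}(I, a)$. CFR uses the regret-matching algorithm.
Combining $\sigma_p^{t}(I)$ at each player $p$ information set gives us a complete behaviour strategy $\sigma_p^{t}$, and repeating the process for all players gives us a strategy profile $\sigma^{t}$.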
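Generate strategy profile $\sigma^{t}$ from the regrets, as described above.
For all $I \in \mathcal{I}$, and $a \in A(I)$: $ \sigma^{t}(I, a)=\frac{R^{t-1}(I, a)^{+}}{\sum_{b \in A(I)} R^{t-1}(I, b)^{+}} $, with $x^{+}=\max(x, 0)$ and, by the usual convention, the uniform distribution when the denominator is zero.
Update the average strategy profile to include the new strategy profile.
For all $I \in \mathcal{I}$, and $a \in A(I)$, with $p = p(I)$: $ \bar{\sigma}_{p}^{t}(I, a)=\frac{t-1}{t}\, \bar{\sigma}_{p}^{t-1}(I, a)+\frac{1}{t}\, \pi_{p}^{\sigma^{t}}(I)\, \sigma^{t}(I, a) $
Using the new strategy profile, compute counterfactual values.
For all $I \in \mathcal{I}$, and $a \in A(I)$, with $p = p(I)$: $ v_{p}^{\sigma^{t}}(I \cdot a)=\sum_{h \in I} v_{p}^{\sigma^{t}}(h \cdot a) $
Update the regrets using the new counterfactual values.
For all $I \in \mathcal{I}$, and $a \in A(I)$: $ R^{t}(I, a)=R^{t-1}(I, a)+v_{p}^{\sigma^{t}}(I \cdot a)-\sum_{b \in A(I)} \sigma^{t}(I, b)\, v_{p}^{\sigma^{t}}(I \cdot b) $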
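The four steps can be tried end to end on Kuhn poker (three cards, one bet). The following is a rough sketch rather than a reference implementation: the game choice, the `Node` class, the string encoding of histories (`p` for pass, `b` for bet) and the iteration count are all assumptions made for illustration. The uniform chance probability of each deal is a constant factor on every regret and is left out.

```python
import itertools

ACTIONS = "pb"  # p = pass (check or fold), b = bet (bet or call)

class Node:
    """Regret and average-strategy accumulators for one information set."""
    def __init__(self):
        self.regret = [0.0, 0.0]
        self.pending = [0.0, 0.0]       # regret changes from the current iteration
        self.strategy_sum = [0.0, 0.0]

    def strategy(self):
        positive = [max(r, 0.0) for r in self.regret]
        total = sum(positive)
        return [x / total for x in positive] if total > 0 else [0.5, 0.5]

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [x / total for x in self.strategy_sum] if total > 0 else [0.5, 0.5]

def terminal_payoff(cards, history):
    """Payoff to the player who would act next, or None if play continues."""
    player = len(history) % 2
    if history in ("pp", "bb", "pbb"):              # showdown
        stake = 1 if history == "pp" else 2
        return stake if cards[player] > cards[1 - player] else -stake
    if history in ("bp", "pbp"):                    # the other player just folded
        return 1
    return None

def cfr(nodes, cards, history, reach):
    """Return the value of `history` for the player to act; accumulate regrets."""
    payoff = terminal_payoff(cards, history)
    if payoff is not None:
        return payoff
    player = len(history) % 2
    node = nodes.setdefault(str(cards[player]) + history, Node())
    strategy = node.strategy()
    action_values = [0.0, 0.0]
    node_value = 0.0
    for i, action in enumerate(ACTIONS):
        next_reach = list(reach)
        next_reach[player] *= strategy[i]
        # zero-sum: the child's value is from the opponent's point of view
        action_values[i] = -cfr(nodes, cards, history + action, next_reach)
        node_value += strategy[i] * action_values[i]
    for i in range(2):
        # counterfactual regret is weighted by the opponent's reach probability
        node.pending[i] += reach[1 - player] * (action_values[i] - node_value)
        # the average strategy is weighted by the player's own reach probability
        node.strategy_sum[i] += reach[player] * strategy[i]
    return node_value

def train(iterations):
    nodes = {}
    deals = list(itertools.permutations(range(3), 2))
    total = 0.0
    for _ in range(iterations):
        for cards in deals:                          # full traversal, no sampling
            total += cfr(nodes, cards, "", [1.0, 1.0])
        for node in nodes.values():                  # apply regrets after the sweep
            for i in range(2):
                node.regret[i] += node.pending[i]
                node.pending[i] = 0.0
    return nodes, total / (iterations * len(deals))

nodes, value = train(5000)
print("average value for the first player:", round(value, 3))
for key in sorted(nodes):
    print(key, [round(x, 2) for x in nodes[key].average_strategy()])
```

Regret changes are buffered in `pending` so that every information set uses the same $\sigma^{t}$ throughout one sweep over the deals. The average value for the first player should approach the known game value of Kuhn poker, $-1/18$, and the average strategy (not the last iterate) is the one to read off as the approximate equilibrium.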


