Introduction to Artificial Intelligence — Key Knowledge

Finite-time Analysis of the Multiarmed Bandit Problem

Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy’s success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Translation:
Reinforcement learning policies face the exploration-versus-exploitation dilemma: they must balance exploring the environment to find profitable actions against taking the empirically best action as often as possible. A common measure of how well a policy handles this dilemma is the regret, i.e. the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem must grow at least logarithmically in the number of plays. Since then, Lai and Robbins and many others have devised policies that achieve this regret asymptotically. In this work the authors show that the optimal logarithmic regret can also be achieved uniformly over time, with simple and efficient policies, for all reward distributions with bounded support.

Author's note:
"Uniformly" means the policy is not overly biased toward exploration early on or toward exploitation early on.


Keywords

bandit problems, adaptive allocation rules, finite horizon regret

Translation: bandit problems, adaptive allocation rules, finite-horizon regret

Author's note:
An "adaptive allocation rule" is a strategy that adjusts how resources or actions are allocated according to the current situation and feedback. For the multi-armed bandit problem, an allocation rule that adjusts after each trial based on the observed reward is an adaptive allocation rule. Another example is network traffic control, where bandwidth is allocated dynamically according to the current load and latency.

"Finite-horizon regret" is the loss incurred within a finite number of plays because the optimal action was not always chosen. If a policy attains the optimal logarithmic regret with small finite-horizon regret, the learning algorithm approaches the optimal policy within finite time: individual steps may still incur some regret, but the overall performance is good.


Note

The exploration versus exploitation dilemma can be described as the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. The simplest instance of this dilemma is perhaps the
multi-armed bandit, a problem extensively studied in statistics (Berry & Fristedt, 1985) that has also turned out to be fundamental in different areas of artificial intelligence, such as reinforcement learning (Sutton & Barto, 1998) and evolutionary programming (Holland, 1992).
The exploration-versus-exploitation dilemma can be described as the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. Perhaps the simplest instance of this dilemma is the multi-armed bandit problem, which has been studied extensively in statistics (Berry & Fristedt, 1985) and has also turned out to be fundamental in several areas of artificial intelligence, such as reinforcement learning (Sutton & Barto, 1998) and evolutionary programming (Holland, 1992).

In the multi-armed bandit problem we face several gambling machines (arms), each with an unknown payoff distribution, and the goal is to maximize the total reward over a sequence of plays. An exploration strategy tries different machines to gather information about their payoff probabilities, while an exploitation strategy concentrates on the machines that have shown the highest rewards so far.

The challenge is to find the right balance between the two. Exploiting only the currently best-known machine may miss potentially better options; exploring new machines without exploiting the best-known one sacrifices immediate reward.

Many algorithms and strategies have been developed for this dilemma, including the ε-greedy method, upper confidence bound algorithms, and Thompson sampling. They aim to allocate plays intelligently, exploring uncertain options while exploiting the best action suggested by the available information.

In short, the exploration-versus-exploitation dilemma is a fundamental challenge in decision problems: finding the right balance between exploring the environment to discover profitable actions and exploiting the currently best-known action to maximize short- or long-term reward.

Author's note: nothing much to add here; just look up the three methods mentioned above.


In its most basic formulation, a K-armed bandit problem is defined by random variables \(X_{i,n}\) for \(1\le i\le K\) and \(n\ge 1\), where each \(i\) is the index of a gambling machine (i.e., the "arm" of a bandit). Successive plays of machine \(i\) yield rewards \(X_{i,1}, X_{i,2},\ldots\) which are independent and identically distributed according to an unknown law with unknown expectation \(\mu_i\). Independence also holds for rewards across machines; i.e., \(X_{i,s}\) and \(X_{j,t}\) are independent (and usually not identically distributed) for each \(1\le i<j\le K\) and each \(s,t\ge 1\).

In its most basic form, the K-armed bandit problem is defined by random variables \(X_{i,n}\) for \(1\le i\le K\) and \(n\ge 1\), where each \(i\) indexes a gambling machine (the "arm" of a bandit). Successive plays of machine \(i\) yield rewards \(X_{i,1}, X_{i,2},\ldots\), which are independent and identically distributed according to an unknown law with expectation \(\mu_i\). Rewards across machines are also independent: for each \(1\le i<j\le K\) and each \(s,t\ge 1\), \(X_{i,s}\) and \(X_{j,t}\) are independent (and usually not identically distributed).


A policy, or allocation strategy, A is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards. Let \(T_{i} (n)\) be the number of times machine \(i\) has been played by A during the first \(n\) plays. Then the regret of A after \(n\) plays is defined by:
A policy, or allocation strategy, A is an algorithm that chooses the next machine to play based on the sequence of past plays and the rewards obtained. Let \(T_i(n)\) be the number of times machine \(i\) has been played by A during the first \(n\) plays. Then the regret of A after \(n\) plays is defined by:

\[\mu^*n-\sum_{j=1}^K\mu_j\,\mathbb{E}[T_j(n)]\quad\mathrm{where~}\mu^*\overset{\mathrm{def}}{=}\max_{1\leq i\leq K}\mu_i \]


Thus the regret is the expected loss due to the fact that the policy does not always play the best machine. In their classical paper, Lai and Robbins (1985) found, for specific families of reward distributions (indexed by a single real parameter),
Translation: Thus the regret is the expected loss caused by the policy not always playing the best machine. In their classical paper, Lai and Robbins (1985) found, for specific families of reward distributions (indexed by a single real parameter), the following.
policies satisfying
$$\mathbb{E}[T_j(n)]\leq\left(\frac{1}{D(p_j\|p^*)}+o(1)\right)\ln n \qquad (1)$$
where \(o(1) → 0\) as \(n→∞\) and

\[D(p_j\|p^*)\stackrel{\mathrm{def}}{=}\int p_j\ln\frac{p_j}{p^*} \]

is the Kullback-Leibler divergence between the reward density \(p_j\) of any suboptimal machine \(j\) and the reward density \(p^*\) of the machine with highest reward expectation \(\mu^*\). Hence, under these policies the optimal machine is played exponentially more often than any other machine, at least asymptotically. Lai and Robbins also proved that this regret is the best possible.

Translation: Hence, under these policies the optimal machine is played exponentially more often than any other machine, at least asymptotically. Lai and Robbins also proved that this regret is the best possible.

Author's note:
The KL divergence (Kullback-Leibler divergence, also called information divergence or relative entropy) measures how similar two probability distributions are; it is widely used as a loss function in machine learning tasks such as clustering and parameter estimation. From a sampling point of view, the KL divergence describes the coding loss incurred when a distribution is used to approximate the true data distribution.
Suppose a random variable \(\xi\) has two probability distributions \(P\) and \(Q\). If \(\xi\) is discrete, the KL divergence from \(P\) to \(Q\) is
\[\mathbb{D}_{\mathrm{KL}}(P\|Q)=\sum_i P(i)\ln\frac{P(i)}{Q(i)}.\]
If \(\xi\) is continuous, the KL divergence from \(P\) to \(Q\) is
\[\mathbb{D}_{\mathrm{KL}}(P\|Q)=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx.\]
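As a small illustration of the definition above (not part of the paper), here is a Python sketch computing the discrete KL divergence for two Bernoulli reward distributions; the numbers are made up for the example.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(P||Q) = sum_i P(i) * ln(P(i)/Q(i)).

    Assumes p and q are probability vectors on the same finite support,
    with q(i) > 0 wherever p(i) > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with P(i) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Bernoulli reward distributions of a suboptimal arm p_j and the best arm p*:
p_j = [0.6, 0.4]      # P(reward = 0) = 0.6, P(reward = 1) = 0.4
p_star = [0.2, 0.8]   # P(reward = 0) = 0.2, P(reward = 1) = 0.8
print(kl_divergence(p_j, p_star))     # D(p_j || p*) > 0
```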


Namely, for any allocation strategy and for any suboptimal machine \(j\), \(\mathbb{E}[T_{j}(n)]\geq(\ln n)/D(p_{j}\|p^{*})\) asymptotically, provided that the reward distributions satisfy some mild assumptions.


These policies work by associating a quantity called upper confidence index to each machine.
The computation of this index is generally hard. In fact, it relies on the entire sequence
of rewards obtained so far from a given machine. Once the index for each machine is computed,
the policy uses it as an estimate for the corresponding reward expectation, picking
for the next play the machine with the current highest index. More recently, Agrawal (1995)
introduced a family of policies where the index can be expressed as simple function of
the total reward obtained so far from the machine. These policies are thus much easier to
compute than Lai and Robbins’, yet their regret retains the optimal logarithmic behavior
(though with a larger leading constant in some cases).

These policies work by associating a quantity called the upper confidence index with each machine. Computing this index is generally hard; in fact, it relies on the entire sequence of rewards obtained so far from a given machine. Once the index of each machine is computed, the policy uses it as an estimate of the corresponding reward expectation and picks, for the next play, the machine with the currently highest index.
More recently, Agrawal (1995) introduced a family of policies where the index can be expressed as a simple function of the total reward obtained so far from the machine. These policies are thus much easier to compute than Lai and Robbins', yet their regret still retains the optimal logarithmic behavior (though with a larger leading constant in some cases).


In this paper we strengthen previous results by showing policies that achieve logarithmic regret uniformly over time, rather than only asymptotically. Our policies are also simple to implement and computationally efficient. In Theorem 1 we show that a simple variant of Agrawal’s index-based policy has finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support (a regret with better constants is proven in Theorem 2 for a more complicated version of this policy). A similar result is shown in Theorem 3 for a variant of the well-known randomized ε-greedy heuristic. Finally, in Theorem 4 we show another index-based policy with logarithmically bounded finite-time regret for the natural case when the reward distributions are normally distributed with unknown means and variances.

In this paper the authors strengthen previous results by exhibiting policies that achieve logarithmic regret uniformly over time, rather than only asymptotically. The policies are also simple to implement and computationally efficient. Theorem 1 shows that a simple variant of Agrawal's index-based policy has finite-time regret bounded logarithmically for arbitrary sets of reward distributions with bounded support (a regret with better constants is proved in Theorem 2 for a more complicated version of this policy). Theorem 3 gives a similar result for a variant of the well-known randomized ε-greedy heuristic. Finally, Theorem 4 shows another index-based policy with logarithmically bounded finite-time regret for the natural case where the rewards are normally distributed with unknown means and variances.


Main results

Our first result shows that there exists an allocation strategy, UCB1, achieving logarithmic
regret uniformly over n and without any preliminary knowledge about the reward distributions
(apart from the fact that their support is in [0, 1]). The policy UCB1 (sketched in
figure 1) is derived from the index-based policy of Agrawal (1995). The index of this policy
is the sum of two terms. The first term is simply the current average reward. The second term
is related to the size (according to Chernoff-Hoeffding bounds, see Fact 1) of the one-sided
confidence interval for the average reward within which the true expected reward falls with
overwhelming probability.
Our first result shows that there is an allocation strategy, UCB1, that achieves logarithmic regret uniformly over \(n\) without any preliminary knowledge about the reward distributions (apart from their support being contained in [0, 1]). The policy UCB1 (sketched in figure 1) is derived from the index-based policy of Agrawal (1995). Its index is the sum of two terms: the first is simply the current average reward; the second is related to the size (according to the Chernoff-Hoeffding bounds, see Fact 1) of the one-sided confidence interval for the average reward within which the true expected reward falls with overwhelming probability.

Author's note: UCB stands for Upper Confidence Bound.
Figure 1: (pseudocode of policy UCB1; image not reproduced here)
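Since figure 1 is not reproduced here, the following is a minimal Python sketch of UCB1 as described in the text: play each machine once, then always play the machine maximizing the sample mean plus \(\sqrt{2\ln n/n_j}\). The Bernoulli arms and the helper `pull` are illustrative assumptions, not from the paper.

```python
import math
import random

def ucb1(pull, K, n_plays):
    """UCB1 sketch: play the arm maximizing  mean_j + sqrt(2 * ln(n) / n_j).

    `pull(i)` is assumed to return a reward in [0, 1] for arm i.
    """
    counts = [0] * K       # n_j: number of times arm j has been played
    sums = [0.0] * K       # cumulative reward of arm j
    for i in range(K):     # initialization: play each machine once
        sums[i] += pull(i)
        counts[i] += 1
    for n in range(K + 1, n_plays + 1):
        i = max(range(K),
                key=lambda j: sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]))
        sums[i] += pull(i)
        counts[i] += 1
    return counts

# Example with Bernoulli arms (the means are illustrative):
means = [0.3, 0.5, 0.8]
print(ucb1(lambda i: float(random.random() < means[i]), K=3, n_plays=10000))
# The arm with mean 0.8 should receive the overwhelming majority of plays.
```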


Theorem 1.
For all \(K > 1\), if policy UCB1 is run on \(K\) machines having arbitrary reward
distributions \(P_1, . . . , P_K\) with support in [0, 1], then its expected regret after any number \(n\) of plays is at most
\(\left[8\sum_{i:\mu_i<\mu^*}\left(\frac{\ln n}{\Delta_i}\right)\right]+\left(1+\frac{\pi^2}3\right)\left(\sum_{j=1}^K\Delta_j\right)\)
where \(μ_1, . . . , μ_K\) are the expected values of \(P_1, . . . , P_K\)

To prove Theorem 1 we show that, for any suboptimal machine \(j\) ,
\(\mathbb{E}[T_j(n)]\leq\frac8{\Delta_j^2}\ln n\quad(2)\)
plus a small constant. The leading constant \(8/\Delta_j^2\) is worse than the corresponding constant \(1/D(p_j\parallel p^*)\) in Lai and Robbins' result (1). In fact, one can show that \(D(p_j\parallel p^*)\geq 2\Delta_j^2\), where the constant 2 is the best possible.
Using a slightly more complicated policy, which we call UCB2 (see figure 2), we can bring the main constant of (2) arbitrarily close to \(1/(2\Delta_j^2)\). The policy UCB2 works as follows.
The plays are divided into epochs. In each new epoch a machine \(i\) is picked and then played \(\tau(r_i+1)-\tau(r_i)\) times, where \(\tau\) is an exponential function and \(r_i\) is the number of
epochs played by that machine so far. The machine picked in each new epoch is the one
maximizing \(\bar{x}_i+a_{n,r_i}\), where \(n\) is the current number of plays, \(\bar{x}_i\) is the current average reward for machine \(i\), and \(a_{n,r}=\sqrt{\frac{(1+\alpha)\ln(en/\tau(r))}{2\tau(r)}}\)
where
\(\tau(r)=\lceil(1+\alpha)^r\rceil\)
In the next result we state a bound on the regret of UCB2. The constant \(c_α\), here left unspecified,
is defined in (18) in the appendix, where the theorem is also proven.

Author's note: \(\Delta_{i}=\mu^{*}-\mu_{i}\).

Figure 2: (pseudocode of policy UCB2; image not reproduced here)
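Figure 2 is likewise not reproduced; below is a hedged Python sketch of the epoch-based UCB2 rule using the \(\tau(r)\) and \(a_{n,r}\) formulas quoted above. Details such as the initialization and the truncation at the horizon are my own assumptions.

```python
import math
import random

def tau(r, alpha):
    return math.ceil((1 + alpha) ** r)

def a_bound(n, r, alpha):
    # a_{n,r} = sqrt((1 + alpha) * ln(e * n / tau(r)) / (2 * tau(r)))
    return math.sqrt((1 + alpha) * math.log(math.e * n / tau(r, alpha)) / (2 * tau(r, alpha)))

def ucb2(pull, K, n_plays, alpha=0.1):
    """UCB2 sketch: plays are grouped in epochs of geometrically growing length tau(r)."""
    counts = [0] * K      # total plays of each arm
    sums = [0.0] * K
    epochs = [0] * K      # r_i: epochs completed by arm i
    n = 0
    for i in range(K):    # play each machine once to initialize the averages
        sums[i] += pull(i); counts[i] += 1; n += 1
    while n < n_plays:
        # Pick the arm maximizing the UCB2 index  mean_j + a_{n, r_j}.
        i = max(range(K),
                key=lambda j: sums[j] / counts[j] + a_bound(n, epochs[j], alpha))
        # Play it tau(r_i + 1) - tau(r_i) times (truncated at the horizon).
        for _ in range(tau(epochs[i] + 1, alpha) - tau(epochs[i], alpha)):
            if n >= n_plays:
                break
            sums[i] += pull(i); counts[i] += 1; n += 1
        epochs[i] += 1
    return counts

means = [0.3, 0.5, 0.8]
print(ucb2(lambda i: float(random.random() < means[i]), K=3, n_plays=10000))
```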


Theorem 2.
For all \(K > 1\), if policy UCB2 is run with input \(0 < α < 1\) on \(K\) machines
having arbitrary reward distributions \(P_1, . . . , P_K\) with support in [0, 1], then its expected
regret after any number
\(n\geq\max_{i:\mu_i<\mu^*}\frac1{2\Delta_i^2}\)
of plays is at most
\(\sum_{i:\mu_i<\mu^*}\left(\frac{(1+\alpha)(1+4\alpha)\ln\left(2e\Delta_i^2n\right)}{2\Delta_i}+\frac{c_\alpha}{\Delta_i}\right)\quad (4)\)
where \(μ_1, . . . , μ_K\) are the expected values of \(P_1, . . . , P_K\) .
Remark. By choosing α small, the constant of the leading term in the sum (4) gets arbitrarily
close to \(\frac1{2\Delta_i^2}\); however, \(c_\alpha\to\infty\) as \(\alpha\to 0\). The two terms in the sum can be
traded off by letting \(\alpha=\alpha_n\) be slowly decreasing with the number \(n\) of plays.


A simple and well-known policy for the bandit problem is the so-called \(ε\)-greedy rule
(see Sutton & Barto, 1998). This policy prescribes to play with probability \(1−\varepsilon\) the machine
with the highest average reward, and with probability \(ε\) a randomly chosen machine. Clearly,
the constant exploration probability ε causes a linear (rather than logarithmic) growth in
the regret. The obvious fix is to let ε go to zero with a certain rate, so that the exploration
probability decreases as our estimates for the reward expectations become more accurate.
It turns out that a rate of \(1/n\), where \(n\) is, as usual, the index of the current play, allows
to prove a logarithmic bound on the regret. The resulting policy, \(ε_n\)-GREEDY, is shown in
figure 3.

Author's note: in the classical \(\epsilon\)-greedy method \(\epsilon\) is fixed, which makes the regret grow linearly. Here we want \(\epsilon\) to decrease as the number of trials grows, i.e. to exploit more and more over time.


Theorem 3.
For all \(K > 1\) and for all reward distributions \(P_1, . . . , P_K\) with support in
[0, 1], if policy \(ε_n\)-GREEDY is run with input parameter
\(0<d\leq\min_{i:\mu_i<\mu^*}\Delta_i,\)
then the probability that after any number \(n ≥cK/d\) of plays \(ε_n\)-GREEDY chooses a suboptimal
machine \(j\) is at most
\(\begin{aligned} \frac{c}{d^2n}&+2\bigg(\frac c{d^2}\ln\frac{(n-1)d^2e^{1/2}}{cK}\bigg)\bigg(\frac{cK}{(n-1)d^2e^{1/2}}\bigg)^{c/(5d^2)} \\ &+\frac{4e}{d^2}\biggl(\frac{cK}{(n-1)d^2e^{1/2}}\biggr)^{c/2}. \end{aligned}\)

Remark. For \(c\) large enough (e.g. \(c > 5\)) the above bound is of order \(c/(d^2n)+o(1/n)\) for
\(n→∞\), as the second and third terms in the bound are \(O(1/n^{1+ε})\) for some \(ε > 0\) (recall
that \(0 < d < 1\)). Note also that this is a result stronger than those of Theorems 1 and 2, as
it establishes a bound on the instantaneous regret. However, unlike Theorems 1 and 2, here
we need to know a lower bound \(d\) on the difference between the reward expectations of the
best and the second best machine.
Translation: Note that its advantage over Theorems 1 and 2 is that it establishes a bound on the instantaneous regret. However, unlike Theorems 1 and 2, here we need to know a lower bound \(d\) on the difference between the reward expectations of the best and the second-best machine.
Author's note: in other words, the statement is no longer a rough "for all sufficiently large n", hence "instantaneous"; but a requirement is placed on \(d\).

Figure 3: (pseudocode of policy \(\varepsilon_n\)-GREEDY; image not reproduced here)
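Figure 3 (the \(\varepsilon_n\)-GREEDY pseudocode) is not reproduced here. The sketch below uses the commonly cited schedule \(\varepsilon_n=\min\{1,\,cK/(d^2 n)\}\), which matches the parameters \(c\) and \(d\) appearing in Theorem 3; treat the exact schedule as an assumption to be checked against the original figure.

```python
import random

def eps_n_greedy(pull, K, n_plays, c=5.0, d=0.1):
    """epsilon_n-GREEDY sketch with eps_n = min(1, c*K / (d^2 * n)).

    `d` should lower-bound the gap between the best and second-best expected
    rewards, as required by Theorem 3; `pull(i)` returns a reward in [0, 1].
    """
    counts = [0] * K
    sums = [0.0] * K
    for n in range(1, n_plays + 1):
        eps = min(1.0, c * K / (d * d * n))
        if random.random() < eps:
            i = random.randrange(K)                     # explore uniformly at random
        else:                                           # exploit the best-looking arm
            i = max(range(K),
                    key=lambda j: sums[j] / counts[j] if counts[j] else float("inf"))
        sums[i] += pull(i)
        counts[i] += 1
    return counts

means = [0.3, 0.5, 0.8]
print(eps_n_greedy(lambda i: float(random.random() < means[i]), K=3, n_plays=10000))
```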


Our last result concerns a special case, i.e. the bandit problem with normally distributed
rewards. Surprisingly, we could not find in the literature regret bounds (not even asymptotical)
for the case when both the mean and the variance of the reward distributions are
unknown. Here, we show that an index-based policy called UCB1-NORMAL, see figure 4,
achieves logarithmic regret uniformly over \(n\) without knowing means and variances of the
reward distributions. However, our proof is based on certain bounds on the tails of the \(\chi^2\)
and the Student distribution that we could only verify numerically. These bounds are stated
as Conjecture 1 and Conjecture 2 in the Appendix.
Translation: Our last result concerns a special case, the bandit problem with normally distributed rewards. Surprisingly, we could not find in the literature any regret bounds (not even asymptotic ones) for the case where both the mean and the variance of the reward distributions are unknown. Here we show that an index-based policy called UCB1-NORMAL (see figure 4) achieves logarithmic regret uniformly over \(n\) without knowing the means and variances of the reward distributions. However, the proof relies on certain bounds on the tails of the \(\chi^2\) and Student distributions that could only be verified numerically; these bounds are stated as Conjecture 1 and Conjecture 2 in the appendix.

Author's note: the Student t-distribution is used to estimate, from a small sample, the mean of a normally distributed population with unknown variance. For large samples, by the central limit theorem the sampling distribution of the mean \(\bar{x}\) is approximately normal regardless of the population distribution, and for larger \(n\) (\(n\ge 30\)) the sample standard deviation \(s\) is a good estimate of \(\sigma\). For small samples, if the population is (approximately) normal then the sample mean is still (approximately) normal, but the small-sample variance is not a good estimate of the population variance \(\sigma^2\); the t-distribution is then needed to account for the uncertainty in the variance.

Figure 4: (pseudocode of policy UCB1-NORMAL; image not reproduced here)


The choice of the index in UCB1-NORMAL is based, as for UCB1, on the size of the one-sided
confidence interval for the average reward within which the true expected reward falls
with overwhelming probability. In the case of UCB1, the reward distribution was unknown,
and we used Chernoff-Hoeffding bounds to compute the index. In this case we know that the distribution is normal, and for computing the index we use the sample variance as an
estimate of the unknown variance.

In UCB1-NORMAL the index is chosen, as in UCB1, according to the size of the one-sided confidence interval within which the true expected reward falls with overwhelming probability. For UCB1 the reward distribution was unknown and the Chernoff-Hoeffding bounds were used to compute the index; here the distribution is known to be normal, and the sample variance is used as an estimate of the unknown variance when computing the index.
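Figure 4 is not reproduced here. The following sketch implements an index of the kind just described — sample mean plus a confidence width built from the sample variance; the specific constants (the factor 16 and the \(\lceil 8\log n\rceil\) forced-sampling rule) are my recollection of UCB1-NORMAL and should be verified against the paper.

```python
import math
import random

def ucb1_normal(pull, K, n_plays):
    """UCB1-NORMAL sketch: the confidence width uses the sample variance.

    Index (recalled from figure 4; check the paper for the exact constants):
        mean_j + sqrt(16 * sample_var_j * ln(n - 1) / n_j),
    and any arm played fewer than ceil(8 * log n) times is played first.
    """
    counts = [0] * K
    sums = [0.0] * K
    sq_sums = [0.0] * K    # q_j: sum of squared rewards, used for the sample variance
    for n in range(1, n_plays + 1):
        threshold = max(2, math.ceil(8 * math.log(n))) if n > 1 else 1
        under = [j for j in range(K) if counts[j] < threshold]
        if under:
            i = under[0]                       # forced sampling of under-played arms
        else:
            def index(j):
                mean = sums[j] / counts[j]
                var = max(0.0, (sq_sums[j] - counts[j] * mean ** 2) / (counts[j] - 1))
                return mean + math.sqrt(16 * var * math.log(n - 1) / counts[j])
            i = max(range(K), key=index)
        x = pull(i)
        sums[i] += x
        sq_sums[i] += x * x
        counts[i] += 1
    return counts

# Normally distributed rewards with unknown means and variances (illustrative values):
params = [(0.0, 1.0), (0.5, 2.0), (1.0, 1.5)]
print(ucb1_normal(lambda i: random.gauss(*params[i]), K=3, n_plays=10000))
```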


Theorem 4.
For all \(K > 1\), if policy UCB1-NORMAL is run on K machines having normal
reward distributions \(P_1, . . . , P_K\) , then its expected regret after any number \(n\) of plays is at
most
\(256(\log n)\left(\sum_{i:\mu_i<\mu^*}\frac{\sigma_i^2}{\Delta_i}\right)+\left(1+\frac{\pi^2}{2}+8\log n\right)\left(\sum_{j=1}^K\Delta_j\right)\)

where \(\mu_1,\ldots,\mu_K\) and \(\sigma_1^2,\ldots,\sigma_K^2\) are the means and variances of the distributions \(P_1,\ldots,P_K\).

As a final remark for this section, note that Theorems 1–3 also hold for rewards that are not
independent across machines, i.e. \(X_{i,s}\) and \(X_{j,t}\) might be dependent for any \(s,t\), and \(i \neq j\).
Furthermore, we also do not need that the rewards of a single arm are i.i.d., but only the
weaker assumption that \(\mathbb{E}[X_{i,t}\mid X_{i,1},\ldots,X_{i,t-1}]=\mu_i\) for all \(1\le t\le n\).


Proofs

Recall that, for each \(1\le i\le K\), \(\mathbb{E}[X_{i,n}]=\mu_i\) for all \(n\ge 1\) and \(\mu^*=\max_{1\le i\le K}\mu_i\). Also,
for any fixed policy A, \(T_i(n)\) is the number of times machine \(i\) has been played by A in the
first \(n\) plays. Of course, we always have \(\sum_{i=1}^{K}T_i(n)=n\).
We also define the r.v.'s \(I_1, I_2,\ldots\),
where \(I_t\) denotes the machine played at time \(t\).

[r.v. is short for random variable]

For each \(1 ≤ i ≤ K\) and \(n ≥ 1\) define
\(\bar{X}_{i,n}=\frac1n\sum_{t=1}^nX_{i,t}.\)

Given \(\mu_1,\ldots,\mu_K\), we call optimal the machine with the least index \(i\) such that \(\mu_i=\mu^*\).
In what follows, we will always put a superscript "\(*\)" on any quantity which refers to the
optimal machine. For example we write \(T^*(n)\) and \(\bar{X}^*_n\) instead of \(T_i(n)\) and \(\bar{X}_{i,n}\), where \(i\) is
the index of the optimal machine.

Some further notation: for any predicate \(\Pi\) we define \(\{\Pi(x)\}\) to be the indicator function of the event \(\Pi(x)\); i.e., \(\{\Pi(x)\}=1\) if \(\Pi(x)\) is true and \(\{\Pi(x)\}=0\) otherwise. Finally,
\(\mathrm{Var}[X]\) denotes the variance of the random variable \(X\).
Note that the regret after n plays can be written as
\(\sum_{j:\mu_j<\mu^*}\Delta_j\boldsymbol{E}[T_j(n)]\quad\quad\quad\quad\quad\quad(5)\)

So we can bound the regret by simply bounding each \(\mathbb{E}[T_j(n)]\).

We will make use of the following standard exponential inequalities for bounded random variables (see, e.g., the appendix of Pollard, 1984).

Fact 1 (Chernoff-Hoeffding bound)

Let \(X_1,\ldots,X_n\) be random variables with common
range [0, 1] and such that \(\mathbb{E}[X_t\mid X_1,\ldots,X_{t-1}]=\mu\). Let \(S_n=X_1+\cdots+X_n\). Then for
all \(a\ge 0\)
\(P\{S_n\geq n\mu+a\}\leq e^{-2a^2/n}\quad\mathrm{~and~}\quad P\{S_n\leq n\mu-a\}\leq e^{-2a^2/n}\)
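As a quick numerical sanity check of Fact 1 (not from the paper), the snippet below estimates \(P\{S_n\ge n\mu+a\}\) by simulation for Bernoulli(\(\mu\)) variables and compares it with the bound \(e^{-2a^2/n}\); the parameter values are arbitrary.

```python
import math
import random

def hoeffding_check(n=100, mu=0.5, a=10.0, trials=20000):
    """Estimate P{S_n >= n*mu + a} for i.i.d. Bernoulli(mu) and compare with exp(-2a^2/n)."""
    hits = 0
    for _ in range(trials):
        s = sum(random.random() < mu for _ in range(n))   # S_n for one experiment
        if s >= n * mu + a:
            hits += 1
    empirical = hits / trials
    bound = math.exp(-2 * a * a / n)
    return empirical, bound

print(hoeffding_check())  # the empirical frequency should stay below the bound (up to noise)
```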
Fact 2 (Bernstein inequality)
Let \(X_1,\ldots,X_n\) be random variables with range [0, 1] and
\(\sum_{t=1}^{n}\mathrm{Var}[X_t\mid X_{t-1},\ldots,X_1]=\sigma^2\). Let \(S_n=X_1+\cdots+X_n\). Then for all \(a\ge 0\)
\[P\{S_n\geq \mathbb{E}[S_n]+a\}\leq\exp\left\{-\frac{a^2/2}{\sigma^2+a/2}\right\}.\]


Proof of Theorem 1: Let \(c_{t,s}=\sqrt{(2\ln t)/s}\).
For any machine \(i\) , we upper bound \(T_i (n)\)
on any sequence of plays. More precisely, for each \(t ≥ 1\) we bound the indicator function
of \(I_t = i\) as follows. Let \(\ell\) be an arbitrary positive integer.

\[\begin{aligned} T_{i}(n)&=1+\sum_{t=K+1}^n\{I_t=i\} \\ &\leq\ell+\sum_{t=K+1}^n\{I_t=i,\,T_i(t-1)\geq\ell\} \\ &\leq\ell+\sum_{t=K+1}^n\left\{\bar{X}^*_{T^*(t-1)}+c_{t-1,T^*(t-1)}\leq\bar{X}_{i,T_i(t-1)}+c_{t-1,T_i(t-1)},\,T_i(t-1)\geq\ell\right\} \\ &\leq\ell+\sum_{t=K+1}^n\left\{\min_{0<s<t}\left(\bar{X}_s^*+c_{t-1,s}\right)\leq\max_{\ell\leq s_i<t}\left(\bar{X}_{i,s_i}+c_{t-1,s_i}\right)\right\} \\ &\leq\ell+\sum_{t=1}^\infty\sum_{s=1}^{t-1}\sum_{s_i=\ell}^{t-1}\left\{\bar{X}_s^*+c_{t,s}\leq\bar{X}_{i,s_i}+c_{t,s_i}\right\}. \end{aligned}\]

Now observe that \(\bar{X}_s^*+c_{t,s}\leq\bar{X}_{i,s_i}+c_{t,s_i}\) implies that at least one of the following must hold:
\[\bar{X}_{s}^{*} \leq\mu^{*}-c_{t,s} \qquad (7)\]
\[\bar{X}_{i,s_{i}} \geq\mu_i+c_{t,s_i} \qquad (8)\]
\[\mu^{*} <\mu_{i}+2c_{t,s_{i}}. \qquad (9)\]

We bound the probability of events (7) and (8) using Fact 1 (Chernoff-Hoeffding bound)

\(P\{\bar{X}_s^*\leq\mu^*-c_{t,s}\}\leq e^{-4\ln t}=t^{-4}\)
\(P\left\{\bar{X}_{i,s_i}\geq\mu_i+c_{t,s_i}\right\}\leq e^{-4\ln t}=t^{-4}.\)

For \(\ell=\lceil(8\ln n)/\Delta_i^2\rceil\), (9) is false. In fact, for \(s_i\geq\ell\), \(\mu^*-\mu_i-2c_{t,s_i}=\mu^*-\mu_i-2\sqrt{2(\ln t)/s_i}\geq\mu^*-\mu_i-\Delta_i=0.\)


4. Experiments


5. Conclusions

We have shown simple and efficient policies for the bandit problem that, on any set of reward
distributions with known bounded support, exhibit uniform logarithmic regret. Our policies
are deterministic and based on upper confidence bounds, with the exception of εn-GREEDY,
a randomized allocation rule that is a dynamic variant of the ε-greedy heuristic. Moreover,
our policies are robust with respect to the introduction of moderate dependencies in the
reward processes.

We have presented simple and efficient policies for the bandit problem that exhibit uniform logarithmic regret on any set of reward distributions with known bounded support. The policies are deterministic and based on upper confidence bounds, with the exception of ε_n-GREEDY, a randomized allocation rule that is a dynamic variant of the ε-greedy heuristic. Moreover, the policies are robust with respect to the introduction of moderate dependencies in the reward processes.

This work can be extended in many ways. A more general version of the bandit problem
is obtained by removing the stationarity assumption on reward expectations (see Berry &
Fristedt, 1985; Gittins, 1989 for extensions of the basic bandit problem). For example,
suppose that a stochastic reward process \(\{X_{i,s} : s = 1, 2, \ldots\}\) is associated to each machine
\(i = 1, \ldots, K\). Here, playing machine \(i\) at time \(t\) yields a reward \(X_{i,s}\) and causes the current
state \(s\) of \(i\) to change to \(s+1\), whereas the states of other machines remain frozen. A well-studied
problem in this setup is the maximization of the total expected reward in a sequence
of n plays. There are methods, like the Gittins allocation indices, that allow to find the
optimal machine to play at each time n by considering each reward process independently
from the others (even though the globally optimal solution depends on all the processes).
However, computation of the Gittins indices for the average (undiscounted) reward criterion
used here requires preliminary knowledge about the reward processes (see, e.g., Ishikida &
Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed
in Duff (1995) for the case of finite-state Markovian reward processes. However, there are no
finite-time regret bounds shown for this solution. At the moment, we do not know whether
our techniques could be extended to these more general bandit problems.
This work can be extended in many directions. A more general version of the bandit problem is obtained by removing the stationarity assumption on reward expectations (see Berry & Fristedt, 1985; Gittins, 1989 for extensions of the basic bandit problem). For example, suppose that a stochastic reward process \(\{X_{i,s}: s=1,2,\ldots\}\) is associated with each machine \(i=1,\ldots,K\). Playing machine \(i\) at time \(t\) yields a reward \(X_{i,s}\) and causes the current state \(s\) of \(i\) to change to \(s+1\), while the states of the other machines remain frozen. A well-studied problem in this setting is maximizing the total expected reward over a sequence of \(n\) plays. There are methods, such as the Gittins allocation indices, that find the optimal machine to play at each time \(n\) by considering each reward process independently of the others (even though the globally optimal solution depends on all the processes). However, computing the Gittins indices for the average (undiscounted) reward criterion used here requires preliminary knowledge about the reward processes (see, e.g., Ishikida & Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed in Duff (1995) for finite-state Markovian reward processes. However, no finite-time regret bounds have been shown for this solution. At the moment it is not known whether these techniques can be extended to such more general bandit problems.

Author's note: roughly, the idea is to drop the assumption that each arm's rewards come from a fixed distribution, which introduces the state-dependent rewards \(X_{i,s}\); how should the total expected reward then be maximized? If there is no information at all about how the reward distribution changes over time, nothing can be done; this is where the Gittins index comes in.
The Gittins index is a measure of the reward achievable by a particular stochastic process with certain properties. In other words, the process has a final termination state and evolves with an option to terminate at each intermediate state. Upon exiting at a particular state, the reward obtained is the sum of the probabilistic expected rewards associated with all states from the actual terminating state to the ultimate terminal state. The index is a real scalar.

Consensus Seeking in Multiagent Systems Under Dynamically Changing Interaction Topologies

Consensus seeking in multi-agent systems under dynamically changing interaction topologies.

Abstract

This note considers the problem of information consensus among multiple agents in the presence of limited and unreliable information exchange with dynamically changing interaction topologies. Both discrete and continuous update schemes are proposed for information
consensus. This note shows that information consensus under dynamically changing interaction topologies can be achieved asymptotically if the union of the directed interaction graphs have a spanning tree frequently enough as the system evolves.

Translation: This note considers the problem of information consensus among multiple agents in the presence of limited and unreliable information exchange with dynamically changing interaction topologies. Both discrete and continuous update schemes are proposed for information consensus. The note shows that consensus under dynamically changing interaction topologies can be achieved asymptotically if the union of the directed interaction graphs has a spanning tree frequently enough as the system evolves.

Keywords

Cooperative control, graph theory, information consensus,
multiagent systems, switched systems

Translation: cooperative control, graph theory, information consensus, multi-agent systems, switched systems [see the Zhihu article https://zhuanlan.zhihu.com/p/145805774]

Introduction

The study of information flow and interaction among multiple agents
in a group plays an important role in understanding the coordinated
movements of these agents. As a result, a critical problem for coordinated
control is to design appropriate protocols and algorithms such
that the group of agents can reach consensus on the shared information
in the presence of limited and unreliable information exchange
and dynamically changing interaction topologies. Consensus problems
have recently been addressed in [1]–[7], to name a few. In this note,
we extend the results of [2] to the case of directed graphs and present
conditions for consensus of information under dynamically changing
interaction topologies.
In contrast to [2], directed graphs will be used to represent the interaction
(information exchange) topology between agents, where information
can be exchanged via communication or direct sensing. A
preliminary result for information consensus is presented in [8], where
a linear update scheme is proposed for directed graphs. However, the
analysis in [8] was not able to utilize all available communication links.
A solution to this issue was presented in [4] for time-invariant communication
topologies. Information consensus for dynamically evolving
information was addressed in [9] in the context of spacecraft formation
flying where the exchanged information is the configuration of the virtual
structure associated with the (dynamically evolving) formation.
In many applications, the interaction topology between agents may
change dynamically. For example, communication links between
agents may be unreliable due to disturbances and/or subject to communication
range limitations. If information is being exchanged by
direct sensing, the locally visible neighbors of a vehicle will likely
change over time. In [2], a theoretical explanation is provided for
the observed behavior of the Vicsek model [10]. Possible changes
over time in each agent’s nearest neighbors are explicitly taken into
account; this is an example of information consensus under dynamically
changing interaction topologies. Furthermore, it is shown in [2]
that consensus can be achieved if the union of the interaction graphs
for the team are connected frequently enough as the system evolves.

The study of information flow and interaction among multiple agents in a group plays an important role in understanding their coordinated movements. A critical problem for coordinated control is therefore to design appropriate protocols and algorithms so that the group of agents can reach consensus on shared information in the presence of limited and unreliable information exchange and dynamically changing interaction topologies. Consensus problems have recently been addressed in [1]-[7], among others. In this note, the results of [2] are extended to directed graphs, and conditions for information consensus under dynamically changing interaction topologies are presented.

In contrast to [2], directed graphs are used here to represent the interaction (information exchange) topology between agents, where information can be exchanged via communication or direct sensing. A preliminary result for information consensus is presented in [8], where a linear update scheme is proposed for directed graphs; however, the analysis in [8] could not make use of all available communication links. A solution to this issue for time-invariant communication topologies was given in [4]. Information consensus for dynamically evolving information was addressed in [9] in the context of spacecraft formation flying, where the exchanged information is the configuration of the virtual structure associated with the (dynamically evolving) formation.

In many applications, the interaction topology between agents may change dynamically. For example, communication links between agents may be unreliable due to disturbances and/or subject to communication range limitations. If information is exchanged by direct sensing, the locally visible neighbors of a vehicle will likely change over time. In [2], a theoretical explanation is provided for the observed behavior of the Vicsek model [10]; possible changes over time in each agent's nearest neighbors are explicitly taken into account, which is an example of information consensus under dynamically changing interaction topologies. Furthermore, [2] shows that consensus can be achieved if the union of the interaction graphs of the team is connected frequently enough as the system evolves.

However, the approach in [2] is based on bidirectional information
exchange, modeled by undirected graphs. Extensions of this work to
second-order dynamics are discussed in [16] and [17].
There are a variety of practical applications where information only
flows in one direction. For example, in leader-following scenarios, the
leader may be the only vehicle equipped with a communication transmitter.
For heterogeneous teams, some vehicles may have transceivers,
while other less capable members only have receivers. There is a need
to extend the results reported in [2] to interaction topologies with directional
information exchange.
In addition, in [2] certain constraints are imposed on the weighting
factors in the information update schemes, which may be extended to
more general cases. For example, it may be desirable to weigh the information
from different agents differently to represent the relative confidence
of each agent’s information or relative reliability of different
communication or sensing links.
The objective of this note is to extend [2] to the case of directed
graphs and explore the minimum requirements to reach consensus by
using graph theory and matrix theory. As a comparison, [5] applies
a set-valued Lyapunov approach to consider discrete-time consensus
algorithms with unidirectional time-dependent communication links.
In addition, [3] solves the average-consensus problem with directed
graphs, which requires the graph to be strongly connected and balanced.
We show that under certain assumptions consensus can be
achieved asymptotically under dynamically changing interaction
topologies if the union of the collection of interaction graphs across
some time intervals has a spanning tree frequently enough. The
spanning tree requirement is a milder condition than connectedness
and is therefore more suitable for practical applications. We also allow
the relative weighting factors to be time-varying, which provides additional
flexibility. As a result, the convergence conditions and update
schemes in [2] are shown to be a special case of a more general result.
An additional contribution of this note is that we explicitly show
that a nonnegative matrix with the same positive row sums has its
spectral radius (its row sum in this case) as a simple eigenvalue if
and only if the directed graph of this matrix has a spanning tree. In
contrast, the Perron–Frobenius Theorem [11] for nonnegative matrices
only deals with irreducible matrices, that is, matrices with strongly connected
graphs. Besides having a spanning tree, if this matrix also has
positive diagonal entries, we show that its row sum is the unique eigenvalue
of maximum modulus.
The note is organized as follows. In Section II, we establish the notation
and formally state the problem. Section III contains the main
results, and Section IV offers our concluding remarks.

However, the approach in [2] is based on bidirectional information exchange, modeled by undirected graphs. Extensions of this work to second-order dynamics are discussed in [16] and [17].
In many practical applications, information flows in only one direction. For example, in leader-following scenarios the leader may be the only vehicle equipped with a communication transmitter. In heterogeneous teams, some vehicles may have transceivers while other, less capable members only have receivers. There is therefore a need to extend the results of [2] to interaction topologies with directional information exchange.
In addition, [2] imposes certain constraints on the weighting factors in the information update schemes, which can be relaxed to more general cases. For example, one may want to weigh the information from different agents differently, to represent the relative confidence in each agent's information or the relative reliability of different communication or sensing links.
The objective of this note is to extend [2] to directed graphs and to explore, using graph theory and matrix theory, the minimum requirements for reaching consensus. For comparison, [5] applies a set-valued Lyapunov approach to discrete-time consensus algorithms with unidirectional time-dependent communication links, and [3] solves the average-consensus problem on directed graphs, which requires the graph to be strongly connected and balanced.
We show that, under certain assumptions, consensus can be achieved asymptotically under dynamically changing interaction topologies if the union of the collection of interaction graphs over some time intervals has a spanning tree frequently enough. The spanning-tree requirement is a milder condition than connectedness and is therefore more suitable for practical applications. The relative weighting factors are also allowed to be time-varying, which provides additional flexibility. As a result, the convergence conditions and update schemes in [2] are shown to be a special case of a more general result.
An additional contribution of this note is an explicit proof that a nonnegative matrix with identical positive row sums has its spectral radius (its row sum in this case) as a simple eigenvalue if and only if the directed graph of the matrix has a spanning tree. In contrast, the Perron-Frobenius theorem [11] for nonnegative matrices only deals with irreducible matrices, i.e., matrices whose directed graphs are strongly connected. If, besides having a spanning tree, the matrix also has positive diagonal entries, its row sum is shown to be the unique eigenvalue of maximum modulus.
The note is organized as follows. Section II establishes the notation and formally states the problem. Section III contains the main results, and Section IV offers concluding remarks.


Author's note: by this point I was already thoroughly confused...
See https://zhuanlan.zhihu.com/p/503103566
It made clear that real-road autonomous driving involves exactly the reinforcement-learning side of multi-agent problems.
There are generally three main learning-algorithm structures.
The first ignores everyone else: each agent trains on its own, as if alone in a driving school, treating the other agents as part of the environment. This mode is called independent learning. Its advantage is that it is simple and fast: single-agent methods are applied directly to each individual agent. The drawback is equally clear: in the same environment, while you are taking "extra lessons", so is everyone else, which breaks the stationarity of the environment, and in the end nobody learns well. This kind of reinforcement learning has some effect in small-scale multi-agent problems with relatively discrete actions, but its performance in complex problems with high-dimensional state-action spaces is unsatisfactory.
The second mode is centralized learning: the states and actions of all agents are collected into a single augmented state-action space, and a single-agent algorithm learns over it directly. The problem is that once the number of agents becomes large, the size of this augmented space grows exponentially, so the agents cannot explore it adequately. Learning is also exhausting: a huge state-action space requires a huge neural network, which is costly in time, effort, and electricity.
Besides these two, there is a third structure called centralized training with decentralized execution: during training all agents can see global information (you also know how everyone else drives), while at execution time each agent makes decisions relying only on local state information. Although this structure is more demanding to train, it can actually be deployed, because each agent relies only on local information for its decisions and does not need a complex communication network to stay in contact with all the other agents.
Multi-agent reinforcement learning also faces many challenges.

  1. Non-stationarity of the environment: you learn, I learn, everyone learns — and everyone keeps escalating — which lowers the accuracy of the overall evaluation mechanism/reward function; a policy that used to be good keeps deteriorating as learning proceeds, so in the end what was learned is no longer useful and the others' effort is wasted too.
  2. Partial observability: in most multi-agent systems, each agent cannot obtain complete global information during execution and can only make a best decision from locally observed information — a driver's field of view has a limited range. This is a partially observable Markov decision process. The difficulty is that the Markov property of the whole process is no longer complete, making the environment appear non-Markovian.
  3. Learning to communicate (learn communication): when cooperating to complete a task, agents can exchange observations, policy parameters, and so on via communication — for example, briefly switching off the high beams when two cars meet at night "as a gesture of goodwill", or flashing them a few times before overtaking to alert the car ahead; these are learning methods in which the content of communication is specified.
  4. Stability and convergence of the algorithms.

What is a Markov process? A Markov process is a mathematical model describing the evolution of a sequence of random events with the Markov property.

Take a discrete-time process as an example:
let the random variables \(X_0,X_1,\cdots,X_T\) form a stochastic process. The set of all possible values of these random variables is called the state space. If the conditional distribution of \(X_{t+1}\) given the past states is a function of \(X_t\) only, then \(p\left(X_{t+1}=x_{t+1}\mid X_{0:t}=x_{0:t}\right)=p\left(X_{t+1}=x_{t+1}\mid X_t=x_t\right)\),
where \(X_{0:t}\) denotes the set of variables \(X_0,X_1,\ldots,X_t\) and \(x_{0:t}\) is a state sequence \(x_0,x_1,\ldots,x_t\) in the state space.

A Markov process has two key characteristics:
Markov property: given the current state, the probability of future states depends only on the current state and is independent of the history. In other words, future evolution depends only on the present state, not on the path taken to reach it.
Discrete or continuous time: a Markov process can be modeled in discrete or continuous time. In discrete time the system moves from one state to another at each time step; in continuous time the state changes continuously and can be described by stochastic differential equations.
A Markov process is usually specified by a state space and transition probabilities. The state space is the set of all possible states, and the transition probabilities describe the probability of moving from one state to another; they can be represented by a transition matrix or a transition function.
Markov processes can be used to simulate the evolution of random systems, to predict the probability distribution of future states, and to solve decision and optimization problems.

For details on Markov processes, Markov reward processes (MRP) and the Bellman equation, Markov decision processes, and related concepts, see the blog post
https://cloud.tencent.com/developer/article/2338235
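To make the Markov property concrete, here is a tiny simulation of a discrete-time Markov chain driven only by a transition matrix; the three states and their probabilities are invented for illustration.

```python
import random

# Hypothetical 3-state transition matrix: P[i][j] = P(X_{t+1} = j | X_t = i).
P = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

def step(state):
    """Sample the next state using only the current state (the Markov property)."""
    r, acc = random.random(), 0.0
    for j, p in enumerate(P[state]):
        acc += p
        if r < acc:
            return j
    return len(P) - 1

x = 0
trajectory = [x]
for _ in range(10):
    x = step(x)          # the history before x is never consulted
    trajectory.append(x)
print(trajectory)
```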


PROBLEM STATEMENT

Let \(\mathcal{A}=\{A_i \mid i\in\mathcal{I}\}\) be a set of \(n\) agents, where \(\mathcal{I}=\{1,2,\ldots,n\}\). A directed graph \(\mathcal{G}\) will be used to model the interaction topology among these agents. In \(\mathcal{G}\), the \(i\)th node represents the \(i\)th agent \(A_i\), and a directed edge from \(A_i\) to \(A_j\), denoted as \((A_i, A_j)\), represents a unidirectional information exchange link from \(A_i\) to \(A_j\), that is, agent \(j\) can receive or obtain information from agent \(i\), \((i, j)\in\mathcal{I}\).
If there is a directed edge from \(A_i\) to \(A_j\) ,
\(A_i\) is defined as the parent node and \(A_j\) is defined as the child node.
The interaction topology may be dynamically changing, therefore let \(\bar{\mathcal{G}}=\{\mathcal{G}_{1},\mathcal{G}_{2},\ldots,\mathcal{G}_{M}\}\) denote the set of all possible directed interaction
graphs defined for \(\mathcal{A}\). In applications, the possible interaction
topologies will likely be a subset of \(\bar{\mathcal{G}}\). Obviously, \(\bar{\mathcal{G}}\) has finitely many elements.
The union of a group of directed graphs
\(\{\mathcal{G}_{i_1},\mathcal{G}_{i_2},\ldots,\mathcal{G}_{i_m}\}\subset\bar{\mathcal{G}}\)
is a directed graph with nodes given by \(A_i\), \(i\in\mathcal{I}\), and edge set given by
the union of the edge sets of \(\mathcal{G}_{i_j}\), \(j=1,\ldots,m\).

A directed path in graph \(\mathcal{G}\) is a sequence of edges \((A_{i_1},A_{i_2}),(A_{i_2},A_{i_3}),(A_{i_3},A_{i_4}),\ldots\) in that graph. Graph \(\mathcal{G}\) is called strongly connected if there is a directed path from \(A_i\)
to \(A_j\) and from \(A_j\) to \(A_i\) between any pair of distinct nodes \(A_i\) and \(A_j\).

A directed tree is a directed graph, where every node,
except the root, has exactly one parent. A spanning tree of a directed
graph is a directed tree formed by graph edges that connect all the
nodes of the graph

We say that a graph has (or contains)
a spanning tree if a subset of the edges forms a spanning tree. Let \(M_{n}(\mathbb{R})\) represent the set of all \(n \times n\) real matrices.Given a matrix \(A=[a_{ij}]\in M_{n}(\mathbb{R})\), the directed graph of \(A\), denoted by \(\Gamma(A)\), is the directed graph on \(n\) nodes \(V_i,i\in \mathcal{I}\) such that there is a directed
edge in \(\Gamma(A)\) from \(V_j\) to \(V_i\) if and only if \(a_{ij} \neq 0\)

Author's note: is this just the adjacency matrix? Not sure that's the right name...
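As an illustration of these definitions (my own example, not from the note), the sketch below builds the edge set of \(\Gamma(A)\) from a matrix and checks the spanning-tree condition via reachability: a directed graph has a spanning tree iff some node can reach every other node.

```python
import numpy as np

def directed_graph_of(A):
    """Edge set of Gamma(A): a directed edge from node j to node i iff A[i][j] != 0.

    Diagonal entries (self-loops) are ignored here, since they do not affect
    the spanning-tree question.
    """
    A = np.asarray(A)
    n = A.shape[0]
    return {(j, i) for i in range(n) for j in range(n) if i != j and A[i, j] != 0}

def has_spanning_tree(n, edges):
    """A directed graph has a spanning tree iff some node reaches all other nodes."""
    adj = {i: [] for i in range(n)}
    for (u, v) in edges:
        adj[u].append(v)
    def reaches_all(root):
        seen, stack = {root}, [root]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    return any(reaches_all(r) for r in range(n))

# Illustrative weight matrix: agent 1 listens to agent 0, agent 2 listens to agent 1.
A = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]
edges = directed_graph_of(A)
print(edges, has_spanning_tree(3, edges))   # True: node 0 reaches every other node
```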

Let \(\xi_{i}\in\mathbb{R},i\in\mathcal{I}\) , represent the \(i\) th information state associated
with the \(i\) th agent. The set of agents \(\mathcal{A}\) is said to achieve consensus
asymptotically if for any \(\xi_i(0),i\in\mathcal{I},\|\xi_i(t)-\xi_j(t)\|\to0\mathrm{~as~}t\to\infty\) for each \((i,j) \in \mathcal{I}\).

("consensus asymptotically": asymptotic agreement of the information states)

Given \(T\) as the sampling period, we propose the following discretetime
consensus scheme:
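The discrete-time scheme itself appeared as an equation image in the original post and is not reproduced here. Purely as an illustration of the kind of update the note proposes — each agent replacing its information state by a weighted average of its own state and the states it currently receives — here is a sketch; the weight matrices, the switching pattern, and the normalization are assumptions for the example, not the paper's exact scheme.

```python
import numpy as np

def consensus_step(xi, W):
    """One discrete-time update: each agent averages the states it can access.

    W is a nonnegative weight matrix for the current interaction graph, with
    W[i][j] > 0 iff agent i obtains information from agent j and W[i][i] > 0;
    each row is normalized so that the update is a convex combination.
    """
    W = np.asarray(W, dtype=float)
    W = W / W.sum(axis=1, keepdims=True)    # row-stochastic normalization
    return W @ xi

# Two switching topologies whose union contains a spanning tree rooted at agent 1:
xi = np.array([1.0, 5.0, -2.0])
G1 = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]      # agent 0 listens to agent 1
G2 = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]      # agent 2 listens to agent 1
for k in range(200):
    xi = consensus_step(xi, G1 if k % 2 == 0 else G2)
print(xi)    # all three states approach a common value (agent 1's state here)
```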


CONSENSUS OF INFORMATION UNDER DYNAMICALLY CHANGING INTERACTION TOPOLOGIES


CONCLUSION

This note has considered the problem of information consensus
under dynamically changing interaction topologies and weighting
factors. We have used directed graphs to represent information exchanges
among multiple agents, taking into account the general case
of unidirectional information exchange. We also proposed discrete
and continuous update schemes for information consensus and gave
conditions for asymptotic consensus under dynamically changing interaction
topologies and weighting factors using these update schemes.
The reader is referred to [15] for simulation examples that illustrate
the results presented in this note.

This note considered the problem of information consensus under dynamically changing interaction topologies and weighting factors. Directed graphs were used to represent information exchange among multiple agents, covering the general case of unidirectional information exchange. Discrete and continuous update schemes for information consensus were proposed, and conditions for asymptotic consensus under dynamically changing interaction topologies and weighting factors were given for these update schemes. The reader can refer to [15] for simulation examples illustrating the results presented in this note.


Related articles explored further

Research status and development of cooperative formation flight control for multiple UAVs

Authors: Zong Qun, Wang Dandan, Shao Shikai, Zhang Boyuan (School of Electrical Engineering and Automation, Tianjin University); Han Yu (School of Electronic Information Engineering, Tianjin University).
I. Task assignment

II. Path planning

III. Formation control

  1. Formation geometry design
  2. Formation flight control methods
    The most commonly used is the leader-follower method; there are also behavior-based, virtual-structure, graph-theoretic, and consensus-based methods.

Leader-follower method: the leader tracks a pre-specified trajectory, while the followers maintain a given configuration relative to the leader's trajectory and their velocities reach agreement. The leader can be viewed as the target being tracked, or as the common interest of the whole multi-agent system.
Desai's group at the University of Pennsylvania.

Virtual-structure method:
The formation is treated as a virtual rigid body; a virtual leader or a virtual geometric center is defined within the formation, and every UAV in the formation moves with reference to that virtual leader or virtual geometric center.

Graph-theoretic method:
Vertices of a topology graph describe individual UAVs, and edges between two vertices describe the coupling/constraint relations between UAVs, such as sensing, communication, or control connections; bringing control theory into the graph allows formation control strategies to be constructed.
Rigid graph theory has made considerable progress in formation applications. Generally, rigid graphs deal with undirected graphs, i.e., the links between UAVs are bidirectional. In many practical situations, directed graphs are used for multi-UAV systems to reduce the communication load. In 2007, Hendrickx et al. at KU Leuven (Belgium) proposed the concept of directed rigidity, gave its definition, and provided strategies for generating directed rigid graphs; the team then extended 2D rigid graphs to 3D and even higher dimensions, giving necessary and sufficient conditions for rigidity of graphs and persistence of directed graphs in high-dimensional spaces. In 2013, Barca et al. at Monash University (Australia) introduced graph theory into multi-robot formations, refining the leader-follower control mechanism so that multiple robots form a two-dimensional formation without communicating with one another. In 2014, Zhang et al. (USA) used single-integrator models and rigid graph theory to design an exponentially stable formation controller that drives multiple individuals into a desired formation. In 2016, Ramazani et al. at Louisiana State University studied cooperative control of individuals moving in different planes, running cooperative simulations with rigid graph theory for both single- and double-integrator models. Luo Xiaoyuan et al. at Yanshan University, addressing the optimal rigid formation problem for multi-agent systems, designed an automatic generation algorithm for optimal persistent formations and generated optimal persistent formations.
Rigid graphs can represent arbitrary formations, and graph theory provides a mature theoretical foundation, but simulation studies are relatively hard to implement.

Consensus method:
Consensus means that each agent updates its own state using the state information of the neighboring agents it communicates with, so that eventually the states of all agents agree. Using consensus theory for multi-UAV formation control, with information sensing and exchange among UAVs realized over a distributed network, allows real-time responses to unexpected situations and improves formation safety.
The concept of consensus first appeared in networked dynamic systems for distributed computing. In 2004, Saber et al. at the University of California obtained the necessary and sufficient condition for consensus convergence of multi-agent systems: the topology graph is connected. In 2005, Ren Wei et al. proved that for directed network topologies, all agents can achieve global consensus convergence as long as a spanning-tree structure exists. Ren Wei et al. and Jadbabaie et al. obtained the necessary and sufficient condition for consensus convergence of multi-agent systems under dynamic network topologies: if the network topology over any time interval has a spanning-tree structure, the agents converge to consensus. Ren Wei et al. then introduced consensus control into multi-agent formation control systems. In 2009, Seo et al. at Seoul National University used a consensus-based feedback linearization method for the multi-UAV time-varying formation problem, guaranteeing that multiple UAVs fly in a given time-varying formation. In 2011, Jamshidi et al. at the University of Texas addressed cooperative formation control of heterogeneous multi-agent systems, using GPS waypoint navigation for the UAVs and designing a consensus protocol for each UAV, achieving cooperative control of a ground robot and multiple UAVs. In 2012, Matthew et al. addressed tight formations of micro UAVs flying through a 3D environment, introducing relative position errors and designing a nonlinear consensus-based decentralized controller that achieved tight formation flight of four quadrotors. In 2014, Kuriki et al. at Keio University proposed a consensus-based cooperative formation control strategy with collision-avoidance capability for multi-UAV cooperative control, achieving cooperative quadrotor formation flight with inter-vehicle collision avoidance; in 2015 the same team combined decentralized model predictive control with consensus-based control to achieve cooperative formation flight of multiple UAVs with collision avoidance. In 2013, Li Shihua et al. at Southeast University handled formation control with and without a leader via finite-time consensus algorithms. In 2013, Xing Guansheng et al. studied formation of small rotorcraft groups and proposed a Hamilton-cycle-based communication topology design within a cascade control framework. Zong Qun et al. at Tianjin University, addressing attitude synchronization of flight vehicles, combined leader-follower with behavior-based and consensus-based methods to design a finite-time attitude synchronization controller, achieving finite-time attitude synchronization of multiple vehicles.

  1. Information sensing and data fusion

  2. Formation communication

  3. Formation simulation platforms


Author's note: rigid graph theory is a concept from distributed control theory.
See https://blog.csdn.net/mkb9559/article/details/120549234

  1. Generic
    Generic [1]: "We say a configuration q is generic if the entries of q are algebraically independent over the rational numbers, namely, there is no non-zero polynomial with rational coefficients that vanishes at the entries of q."
    Generic: simply put, in the two-dimensional case no points of the graph are collinear.
  2. Equivalent and Congruent
    Simply put, equivalence only guarantees that all edges of the two graphs have equal lengths, while congruence requires the distance between any two points to be equal.
  3. Rigid
    Rigid [3]: "Roughly speaking, a formation is rigid if its only smooth motions are those corresponding to translation or rotation of the whole formation."
    Rigidity: simply put, like a rigid body, a rigid graph can only be translated or rotated as a whole.

Design and Implementation of a Swarm-Intelligence-Based Cooperative UAV Swarm Confrontation System (Master's thesis), University of Electronic Science and Technology of China, 2020

Research status abroad:
DARPA proposed the automated mixed control (MICA) program, aimed at improving the autonomy and coordination capabilities of UAV systems. It also covers multi-UAV control architectures, autonomous multi-UAV formation control methods, and modeling and simulation of cooperative UAV combat, with the goal of enabling a small number of operators to control large formations of unmanned combat aircraft.
The Autonomous Negotiating Teams (ANT) program discussed control strategies for UAV swarms in stages. Using negotiation algorithms, it studied conflict coordination and cooperative task assignment for UAV swarms, achieving preliminary autonomous cooperative control of the swarm in the mission-planning stage.
The Wide Area Search Munitions (WASM) program, targeting key technologies for integrated UAV reconnaissance/strike, proposed a distributed architecture for cooperative UAV swarm control with four cooperative-control layers — formation-based task assignment, task coordination, path planning, and trajectory control — achieving hierarchical regulation for large-scale, complex mission scenarios.
Caltech, UCLA, and other universities jointly studied cooperative control of distributed autonomous platforms in adversarial environments, including, most representatively, distributed information collection, processing, and decision-making under a flat decentralized network structure; reliable communication network design in complex electromagnetic environments; task assignment and coordination for large-scale combat platforms; and real-time decision-making in unstructured combat environments and under unexpected events. The project on real-time coordination and control of multiple heterogeneous UAVs, funded by the European Commission's Information Society Technologies programme, focuses on real-time coordination and control of multiple heterogeneous platforms in civilian applications; its goal is to design and implement a distributed control architecture for an integrated detection and surveillance system composed of multiple heterogeneous UAVs, integrating distributed information sensing and real-time image processing. In theoretical research on cooperative UAV swarm control, owing to the autonomy of UAVs, cooperative swarm methods are increasingly mapped onto multi-agent systems for study.
A multi-agent system (Multi-Agent Systems) is an intelligent system composed of multiple autonomous individuals in the same environment; the agents in the system can form cooperative relationships. Research on multi-agent cooperative confrontation oriented to UAV swarms originates from classical agent theory. In a multi-agent system, because the agents influence one another, the process is non-Markovian. Jun Lung Hu proved in 1998 that multi-agent cooperation applied in dynamic environments eventually converges to a Nash equilibrium, providing a theoretical foundation for the wide use of MAS. On this basis, multi-agent cooperation theory has found more and more applications in the unmanned-swarm domain. Beard et al. studied the cooperation problem of UAV swarms and pointed out that the centralized control mode depends heavily on communication capacity and the computing power of the central node and can hardly meet the requirements of dynamic control, whereas distributed control can give full play to each UAV's autonomy.


Distributed Consensus Control of Nonlinear Multi-Agent Systems, Wang Yinqiu

Overview

Over the past few decades, systems have grown markedly in scale and the internal couplings of control systems have become increasingly complex, so applying traditional centralized control leads to enormous computation and a higher probability of failure. Distributed control is a very effective way of dealing with large-scale system control, and it is particularly effective for a special case of large-system control — cooperative control of multi-agent systems. A multi-agent system is a complex networked system composed of a large number of simple individuals that can interact locally. In nature, for example, the collective behavior of fish schools and the formation flight of birds can both be viewed as the motion of multi-agent systems; cooperative multi-robot formation control, smart-grid dispatch, and UAV formation flight control are also multi-agent control problems.
Compared with traditional control problems, multi-agent control differs in the following respects:
(1) control comes not only from outside the system but also from inside it;
(2) the agents influence and couple with one another, so each agent may be both a controller and a controlled plant;
(3) every agent can make decisions and communicate, and has its own interests;
(4) while pursuing its own control objective, each agent must also take the overall objective into account.
Through cooperation among the agents, a multi-agent system can accomplish many tasks that a single system cannot. Multi-agent systems also have the following advantages:
(1) the total cost of several simple agents is usually lower than that of a single complex individual, so multi-agent systems can effectively reduce cost;
(2) if one agent in the system fails, the impact on the whole system is small;
(3) changing the communication network or the structure among the agents allows the system to accomplish different tasks, so a multi-agent system is more flexible than a single system.
To achieve distributed cooperation, the main task is to design distributed dynamic rules for the individual agents so that the whole multi-agent system cooperates to complete a specific task. Current research directions in cooperative multi-agent control include synchronization, formation, flocking, and so on. In essence, all of these build on consensus of multi-agent systems as an important foundation. Consensus of a multi-agent system means that the states of all agents in the network reach agreement. The motion of the agents is determined by the initial states of all agents, the network communication topology, each agent's dynamic model, and the consensus control protocol. Here, a consensus control protocol is a control protocol whose goal is to drive the multi-agent system to consensus, using as its means the information exchange between each agent and its neighbors in the communication network. Because the consensus problem is the foundation of cooperative multi-agent control, many researchers around the world have made outstanding contributions in this area over the past decade and obtained a large number of excellent results.
For multi-agent systems, nonlinearity is an unavoidable issue. In essence, all systems have nonlinear dynamics; it is only for convenience that nonlinearities with very small effects on the system are neglected. Moreover, many plants, such as quadrotor UAVs or robots, are themselves strongly nonlinear systems, so existing conclusions about linear multi-agent systems cannot be used when studying cooperative control of such systems. At the same time, the communication models between agents may be nonlinear, so studying consensus control of multi-agent systems in a nonlinear framework is of greater practical significance. For linear multi-agent systems, designing nonlinear distributed consensus protocols may yield better control performance, such as shorter convergence time or smaller overshoot, or accomplish tasks that linear protocols cannot, such as guaranteeing that the multi-agent system reaches exact consensus in finite time or that the system input remains bounded.

Prior work

  1. Overview of research on distributed consensus control of multi-agent systems
    In 1995, Vicsek studied the consensus problem of particle swarms and pointed out that if every particle moves in the average direction of its neighbors and itself, then over time the directions of motion of all particles tend to, and remain in, agreement. This work provided the basis for later research on consensus problems. Jadbabaie et al. proved that when the network communication topology is an undirected graph, a sufficient condition for a first-order multi-agent network to reach consensus is that the topology is jointly connected over contiguous bounded time intervals, giving the first theoretical analysis of the distributed consensus problem.
    [Author's comment: so this, it seems, is where it all started.]
    In 2004, Olfati-Saber and Murray proposed a general framework for the consensus problem of single-integrator multi-agent systems and showed that if the communication network among the agents is directed, balanced, and strongly connected, then the states of all agents eventually converge to the average of all agents' initial states, i.e., average consensus is reached; they also used frequency-domain analysis to study the convergence of the multi-agent system when the communication contains time delays. At the same time, Moreau [21] applied a set-valued Lyapunov method and convexity theory to analyze the consensus of discrete-time multi-agent networks, proving that as long as the coupling between agents satisfies certain convexity conditions and the network is jointly connected, the multi-agent network reaches consensus.

In [22] and [23] [these are the paper covered above], Ren and Beard generalized the communication network among agents to the more general case of weighted directed graphs, and proved that if the dynamics of an individual agent are
\(\dot{x}_i(t)=\sum_{j\in\mathcal{N}_i(t)}a_{ij}(x_j(t)-x_i(t))\)
then a necessary and sufficient condition for the multi-agent system to reach consensus is that the network topology contains a spanning tree.
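A quick way to see the spanning-tree condition at work is to simulate the single-integrator dynamics above; the sketch below (my own example) uses a simple Euler discretization and a directed chain topology.

```python
import numpy as np

def simulate_consensus(A, x0, dt=0.01, steps=2000):
    """Euler simulation of  x_i' = sum_j a_ij * (x_j - x_i)  for single integrators."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    L = np.diag(A.sum(axis=1)) - A        # graph Laplacian, so the dynamics are x' = -L x
    for _ in range(steps):
        x = x + dt * (-L @ x)
    return x

# Directed topology containing a spanning tree rooted at agent 0 (weights illustrative):
A = [[0, 0, 0],
     [1, 0, 0],    # agent 1 receives information from agent 0
     [0, 1, 0]]    # agent 2 receives information from agent 1
print(simulate_consensus(A, x0=[3.0, -1.0, 7.0]))  # all states approach agent 0's value
```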

The results above concern consensus of single-integrator multi-agent networks, but in the real world many controlled plants have second-order dynamics, usually written as \(\dot{x}_i=\nu_i,\quad\dot{\nu}_i=u_i\),
where \(x_i\) can be understood as the agent's position and \(v_i\) as its velocity. Xie and Wang [24] proposed a consensus algorithm that requires the agents' absolute velocities; with this algorithm the agents' final convergence value is a constant and all agents' velocities eventually tend to zero.
In [25] the authors give another control algorithm, \(u_i=\sum_{j\in\mathcal{N}_i(t)}a_{ij}[(x_j-x_i)+\gamma(\nu_j-\nu_i)]\),
and prove that even if the network topology contains a spanning tree, the multi-agent network does not necessarily reach consensus: the parameter \(\gamma\) must be larger than a constant related to the eigenvalues of the Laplacian matrix of the network topology for consensus to be reached. Yu [26,27] gave sufficient conditions for a directed double-integrator communication network to reach consensus: first, the network topology contains a spanning tree; second, the real and imaginary parts of the eigenvalues of the corresponding Laplacian matrix satisfy certain conditions. References [29] and [30] design distributed consensus control protocols for second-order multi-agent networks under input saturation constraints.
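For the second-order protocol quoted above, the following sketch (again an illustrative example, with an arbitrary directed cycle and \(\gamma\) chosen comfortably large) simulates \(\dot{x}_i=v_i,\ \dot{v}_i=u_i\) under Euler discretization; positions converge to a common value and, since all agents start at rest, velocities decay to zero.

```python
import numpy as np

def second_order_consensus(A, x0, v0, gamma=2.0, dt=0.005, steps=4000):
    """Euler simulation of  u_i = sum_j a_ij * [(x_j - x_i) + gamma * (v_j - v_i)]."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    v = np.asarray(v0, dtype=float).copy()
    deg = A.sum(axis=1)                       # in-neighbor weight sums
    for _ in range(steps):
        u = (A @ x - deg * x) + gamma * (A @ v - deg * v)   # = -L x - gamma * L v
        x, v = x + dt * v, v + dt * u
    return x, v

A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]   # a directed cycle, which contains a spanning tree
x, v = second_order_consensus(A, x0=[0.0, 2.0, -1.0], v0=[0.0, 0.0, 0.0])
print(x, v)       # positions agree; velocities tend to zero for these initial conditions
```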

None of the above references consider a time-varying leader in the system. If the multi-agent network contains a time-varying leader, Ren [32] gave several distributed control algorithms for the first-order case that solve the coordinated tracking problem. Qin et al. [28] considered tracking control of a time-varying reference signal for second-order multi-agent networks. In [31] the consensus problem of a class of second-order multi-agent systems is studied in more detail. Cao and Ren [33] achieved exact tracking of a dynamic leader in the multi-agent network by designing a discontinuous sliding-mode control algorithm. Hong et al. [34,35], under the assumption that every follower can obtain the leader's information, likewise designed distributed control protocols that solve the tracking control problem. Reference [36] designs a distributed sliding-mode estimator that can estimate the leader's information exactly in finite time. For networks with multiple leaders, [37] defines the problem as distributed containment control and gives results for first-order integrator multi-agent systems. Cao et al. [38] also discussed distributed containment control of second-order multi-agent networks and designed corresponding control protocols. Mei et al. [39] studied distributed containment control of multiple Lagrangian systems, and Li et al. studied the containment control problem for general linear systems in [40].
For general linear models, Ma and Zhang [41] studied the consensusability of general time-invariant linear systems under fixed directed communication topologies, pointing out that necessary conditions for a multi-agent network composed of such systems to reach consensus are that the communication topology contains a directed spanning tree, that (A, B) is stabilizable, and that (A, C) is detectable. Reference [42] gives sufficient conditions for general linear multi-agent networks to reach consensus under fixed topologies and proposes the notion of a consensus region. For distributed consensus control of higher-order multi-agent networks with time delays, [43,44] also give corresponding theoretical results.

  1. Distributed H∞ consensus control of multi-agent systems

H∞ control theory is an important branch of classical control theory. It mainly studies how, when the system is subject to external disturbances or internal perturbations, to design a controller so that the closed-loop system satisfies a prescribed H∞ performance index, i.e., so that the influence of external disturbances or internal perturbations on the operation of the whole system is minimized. Because this problem is of great practical relevance, H∞ control has received attention from more and more researchers.

  1. Consensus control of nonlinear multi-agent systems
posted @ 2023-09-16 11:01 藤君