[Repost] Imitation Learning: Online and Offline Imitation Learning ("Imitate with Caution: Offline and Online Imitation")
An article I came across while browsing the web that introduces imitation learning. Its title:
I am reposting it because it is quite easy to follow and still packed with useful material. My own understanding of imitation learning used to be rather shallow, and I learned a lot from reading it.
Offline imitation learning is relatively easy to understand, but I never fully understood online imitation learning. This article presents a well-known online imitation learning algorithm, the Dataset Aggregation approach: DAgger.
Link:
==================================================================
What's Imitation Learning?
As the name itself suggests, almost every species, including humans, learns by imitating and then improvising. That's evolution in one sentence. Similarly, we can make machines mimic us and learn from a human expert. Autonomous driving is a good example: we can make an agent learn from millions of driving demonstrations and mimic an expert driver.
This learning from demonstrations, also known as Imitation Learning (IL), is an emerging field in reinforcement learning and in AI more generally. The applications of IL in robotics are ubiquitous: a robot can learn a policy by analysing demonstrations of that policy performed by a human supervisor.
Expert Absence vs Presence:
Imitation learning takes two directions depending on whether the expert is absent during training or present to correct the agent's actions. Let's talk about the first case, when the expert is absent.
Expert Absence During Training
Absence of the expert basically means that the agent has access only to the expert demonstrations and nothing more. In these "expert absence" tasks, the agent tries to use a fixed training set of state-action pairs demonstrated by an expert to learn a policy whose actions are as similar as possible to the expert's.
The expert demonstrations contain numerous training trajectories, and each trajectory comprises a sequence of observations and a sequence of actions executed by the expert. Because these trajectories are fixed and are not affected by the agent's policy, such "expert absence" tasks can also be termed offline imitation learning tasks.
This learning problem can be framed as supervised learning: we simply train a supervised model that maps states directly to actions, mimicking the expert through his/her demonstrations. We call this approach "behavioural cloning".
Now we need a surrogate loss function that quantifies the difference between the demonstrated behaviour and the learned policy. We formulate this loss as a maximum expected log-likelihood objective.
Minimising the L2 error is equivalent to maximising the log-likelihood under a Gaussian: $\arg\min_\theta \sum_i \lVert a_i - \pi_\theta(s_i) \rVert^2 = \arg\max_\theta \sum_i \log \mathcal{N}(a_i \mid \pi_\theta(s_i), \sigma^2 I)$
We choose cross-entropy if we are solving a classification problem and the L2 loss if it is a regression problem. It is simple to see that minimising the L2 loss is equivalent to maximising the expected log-likelihood under a Gaussian distribution.
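As a concrete illustration, here is a minimal behavioural-cloning sketch. It assumes PyTorch, a continuous action space (so the L2/MSE loss applies; a discrete-action task would use cross-entropy instead), and expert data stored as 2-D arrays of shape (N, state_dim) and (N, action_dim); all names here are illustrative, not from the original post.

```python
# Minimal behavioural-cloning sketch (assumptions described above).
import torch
import torch.nn as nn

def behaviour_clone(expert_states, expert_actions, epochs=200, lr=1e-3):
    """Fit a policy network to expert (state, action) pairs by supervised learning."""
    states = torch.as_tensor(expert_states, dtype=torch.float32)
    actions = torch.as_tensor(expert_actions, dtype=torch.float32)

    policy = nn.Sequential(                      # simple MLP policy: state -> action
        nn.Linear(states.shape[1], 64),
        nn.ReLU(),
        nn.Linear(64, actions.shape[1]),
    )
    optimiser = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                       # L2 loss, i.e. Gaussian log-likelihood

    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(policy(states), actions)  # distance to the expert's actions
        loss.backward()
        optimiser.step()
    return policy
```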
Challenges
Everything looks fine so far, but one important shortcoming of behavioural cloning is generalisation. The expert only covers a subset of the infinitely many possible states an agent can experience. A simple example: an expert car driver does not collect unsafe and risky states by drifting off-course, so when the agent encounters such risky states it may not have learned corrective actions, as there is no data for them. This happens because of "covariate shift", a well-known challenge in which the states encountered during training differ from the states encountered during testing, reducing robustness and generalisation.
One way to solve this covariate-shift problem is to collect more demonstrations of risky states, but this can be prohibitively expensive. Expert presence during training can help us solve this problem and bridge the gap between the demonstrated policy and the agent's policy.
Expert Presence: Online Learning
In this section we will introduce the most famous online imitation learning algorithm, the Dataset Aggregation approach (DAgger). This method is very effective at closing the gap between the states encountered during training and the states encountered during testing, i.e. covariate shift.
What if the expert evaluates the learner's policy during learning? The expert provides the correct actions to take for the examples that come from the learner's own behaviour. This is exactly what DAgger tries to achieve. The main advantage of DAgger is that the expert teaches the learner how to recover from past mistakes.
The steps are simple and similar to behavioural cloning, except that we collect more trajectories based on what the agent has learned so far:
1. The policy is initialised by behavioural cloning of the expert demonstrations D, resulting in policy π1.
2. The agent uses π1 to interact with the environment and generates a new dataset D1 of trajectories.
3. D = D ∪ D1: the newly generated dataset D1 is added to the expert demonstrations D.
4. The aggregated demonstrations D are used to train a new policy π2, and the process repeats.
To leverage the presence of the expert, a mixture of the expert and the learner (the sampling policy is πi = βi πE + (1 - βi) πL) is used to query the environment and collect the dataset. Thus, DAgger learns a policy from expert demonstrations collected under the state distribution induced by the learned policy. If we set β = 0, all trajectories during training are generated by the learner agent.
Algorithm:
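Below is a minimal, hypothetical sketch of the DAgger loop in Python. It assumes a classic Gym-style environment (env.reset() returns a state and env.step(action) returns (state, reward, done, info)), a hypothetical expert_policy(state) callable that returns the expert's action as an array, and the behaviour_clone helper from the earlier sketch; the mixing coefficient decays as βi = p^(i-1), which is the commonly cited choice.

```python
# Hypothetical DAgger sketch; env, expert_policy and behaviour_clone are
# assumptions described above, not part of the original post.
import numpy as np
import torch

def rollout_states(env, act_fn, max_steps=200):
    """Collect the states visited while acting with act_fn in a Gym-style env
    (assumes env.step returns the classic (obs, reward, done, info) tuple)."""
    states, state, done, t = [], env.reset(), False, 0
    while not done and t < max_steps:
        states.append(state)
        state, _, done, _ = env.step(act_fn(state))
        t += 1
    return states

def dagger(env, expert_policy, expert_states, expert_actions,
           n_iterations=10, p=0.9):
    """States come from the learner's (mixed) policy, actions are relabelled
    by the expert, and the data is aggregated before retraining."""
    D_states, D_actions = list(expert_states), list(expert_actions)        # dataset D
    policy = behaviour_clone(np.asarray(D_states), np.asarray(D_actions))  # pi_1 via BC

    for i in range(1, n_iterations + 1):
        beta = p ** (i - 1)                      # decaying mixing coefficient beta_i

        def mixed_act(s):                        # execute the expert with prob. beta,
            if np.random.rand() < beta:          # otherwise the current learner
                return expert_policy(s)
            with torch.no_grad():
                return policy(torch.as_tensor(s, dtype=torch.float32)).numpy()

        visited = rollout_states(env, mixed_act)            # states under pi_i
        D_states += visited                                  # D_i: the expert relabels
        D_actions += [expert_policy(s) for s in visited]     # every visited state

        # D <- D U D_i, then retrain the policy on the aggregated dataset
        policy = behaviour_clone(np.asarray(D_states), np.asarray(D_actions))

    return policy
```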

DAgger alleviates the problem of covariate shift (the state distribution induced by the learner's policy differs from the state distribution in the initial demonstration data). This approach significantly reduces the size of the training dataset necessary to obtain satisfactory performance.
Conclusion
DAgger has seen extraordinary success in robotic control and has also been applied to controlling UAVs. Since the learner encounters many states in which the expert never demonstrated how to act, an online learning approach such as DAgger is essential in these applications.
In the next blog in this series, we will look at the shortcomings of the DAgger algorithm and, importantly, emphasise the safety aspects of DAgger-style algorithms.
=============================================================
My own analysis:
Dataset Aggregation approach: DAgger
For the detailed steps of this algorithm, you really need to look at the algorithm description:
From the algorithm description you can see that three policies appear: Pi_E is the expert's policy, Pi_L is the learner's policy, and Pi is the policy used to sample data.
The dataset used to train each new learner policy is the aggregated dataset D.
The aggregated dataset D keeps being merged with the newly sampled data. One question I have here is whether this continual aggregation will eventually make the dataset too large.
One thing worth noting is where the dataset Di comes from: its states are taken from trajectories collected with the current learner policy, and the expert policy then supplies the actions it would take in those states; these state-action pairs together form Di.
Another question is how the hyperparameter beta is set. There are actually N iterations, so there are N values of beta. To find out how beta is chosen you would have to look at the original paper and the related code; since this post is only introductory, knowing this much is fine.
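(For what it is worth, my reading of the original DAgger paper, Ross et al., 2011, is that the authors suggest an exponentially decaying schedule such as βi = p^(i-1), and note that the simple choice β1 = 1 with βi = 0 for i > 1 often works well in practice; the sketch earlier in this post uses the exponential schedule.)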
=========================================================
posted on 2022-03-11 23:14 by Angry_Panda