[Repost] Imitation Learning: Online and Offline Imitation Learning (original title: Imitate with Caution: Offline and Online Imitation)
An article on imitation learning that I came across while browsing online. Title:
I'm reposting it because it is quite accessible yet informative. My own understanding of imitation learning used to be fairly superficial, and I learned a fair amount from reading it.
Offline imitation learning is relatively easy to understand, but I had never quite grasped online imitation learning. This article presents a well-known online imitation learning algorithm, Dataset Aggregation (DAgger).
Link:
==================================================================
What’s Imitation Learning?
As the name itself suggests, almost every species, including humans, learns by imitating and then improvising. That’s evolution in one sentence. Similarly, we can make machines mimic us and learn from a human expert. Autonomous driving is a good example: we can make an agent learn from millions of driver demonstrations and mimic an expert driver.
Learning from demonstrations, also known as Imitation Learning (IL), is an emerging field in reinforcement learning and AI in general. The application of IL in robotics is ubiquitous: a robot can learn a policy by analysing demonstrations of that policy performed by a human supervisor.
Expert Absence vs Presence:
Imitation learning takes two directions depending on whether the expert is absent during training or present to correct the agent’s actions. Let’s first talk about the case where the expert is absent.
Expert Absence During Training
Absence of the expert basically means that the agent only has access to the expert demonstrations and nothing more. In these “expert absence” tasks, the agent uses a fixed training set of state-action pairs demonstrated by an expert to learn a policy whose actions are as similar as possible to the expert’s. Such tasks are also termed offline imitation learning tasks.
This problem can be framed as supervised learning by classification. The expert demonstrations contain numerous training trajectories, and each trajectory comprises a sequence of observations and the sequence of actions executed by the expert. These training trajectories are fixed and are not affected by the agent’s policy.
The learning problem can therefore be solved as plain supervised learning: we train a model that directly maps states to actions so as to mimic the expert through his or her demonstrations. We call this approach “behavioural cloning”.
Now we need a surrogate loss function that quantifies the difference between the demonstrated behaviour and the learned policy. We formulate the loss by maximising the expected log-likelihood of the expert’s actions under the learned policy.
Minimising the L2 error is equivalent to maximising the log-likelihood under a Gaussian.
We choose cross-entropy if we are solving a classification problem and the L2 loss if it is a regression problem. It is simple to see that minimising the L2 loss is equivalent to maximising the expected log-likelihood under a Gaussian distribution with fixed variance.
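To make this concrete, here is a minimal behavioural-cloning sketch in PyTorch (my own illustration, not code from the original post). It assumes a fixed set of expert state-action pairs with discrete actions and trains a small network with the cross-entropy loss; swapping in `nn.MSELoss()` for continuous actions gives the L2 / Gaussian-likelihood variant discussed above. The network sizes and hyperparameters are placeholder choices.

```python
# Minimal behavioural-cloning sketch (illustrative; sizes and hyperparameters are assumptions).
import torch
import torch.nn as nn

def behaviour_clone(expert_states, expert_actions, n_actions, epochs=20, lr=1e-3):
    """Fit a policy to a fixed set of expert (state, action) pairs by supervised learning.

    expert_states : FloatTensor of shape (N, state_dim)
    expert_actions: LongTensor of shape (N,) holding discrete action indices
    """
    state_dim = expert_states.shape[1]
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),              # logits over discrete actions
    )
    optimiser = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()            # use nn.MSELoss() for continuous actions (L2 loss)

    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(policy(expert_states), expert_actions)  # maximise expert-action log-likelihood
        loss.backward()
        optimiser.step()
    return policy
```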
Challenges
Everything looks fine so far, but one important shortcoming of behavioural cloning is generalisation. Experts only cover a subset of the infinitely many states an agent can experience. A simple example: an expert car driver never collects unsafe, risky states by drifting off course, yet an agent can encounter such risky states and has no corrective actions to fall back on, because there is no data for them. This happens because of “covariate shift”, a well-known challenge in which the states encountered during training differ from the states encountered during testing, reducing robustness and generalisation.
One way to address this covariate shift problem is to collect more demonstrations of risky states, but that can be prohibitively expensive. Having the expert present during training can solve the problem and bridge the gap between the demonstrated policy and the agent’s policy.
Expert Presence: Online Learning
In this section we introduce the best-known online imitation learning algorithm, Dataset Aggregation (DAgger). This method is very effective at closing the gap between the states encountered during training and the states encountered during testing, i.e. covariate shift.
What if the expert evaluates the learner’s policy during learning? The expert provides the correct actions to take for the states that arise from the learner’s own behaviour. This is exactly what DAgger does. The main advantage of DAgger is that the expert teaches the learner how to recover from past mistakes.
The steps are simple and similar to behavioural cloning, except that we keep collecting new trajectories based on what the agent has learned so far:
1. The policy is initialised by behavioural cloning of the expert demonstrations D, resulting in policy π1.
2. The agent uses π1 to interact with the environment and generate a new dataset D1 of trajectories, whose states the expert labels with the correct actions.
3. D = D ∪ D1: the newly generated dataset D1 is aggregated into the expert demonstrations D.
4. The aggregated dataset D is used to train a policy π2, and the process repeats.
To leverage the presence of the expert, a mixture of the expert and the learner is used to query the environment and collect the dataset: at each step the expert’s action is taken with probability β and the learner’s action with probability 1−β. Thus DAgger learns a policy from expert-labelled actions under the state distribution induced by the learned policy. Setting β = 0 means that all trajectories are generated by the learner agent alone.
Algorithm:
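The original post shows the DAgger pseudocode as a figure here. Below is a minimal Python sketch of the loop under stated assumptions: `env` exposes a classic Gym-style `reset()`/`step(action)` interface, `expert_action(state)` queries the expert policy Pi_E, and `behaviour_clone` is the supervised training routine sketched earlier. The number of iterations, rollout length, and beta values are illustrative, not values from the paper.

```python
# DAgger loop sketch (illustrative; env/expert interfaces and hyperparameters are assumptions).
import random
import torch

def dagger(env, expert_action, expert_states, expert_actions, n_actions,
           n_iters=5, rollout_steps=1000, betas=None):
    """Aggregate expert-labelled data from learner rollouts and retrain each iteration."""
    betas = betas or [0.0] * n_iters                      # beta = 0: roll out with the learner only
    D_states = [torch.as_tensor(s, dtype=torch.float32) for s in expert_states]
    D_actions = list(expert_actions)

    # 1. Initialise the policy by behavioural cloning on the expert demonstrations D.
    policy = behaviour_clone(torch.stack(D_states), torch.tensor(D_actions), n_actions)

    for i in range(n_iters):
        state = env.reset()
        for _ in range(rollout_steps):
            # 2. Act with a beta-mixture of the expert policy Pi_E and the current learner Pi_L.
            if random.random() < betas[i]:
                action = expert_action(state)
            else:
                with torch.no_grad():
                    s = torch.as_tensor(state, dtype=torch.float32)
                    action = int(policy(s).argmax())
            # 3. Label the visited state with the expert's action and aggregate: D = D ∪ Di.
            D_states.append(torch.as_tensor(state, dtype=torch.float32))
            D_actions.append(expert_action(state))
            state, _, done, _ = env.step(action)
            if done:
                state = env.reset()
        # 4. Retrain on the aggregated dataset D to obtain the next policy.
        policy = behaviour_clone(torch.stack(D_states), torch.tensor(D_actions), n_actions)
    return policy
```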
DAgger alleviates the problem of covariate shift (the state distribution induced by the learner’s policy is different from the state distribution in the initial demonstration data). This approach significantly reduces the size of the training dataset necessary to obtain satisfactory performance.
Conclusion
DAgger has seen extraordinary success in robotic control and has also been applied to controlling UAVs. Since the learner encounters many states in which the expert never demonstrated how to act, an online learning approach such as DAgger is essential in these applications.
In the next blog in this series, we will look at the shortcomings of the DAgger algorithm and, importantly, emphasise the safety aspects of DAgger-style algorithms.
=============================================================
My own analysis:
Dataset Aggregation (DAgger)
For the detailed steps of this algorithm, you still need to look at the algorithm description:
From the algorithm description you can see that three policies appear: Pi_E is the expert's policy, Pi_L is the learner's policy, and Pi is the policy used when sampling data.
The dataset used each time to train the new learner policy is the aggregated dataset D.
The aggregated dataset D is continually merged with the newly sampled data. One question I have here is whether this continual aggregation will eventually make the dataset too large.
One thing worth noting is where the dataset Di comes from: the states are taken from trajectories collected with the current learner policy, and the expert policy is then used to supply the actions the expert would take in those states; these state-action pairs make up Di. The relabelling step is sketched below.
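In code, that relabelling might look like the following one-liner, where `learner_rollout_states` (states visited by the current learner) and `expert_action` (a callable querying Pi_E) are assumed placeholders:

```python
# Di: states visited by the current learner policy, each labelled with the expert's action.
D_i = [(s, expert_action(s)) for s in learner_rollout_states]
```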
Another question is how the hyperparameter beta is set; there are N iterations, so there are actually N values of beta. To find out exactly how beta is chosen you probably have to go to the original paper and its code. This post is only introductory, so understanding this much is fine; a rough example of a schedule is sketched below.
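For reference only, one schedule that is often used in practice (and, if I recall correctly, is the kind of choice discussed in the original DAgger paper, so please verify against the paper) is to start with beta = 1 and decay it exponentially across the N iterations:

```python
# Illustrative beta schedule: beta_1 = 1 (pure expert on the first pass), then exponential decay.
N = 5          # number of DAgger iterations (example value)
p = 0.5        # decay factor (example value)
betas = [p ** i for i in range(N)]   # [1.0, 0.5, 0.25, 0.125, 0.0625]
```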
=========================================================