The Relationship Between the Multi-armed Bandit Problem and Reinforcement Learning
Excerpted from Reinforcement Learning: An Introduction, 2nd edition (2016 draft), Chapter 2
https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf
The book's introduction leads into Chapter 2 as follows:
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.
One of the challenges of reinforcement learning, not shared by other kinds of learning, is handling the trade-off between exploration and exploitation. To obtain a lot of reward, the agent tends to prefer actions that it has tried before and found to yield high reward; but to discover such actions, it has to try actions it has never selected. In other words, the agent must exploit what it already knows in order to obtain reward, while also exploring so that it can make better action selections in the future. The dilemma is that pursuing either exploration or exploitation exclusively leads to failure on the task, so the agent should keep trying a variety of actions while progressively favoring those that currently look best. On a stochastic task, each action has to be tried many times before its expected reward can be estimated reliably. The exploration-exploitation dilemma has been studied by mathematicians for decades (see Chapter 2). For now, simply note that the problem of balancing exploration and exploitation does not even arise in supervised or unsupervised learning, at least in their purest forms.
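To make the balance concrete, here is a minimal sketch (my own illustration, not code from the book) of the ε-greedy rule that Chapter 2 uses for exactly this trade-off: with probability epsilon the agent explores a uniformly random action, otherwise it exploits an action with the highest current reward estimate. The names `epsilon` and `q_estimates` are illustrative assumptions.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Choose an action index given the current action-value estimates.

    With probability `epsilon` we explore: pick an action uniformly at random.
    Otherwise we exploit: pick an action whose estimate is currently highest,
    breaking ties at random.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))            # explore
    best = max(q_estimates)
    candidates = [a for a, q in enumerate(q_estimates) if q == best]
    return random.choice(candidates)                          # exploit
```

Setting epsilon to 0 gives a purely greedy (exploit-only) agent and setting it to 1 gives pure exploration; a small positive value keeps trying every action occasionally so that, as the quote notes, each action is sampled often enough for its estimated reward to become reliable.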
Chapter 2 itself opens like this:
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. There are also interesting intermediate cases in which evaluation and instruction blend together.
What most distinguishes reinforcement learning from other types of learning is that its training information evaluates the actions actually taken (through the resulting reward) rather than instructing the learner by giving the correct actions. This is what creates the need for active exploration, an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how much reward a chosen action yields, but not whether it was the best or the worst action possible. Purely instructive feedback, by contrast, indicates the correct action to take, independently of the action actually taken; this kind of feedback is the basis of supervised learning. The two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of it. There are also interesting cases that sit between the two.
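A tiny contrast may help (illustrative only, with made-up reward values): under instructive feedback the learner is told the correct action regardless of what it chose, whereas under evaluative feedback it only observes the reward of the action it actually took and learns nothing directly about the alternatives.

```python
import random

true_means = [0.1, 0.5, 0.9]        # hypothetical mean rewards of three actions
chosen = 0                          # the learner happened to pick action 0

# Instructive feedback (supervised learning): the correct action is revealed,
# independently of the action actually taken.
instructive_signal = max(range(len(true_means)), key=lambda a: true_means[a])  # -> 2

# Evaluative feedback (bandit / reinforcement learning): only the noisy reward of
# the chosen action is observed; whether a better action existed stays unknown.
evaluative_signal = random.gauss(true_means[chosen], 1.0)

print("instructive:", instructive_signal, "evaluative:", round(evaluative_signal, 2))
```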
In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case will enable us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.
This chapter studies the evaluative aspect of reinforcement learning in a simplified setting, meaning one that does not involve learning to act in more than one situation. Most prior work on evaluative feedback has been done in this nonassociative setting, which avoids much of the complexity of the full reinforcement learning problem. Studying it helps us understand evaluative feedback and how it can be combined with instructive feedback.
The particular nonassociative, evaluative feedback problem that we explore is a simple version of the k-armed bandit problem. We can use this problem to introduce a number of basic learning methods which we extend in later chapters to apply to the full reinforcement learning problem. At the end of this chapter, we take a step closer to the full reinforcement learning problem by discussing what happens when the bandit problem becomes associative, that is, when actions are taken in more than one situation.
The particular nonassociative, evaluative-feedback problem we explore is a simple version of the k-armed bandit problem. It is used to introduce a number of basic learning methods that later chapters extend to the full reinforcement learning problem. At the end of the chapter, the bandit problem is extended to the associative case, in which actions are taken in more than one situation, bringing us a step closer to full reinforcement learning.
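To give the setting a concrete shape, the following is a rough, self-contained sketch (under my own assumptions, not the book's code) of a stationary k-armed bandit with Gaussian rewards, learned with ε-greedy action selection and incremental sample-average estimates, Q <- Q + (R - Q)/N. The function name `run_bandit` and its parameters are illustrative choices.

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Simulate one run on a stationary k-armed bandit (a toy testbed).

    Each arm's true mean reward q*(a) is drawn once from N(0, 1); pulling an
    arm returns q*(a) plus N(0, 1) noise. Action-value estimates are sample
    averages maintained incrementally: Q <- Q + (R - Q) / N.
    """
    rng = random.Random(seed)
    true_means = [rng.gauss(0.0, 1.0) for _ in range(k)]    # q*(a) for each arm
    q_estimates = [0.0] * k                                  # Q(a), current estimates
    counts = [0] * k                                         # N(a), pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(k)
        else:
            best = max(q_estimates)
            action = rng.choice([a for a, q in enumerate(q_estimates) if q == best])

        reward = rng.gauss(true_means[action], 1.0)          # noisy reward of chosen arm
        counts[action] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N.
        q_estimates[action] += (reward - q_estimates[action]) / counts[action]
        total_reward += reward

    return total_reward / steps, q_estimates

if __name__ == "__main__":
    avg_reward, estimates = run_bandit()
    print("average reward per step:", round(avg_reward, 3))
```

The associative extension mentioned above would add a situation observed before each choice and keep separate estimates per situation, which is the direction the rest of the book takes.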
Summary:
The multi-armed bandit problem (also called the k-armed bandit problem) is not the full reinforcement learning problem, only a simplified version of it. That is why the book uses the bandit problem as its lead-in to reinforcement learning: many of the concepts in reinforcement learning are extensions of concepts introduced there.