Is Reinforcement Learning Really Right for Your Application? On the Current Shortcomings of RL Research and Why It Is Hard to Put into Production: Why You (Probably) Shouldn't Use Reinforcement Learning
Original article:
Why You (Probably) Shouldn’t Use Reinforcement Learning
Link:
https://towardsdatascience.com/why-you-shouldnt-use-reinforcement-learning-163bae193da8
Translated version (ChatGPT 3.5):
There is a lot of hype around this technology, and for good reason: it is quite possibly one of the most important machine learning advances on the road to general AI. But beyond general interest, you may eventually come to the question, "Is it right for your application?"
I currently work on a team building vision-enabled robots, and as a former reinforcement learning researcher I was asked to answer that question for my team. Below are some of the reasons I think you may not want to use reinforcement learning in your application, or should at least think twice before walking down that path. Let's dive in!
Extremely noisy
Below are two learning curves from a game with a maximum score of 500. So which learning algorithm was better? Trick question: they are exactly the same, and the second run is simply a rerun of the first. The only difference between the training session that succeeded completely and learned a perfect policy and the one that failed miserably was the random seed.
Training curves for DQN (Deep Q-Network) on CartPole. Image by the author.
- Small changes in the random initialization can greatly affect training performance, so reproducing experimental results is challenging.
- The noise makes it very hard to compare algorithms, hyperparameter settings, and so on, because you cannot tell whether an improvement is due to a change you made or is just a random artifact.
- You need to run 20+ training sessions under exactly the same conditions to get consistent, robust results. This makes iterating on the algorithm very challenging (see the note below about how long these experiments can take!). A minimal multi-seed sketch follows this list.
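To make the multi-seed point concrete, here is a minimal sketch of running one fixed training configuration across several random seeds and reporting the spread of scores. It assumes the stable-baselines3 and gymnasium packages are installed; the seed count, step budget, and environment are illustrative choices, not the article's exact setup.

```python
import numpy as np
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

# Train the *same* DQN configuration under several random seeds;
# nothing changes between runs except the seed.
scores = []
for seed in range(5):  # 20+ seeds would be more robust, but far slower
    model = DQN("MlpPolicy", "CartPole-v1", seed=seed, verbose=0)
    model.learn(total_timesteps=50_000)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=20)
    scores.append(mean_reward)

# The spread across seeds is often large enough to swamp the effect of a
# hyperparameter change, which is exactly the reproducibility problem above.
print("per-seed scores:", [round(s, 1) for s in scores])
print(f"mean {np.mean(scores):.1f} +/- std {np.std(scores):.1f}")
```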
A large number of hyperparameters
One of the most successful algorithms out there right now is Soft Actor-Critic (SAC), which has nearly 20 hyperparameters to tune. Check for yourself! And that is not the end of it...
- In deep RL you have all the usual deep learning parameters related to the network architecture: number of layers, nodes per layer, activation function, max pooling, dropout, batch normalization, learning rate, and so on.
- On top of that, you have 10+ hyperparameters specific to RL: buffer size, entropy coefficient, discount factor (gamma), action noise, etc.
- And you have further "hyperparameters" in the form of reward shaping, used to get the agent to behave the way you want.
- Tuning even one of these can be very difficult! See the notes about extreme noise and long training times, then imagine tuning 30+ of them.
- As with most hyperparameter tuning, there is not always an intuitive setting for each of these, nor a foolproof way to find the best values efficiently. In practice you are shooting in the dark until something seems to work; a sketch of what that search looks like follows this list.
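To give a feel for the size of that search space, here is a minimal random-search sketch over a handful of SAC-style hyperparameters. The parameter names and ranges are illustrative assumptions rather than SAC's full list, and train_and_evaluate is a hypothetical stand-in for a real (slow, noisy) training run.

```python
import random

# A small subset of the knobs a SAC-style agent typically exposes;
# the ranges are illustrative guesses, not recommended values.
SEARCH_SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-5, -3),
    "gamma":         lambda: random.uniform(0.95, 0.999),
    "buffer_size":   lambda: random.choice([10_000, 100_000, 1_000_000]),
    "batch_size":    lambda: random.choice([64, 256, 1024]),
    "entropy_coef":  lambda: random.choice(["auto", 0.01, 0.1]),
    "hidden_layers": lambda: random.choice([[64, 64], [256, 256], [400, 300]]),
}

def train_and_evaluate(config: dict) -> float:
    # Hypothetical stand-in: a real version would train an agent with
    # `config` for hours and return an (extremely noisy) evaluation score.
    return random.random()

def random_search(n_trials: int = 20):
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):  # each trial is a full training run
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

print(random_search())
```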
Still in research and development
Because RL is in fact still in its early days, the research community is still working out how advances should be validated and shared. This causes headaches for those of us who want to use the findings and reproduce the results.
- Papers are ambiguous about implementation details. You cannot always find the source code, and it is not always clear how to turn some of the complex loss functions into code. Papers also tend to leave out the little hand-wavy tweaks used to get that superior performance.
- Once some code does make it onto the internet, for the reasons above, implementations differ slightly from one another. That makes it hard to compare the results you are getting with someone else's online. Is my comparatively poor performance because I introduced a bug, or because they used a trick I do not know about?
Hard to debug
- Recent methods throw the kitchen sink of techniques at the problem to get cutting-edge results. That makes it very hard to keep the code clean, which in turn makes it hard to follow other people's code, or even your own!
- Relatedly, because there are so many moving parts, it is very easy to introduce bugs and very hard to find them. RL often involves multiple networks learning at once, and there is a lot of randomness in the learning process, so a run may work one time and fail the next. Was it a bug you introduced, or a fluke of the random seed? Hard to say without running many more experiments. And that takes... time.
Extremely sample inefficient
Model-free learning means we do not try to build or learn a model of the environment, so the only way to learn a policy is to interact with the environment directly. On-policy means we can only learn from, and improve the policy with, samples collected by acting under the current policy; in other words, as soon as we run a single backward gradient update, we must throw all of those samples away and collect new ones. PPO, for example, is a model-free, on-policy, state-of-the-art algorithm. All of this means we have to interact with the environment a great deal (think millions of steps) before we learn a policy. The difference between the two sample regimes is sketched below.
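As a toy, runnable illustration of that difference (all numbers are arbitrary assumptions, not measurements): an on-policy learner uses each sample roughly once and then discards it, while an off-policy learner keeps transitions in a replay buffer and reuses them many times.

```python
import random

STEPS = 10_000     # environment interactions (illustrative budget)
BATCH = 64         # gradient-update batch size
ROLLOUT = 2_048    # PPO-style rollout length (illustrative)

# On-policy: each rollout feeds one update phase and is then thrown away,
# so every environment step is used for training roughly once.
on_policy_uses_per_sample = 1
on_policy_update_phases = STEPS // ROLLOUT

# Off-policy: every transition is stored in a replay buffer and resampled
# for many later updates, so each step is reused dozens of times.
replay_buffer = []
total_draws = 0
for step in range(STEPS):
    replay_buffer.append(step)               # stand-in for a stored transition
    batch = random.sample(replay_buffer, min(BATCH, len(replay_buffer)))
    total_draws += len(batch)

print(f"on-policy : ~{on_policy_uses_per_sample} use per sample, "
      f"{on_policy_update_phases} update phases")
print(f"off-policy: ~{total_draws / STEPS:.0f} uses per sample on average")
```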
Millions of steps may still be manageable when we have high-level state features in a relatively low-fidelity simulator. For example:
Image of the Humanoid environment, from https://gym.openai.com/.
Humanoid takes about 5 hours to learn to walk (2 million steps).
But as soon as we move to low-level features such as image space, the state space grows dramatically, which means our network must grow dramatically too; for example, we have to use convolutional neural networks (CNNs). The parameter-count sketch below makes that jump concrete.
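As a rough, assumed comparison (not a figure from the article), the sketch below counts the parameters of a small MLP over a low-dimensional state vector versus the commonly used Atari-style CNN over stacked 84x84 image frames, using PyTorch.

```python
import torch.nn as nn

# Policy network over a low-dimensional state (e.g. Humanoid's ~376-dim vector).
mlp = nn.Sequential(
    nn.Linear(376, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 17),   # 17 action dimensions, as in Humanoid
)

# The classic Atari-style CNN over four stacked 84x84 grayscale frames.
cnn = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, 18),   # full Atari action set
)

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(f"MLP parameters: {n_params(mlp):,}")   # on the order of 10^5
print(f"CNN parameters: {n_params(cnn):,}")   # on the order of 10^6
```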
Atari Phoenix. Image by mybrainongames.com.
Atari games such as Phoenix can take around 12 hours (40–200 million steps).
And things get even worse once we start introducing 3D high-fidelity simulators such as CARLA.
CARLA Driving Simulator. Image by Unreal Engine
Training a car to drive in CARLA takes roughly 3 to 5 days (2 million steps) with a GPU.
And it gets even worse if the policy is notably complex.
In 2018, OpenAI trained an agent that beat the world champions at DOTA 2. How long did that agent take to train, you ask? Ten months.
What if we wanted to train in the real world instead of in a simulator? Here we are bound by real-time steps (whereas before we could simulate steps faster than real time). Training could take weeks or, even worse, simply be intractable. For more on this, look up "the deadly triad of RL". A back-of-the-envelope estimate of the real-time cost follows.
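As a back-of-the-envelope estimate with assumed numbers (a 20 Hz control loop and the roughly 2 million steps quoted above; neither figure comes from the article): real-world data collection alone already runs into days per training run, before multiplying by the 20+ seeds suggested earlier.

```python
# Rough real-time cost of collecting RL experience on physical hardware.
# All numbers are assumptions for illustration only.
steps_needed = 2_000_000   # sample budget similar to the examples above
control_hz = 20            # environment steps per second on a real robot
seeds = 20                 # repeated runs needed for robust results

hours_per_run = steps_needed / control_hz / 3600
print(f"one run : {hours_per_run:.1f} hours of pure interaction time")  # ~27.8 h
print(f"{seeds} seeds: {hours_per_run * seeds / 24:.1f} days")          # ~23 days

# And this ignores resets, battery swaps, hardware wear, and human supervision.
```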
The sim-to-real gap
What if we wanted to train in a simulator and then deploy in the real world? That is the situation for most robotics applications. However, even if an agent learns to perform well in a simulator, that does not necessarily mean it will transfer to the real-world application; it depends on how good the simulator is. Ideally, we would make the simulator as close to real life as possible, but see the previous section for the problem with high-fidelity simulators.
Unpredictability and unexplainability
- Even a well-trained RL agent can behave unpredictably in the wild. We can try to punish catastrophic behaviors severely, but we still have no guarantee that the agent will never choose such an action, because in the end we are only optimizing the expectation of total reward (a small numeric illustration follows this list).
- Explainability: this is more a problem with deep learning in general, but in reinforcement learning it takes on new importance, because the network is often deciding how to move physical machinery that can damage people or property (as in self-driving or robotics). An RL agent may make a disastrous control decision and we have no idea exactly why, which in turn means we do not know how to prevent it in the future.
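To make the expectation point concrete, here is a tiny numeric sketch with my own illustrative numbers: a policy that takes a catastrophic action 1% of the time can still have a higher expected return than a safe one, so maximizing expected reward alone does not rule the catastrophe out.

```python
# Two hypothetical policies compared purely by expected total reward.
safe_return = 100.0            # the safe policy always earns 100

p_catastrophe = 0.01           # the risky policy fails 1% of the time
catastrophe_penalty = -1_000.0 # heavy punishment for the disaster
risky_good_return = 120.0      # slightly better when nothing goes wrong

risky_expected = (1 - p_catastrophe) * risky_good_return + p_catastrophe * catastrophe_penalty
print(f"safe  policy expected return: {safe_return:.1f}")      # 100.0
print(f"risky policy expected return: {risky_expected:.1f}")   # 108.8

# An agent that maximizes expected reward prefers the risky policy even
# though it occasionally does something disastrous.
```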
Conclusion
Well, I don't know whether that was depressing or a buzzkill to read. I partly meant it as a reality check to cut through the hype, so I did go fairly hard. But I should also qualify all of these points with the fact that these issues are exactly why this is such a hot research area, and people are actively working on many, if not all, of these pain points. That makes me optimistic about the future of RL, but recognizing that these problems still exist is what makes me a realistic optimist.
Believe it or not, I would not totally rule out RL for industrial applications... it really is excellent when it works. I just want to make sure you know what you are getting yourself into, so that you do not overpromise and underestimate the timeline.
English original:
There is a lot of hype around this technology. And for good reason… it’s quite possibly one of the most important machine learning advancements towards enabling general AI. But outside of general interest, you may eventually come to the question of, “is it right for your application”?
I am currently working on a team for vision enabled robotics and as a past researcher in RL, I was asked to answer this question for my team. Below, I’ve outlined some of the reasons I think you may not want to use reinforcement learning in your application, or at least think twice before walking down that path. Let’s dive in!
Extremely Noisy
Below are two learning plots from a game which has a max score of 500. So which learning algorithm was better? Trick question. They were the exact same, the second run is just a rerun of the first. The only difference between the one training session that totally rocked it and learned a perfect policy, and the other, that miserably failed, was the random seed.
Training curves for DQN on CartPole. Image by Author
- Small changes in random initialization can greatly affect training performance so reproducibility of experimental results is challenging.
- Being noisy makes it very hard to compare algorithms, hyperparameter settings, etc because you don’t know if improved performance is because of the change you made or just a random artifact.
- You need to run 20+ training sessions under the exact same conditions to get consistent/robust results. This makes iterating on your algorithm very challenging (see note below about how long these experiments can take!)
Large amount of hyperparameters
One of the most successful algorithms on the market right now is Soft Actor-Critic (SAC), which has nearly 20 hyperparameters to tune. Check for yourself! But that's not the end of it…
- In deep RL, you have all the normal deep learning parameters related to network architecture: number of layers, nodes per layer, activation function, max pool, dropout, batch normalization, learning rate, etc.
- Additionally, you have 10+ hyperparameters specific to RL: buffer size, entropy coefficient, gamma, action noise, etc
- Additionally, you have “Hyperparameters” in the form of reward shaping (RewardArt) to get the agent to act as you want it to.
- Tuning even one of these can be very difficult! See notes about extremely noisy, long training time… imagine tuning 30+.
- As with most hyperparameter tuning, there's not always an intuitive setting for each of these or a foolproof way to most efficiently find the best hyperparameters. You're really just shooting in the dark until something seems to work.
Still in research and development
As RL is still actually in its budding phases, the research community is still working out the kinks in how advancements are validated and shared. This causes headaches for those of us that want to use the findings and reproduce the results.
- Papers are ambiguous in implementation details. You can't always find the code and it's not always clear how to turn some of the complex loss functions into code. And papers also seem to leave out little hand-wavy tweaks they used to get that superior performance.
- Once some code does get out there on the interwebs, because of the reason listed above, these differ slightly in implementation. This makes it hard to compare results you’re getting to someone else’s online. Is my comparatively bad performance because I introduced a bug or because they used a trick I don’t know about?
Hard to debug
- Recent methods use the kitchen sink of techniques to get cutting edge results. This makes it really hard to have clean code, which subsequently makes it hard to track others' code or even your own!
- On a related note, because there are so many moving parts, it's really easy to introduce bugs and really hard to find them. RL often has multiple networks learning. And there's a lot of randomness in the learning process so things may work one run and may not the next. Was it because of a bug you introduced or because of a fluke in the random seed? Hard to say without running many more experiments. Which takes…. TIME.
Extremely sample inefficient
Model free learning means we don’t try to build/learn a model of the environment. So the only way we learn a policy is by interacting directly with the environment. On-policy means that we can only learn/improve our policy with samples taken from acting with our current policy, ie we have to throw away all these samples and collect new ones as soon as we run a single backward gradient update. PPO is, for example, a model-free on-policy state-of-the-art algorithm. All this means that we have to interact with the environment a lot (like millions of steps) before learning a policy.
This may be passable if we have high-level features in a relatively low-fidelity simulator. For example,
Image of Humanoid Environment by https://gym.openai.com/
Humanoid takes 5 hours to learn how to walk (2 mil steps)
But as soon as we move to low-level features, like image space, our state space grows a lot which means our network must grow a lot, eg we must use CNN’s.
Atari Phoenix. Image by mybrainongames.com
Atari games such as Phoenix take 12(?) hours (40–200 mil steps)
And things get even worse when we start introducing 3D high-fidelity simulators like CARLA.
CARLA Driving Simulator. Image by Unreal Engine
Training a car to drive in CARLA takes ~3–5 days (2 mil steps) with a GPU
And even worse if the policy is notably complex.
In 2018, OpenAI trained an agent that beat the world champions at DOTA 2. How long did the agent take to train you ask? 10 months
What if we wanted to train in the real world instead of a simulator? Here, we are bound by real-time time steps (whereas before we could simulate steps faster than real time). This could take weeks or, even worse, just be entirely intractable. For more on this, look up "the deadly triad of RL".
Sim 2 real gap
What if we wanted to train in a simulator and then deploy in the real world? This is the case with most robotics applications. However, even if an agent learns to play well in a simulator, it doesn’t necessarily mean that it will transfer to real world applications. Depends how good the simulator is. Ideally, we’d make the simulator as close to real life as possible. But see the last section to see the problem with high-fidelity simulators.
Unpredictability & Inexplainability
- Even a well trained RL agent can be unpredictable in the wild. We may try to punish disastrous behaviors severely but we still don't have a guarantee that the agent won't still choose that action since, in the end, we are only optimizing the expectation of total reward.
- Explainability: this is more a problem with DL in general, but in reinforcement learning, this issue takes on a new importance since the networks are often choosing how to move physical machinery that could physically damage people or property (as in the case of self driving or robotics). The RL agent may make a disastrous control decision and we have no idea exactly why, which in turn means we don’t know how to prevent it in the future.
Conclusion
Well, I don’t know if that was depressing or a buzz kill for you to read. I kind of meant it to be a reality check to cut through the hype so I did go pretty hard. But I should also disclaim all these points with the fact that these issues are the very reason why it is such a hot research area and people are actively working on many, if not all, of these pain points. This makes me optimistic for the future of RL but realizing that these are still problems is what makes me a realistic optimist.
Believe it or not, I wouldn’t totally discount RL for industrial applications… it is really awesome when it works. I’d just make sure you know what you’re getting yourself into so you don’t overpromise and underestimate the timeline.