A Survey of RL
• A finite set of states $S_t$ summarizing the information the agent senses from the environment at every time step $t \in \{1, \ldots, T\}$.
• A set of actions $A_t$ which the agent can perform at each time step $t \in \{1, \ldots, T\}$ to interact with the environment.
• A set of transition probabilities between subsequent states which render the environment stochastic. Note: these probabilities are usually not modeled explicitly but result from the stochastic nature of the financial asset's price process.
• A reward (or return) function $R_t$ which provides a numerical feedback value $r_t$ to the agent in response to its action $A_{t-1} = a_{t-1}$ in state $S_{t-1} = s_{t-1}$.
• A policy $\pi$ which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent's rules for choosing actions.
• A value function $V$ which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy $\pi$.
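As a concrete illustration (not code from the surveyed papers), the components above can be sketched as a minimal agent-environment interface in Python. The class name, the placeholder dynamics, and the trading interpretation of the actions are all assumptions made for this sketch:

```python
import random
from dataclasses import dataclass

@dataclass
class TradingMDP:
    """Hypothetical toy environment mirroring the components listed above."""
    states: list         # finite state set, e.g. discretized market features
    actions: list        # action set, e.g. [-1, 0, 1] for short/flat/long
    gamma: float = 0.99  # discount factor used by the value function

    def step(self, state, action):
        """Stochastic transition: returns (next_state, reward).

        The transition probabilities are not modeled explicitly; the
        randomness here stands in for the asset's price process."""
        next_state = random.choice(self.states)   # placeholder dynamics
        reward = action * random.gauss(0.0, 1.0)  # placeholder P&L feedback
        return next_state, reward
```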
Given the above framework, the decision problem is formalized as finding the optimal policy $\pi = \pi^*$, i.e., the mapping from states to actions corresponding to the optimal value function $V^*$; see also Dempster et al. (2001) and Dempster and Romahi (2002):
$$V^*(s_t) = \max_{a_t} \mathbb{E}\big[ R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t \big] \tag{1}$$
Here, $\mathbb{E}$ denotes the expectation operator, $\gamma$ the discount factor, and $R_{t+1}$ the expected immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$. Further, $S_{t+1}$ denotes the next state of the agent. The value function can hence be understood as a mapping from states to discounted future rewards, which the agent seeks to maximize through its actions.
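When the transition probabilities and rewards are explicitly known (which, as noted above, is rarely the case for financial price processes), equation (1) can be solved by value iteration, i.e., by applying the fixed-point condition as an update rule. The sketch below is my own illustration on a made-up two-state MDP; the table `P`, the discount factor, and the iteration count are invented for the example:

```python
# Value iteration for equation (1) on a toy MDP with known dynamics.
# P[s][a] is a list of (probability, next_state, reward) triples (made up).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.5)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(1000):  # iterate the Bellman optimality backup to a fixed point
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

print(V)  # approximates V*(s) for each state
```

In the model-free setting considered by the survey, however, these dynamics are unknown, which motivates the Q-learning approach described next.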
To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:
$$Q^*(s_t, a_t) = \mathbb{E}\big[ R_{t+1} + \gamma \max_{a_{t+1}} Q^*(S_{t+1}, a_{t+1}) \mid S_t = s_t, A_t = a_t \big] \tag{2}$$
Here, the Q-value $Q^*(s_t, a_t)$ equals the expected immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$ plus the discounted future reward from continuing optimally thereafter.
The optimal policy $\pi^*$ (the mapping from states to actions) then simply becomes:
$$\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t) \tag{3}$$
i.e., in every state $S_t = s_t$, choose the action $A_t = a_t$ that yields the highest Q-value. To approximate the Q-function during (online) learning, an iterative optimization is carried out, with $\alpha$ denoting the learning rate; see Sutton and Barto (1998) for further details:
$$Q^*(s_t, a_t) \leftarrow (1 - \alpha)\, Q^*(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right) \tag{4}$$
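For illustration only (again, not code from the survey), update (4) together with an epsilon-greedy variant of policy (3) yields the familiar tabular Q-learning loop. The environment object is assumed to follow the hypothetical interface sketched earlier; the episode count, step count, and hyperparameters are arbitrary choices:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, steps=100, alpha=0.1, epsilon=0.1,
               start_state=0):
    """Tabular Q-learning: repeatedly applies update (4); returns the Q-table."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to zero

    for _ in range(episodes):
        s = start_state
        for _ in range(steps):
            # epsilon-greedy exploration around the greedy policy (3)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r = env.step(s, a)
            # one-step Q-learning update, equation (4)
            target = r + env.gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q

# Greedy policy extraction as in equation (3), e.g.:
# env = TradingMDP(states=[0, 1], actions=[-1, 0, 1])
# Q = q_learning(env)
# pi_star = {s: max(env.actions, key=lambda a: Q[(s, a)]) for s in env.states}
```

Because the toy dynamics above are pure noise, the learned Q-values carry no meaning in themselves; the point is only the mechanical correspondence between the loop body and equations (3) and (4).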