Temporal-Difference Learning for Prediction
In Monte Carlo Learning, we've got the estimation of value function:
Gt is the episode return from time t, which can be calculated by:
Please recall, Gt can be only calculated at the end of a given episode. This reveals a disadvantage of Monte Carlo Learning: have to wait until the end of episodes.
TD(0) algorithm replace Gt of the equation to the immediate reward and estimated value function of the next state:
The algorithm updates the Estimated State-Value Function at time t+1, because everything in the equation is determined. This means we will wait until the agent reaching the next state, so that the agent can get the immediate reward Rt+1 and know which state the system will transition to at time t+1.
The equations below are State-Value Function for Dynamic Programming, in which the whole environment is known. Compare to these equations:
TD algorithm is quite like 6.4 Bellman Equation, but it does not take expectation. Instead, it uses the knowledge till now to estimate how much reward I am going to get from this state. The whole algorithm can be demonstrated as:
TD Target, TD Error
Bias/ Viriance trade-off
Bootstraping
Temporal-Difference Learning for Prediction的更多相关文章
- 【PPT】 Least squares temporal difference learning
最小二次方时序差分学习 原文地址: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd= ...
- PP: Multi-Horizon Time Series Forecasting with Temporal Attention Learning
Problem: multi-horizon probabilistic forecasting tasks; Propose an end-to-end framework for multi-ho ...
- [Reinforcement Learning] Model-Free Prediction
上篇文章介绍了 Model-based 的通用方法--动态规划,本文内容介绍 Model-Free 情况下 Prediction 问题,即 "Estimate the value funct ...
- [Machine Learning] 机器学习常见算法分类汇总
声明:本篇博文根据http://www.ctocio.com/hotnews/15919.html整理,原作者张萌,尊重原创. 机器学习无疑是当前数据分析领域的一个热点内容.很多人在平时的工作中都或多 ...
- (转) Deep Learning Research Review Week 2: Reinforcement Learning
Deep Learning Research Review Week 2: Reinforcement Learning 转载自: https://adeshpande3.github.io/ad ...
- Awesome Reinforcement Learning
Awesome Reinforcement Learning A curated list of resources dedicated to reinforcement learning. We h ...
- Machine Learning 学习笔记1 - 基本概念以及各分类
What is machine learning? 并没有广泛认可的定义来准确定义机器学习.以下定义均为译文,若以后有时间,将补充原英文...... 定义1.来自Arthur Samuel(上世纪50 ...
- Distributional Reinforcement Learning with Quantile Regression
郑重声明:原文参见标题,如有侵权,请联系作者,将会撤销发布! arXiv:1710.10044v1 [cs.AI] 27 Oct 2017 In AAAI Conference on Artifici ...
- 3. Distributional Reinforcement Learning with Quantile Regression
C51算法理论上用Wasserstein度量衡量两个累积分布函数间的距离证明了价值分布的可行性,但在实际算法中用KL散度对离散支持的概率进行拟合,不能作用于累积分布函数,不能保证Bellman更新收敛 ...
随机推荐
- codeforces 108D Basketball Team(简单组合)
D. Basketball Team time limit per test 2 seconds memory limit per test 256 megabytes input standard ...
- react 详细解析学习笔记
React的介绍: React来自于Facebook公司的开源项目 React 可以开发单页面应用 spa(单页面应用) react 组件化模块化 开发模式 React通过对DOM的模拟 ...
- R语言-三种方法绘制单位圆
与一般开发语言不同,R以数据统计分析和绘图可视化为主要卖点.本文是第一篇博客,解决一个简单的绘图问题,以练手为目的. 以下直接给出三种单位圆的画法: 方法1 f=seq(,*pi,0.001) x=s ...
- dede后台系统基本参数空白怎么办?
dede后台系统基本参数空白怎么办? 如图: 解决办法:还原dede_sysconfig表即可 后台 系统-SQL命令行工具,执行如下sql delete table dede_sysconfig ...
- 超赞的Linux软件分享(持续更新)
开发 Android studio - Android 的官方 IDE:Android Studio 提供在各种类型的安卓设备上构建应用最快的工具. Aptana - Aptana Studio 利用 ...
- mysql WHERE语句 语法
mysql WHERE语句 语法 作用:如需有条件地从表中选取数据,可将 WHERE 子句添加到 SELECT 语句.珠海大理石平尺 语法:SELECT 列名称 FROM 表名称 WHERE 列 运算 ...
- POJ 2112 Optimal Milking ( 经典最大流 && Floyd && 二分 )
题意 : 有 K 台挤奶机器,每台机器可以接受 M 头牛进行挤奶作业,总共有 C 头奶牛,机器编号为 1~K,奶牛编号为 K+1 ~ K+C ,然后给出奶牛和机器之间的距离矩阵,要求求出使得每头牛都能 ...
- 【bzoj3676】[Apio2014]回文串
*题目描述: 考虑一个只包含小写拉丁字母的字符串s.我们定义s的一个子串t的“出现值”为t在s中的出现次数乘以t的长度.请你求出s的所有回文子串中的最大出现值. *输入: 输入只有一行,为一个只包含小 ...
- 【PowerOJ1752&网络流24题】运输问题(费用流)
题意: 思路: [问题分析] 费用流问题. [建模方法] 把所有仓库看做二分图中顶点Xi,所有零售商店看做二分图中顶点Yi,建立附加源S汇T. 1.从S向每个Xi连一条容量为仓库中货物数量ai,费用为 ...
- C# 字符串的长度问题
string str = "aa奥奥"; 如果直接取 str.length,取的就是字符的长度,一个汉字也是一个字符,长度就是4. 一个汉字是两个字节,如果需要统计字节数,可以用下 ...