Reinforcement Learning (6): n-step Bootstrapping
n-step Bootstrapping
n-step methods unify Monte Carlo (MC) and one-step TD methods. They also serve as an introduction to eligibility traces, which enable bootstrapping over many time intervals simultaneously.
n-step TD Prediction
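One-step TD methods update using only the next reward and bootstrap from the value of the next state, whereas MC methods update using the entire reward sequence of an episode. n-step methods lie between the two: they use the next n rewards and then bootstrap from the value of the state reached n steps later. Methods that use n-step updates are called n-step TD methods.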
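To estimate \(v_{\pi}(S_t)\), MC methods use the complete return:
\[ G_t \dot = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \]
One-step TD methods use the one-step return:
\[ G_{t:t+1} \dot = R_{t+1} + \gamma V_t(S_{t+1}) \]
The n-step return is then:
\[ G_{t:t+n} \dot = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) \]
where \(n\ge 1, 0\le t< T-n\). If \(t+n \ge T\), the terms beyond termination are zero and the n-step return equals the complete return \(G_t\).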
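Because \(R_{t+n}\) and \(V_{t+n-1}\) only become available at time t+n, the update is defined as:
\[ V_{t+n}(S_t) \dot = V_{t+n-1}(S_t) + \alpha\left[G_{t:t+n} - V_{t+n-1}(S_t)\right], \quad 0\le t < T \]
while the values of all other states remain unchanged: \(V_{t+n}(s) = V_{t+n-1}(s)\) for all \(s \ne S_t\).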
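As a rough illustration of the n-step return defined above, the following Python sketch computes \(G_{t:t+n}\) for a recorded episode. The function name `n_step_return` and the list layout (`rewards[k]` holds \(R_{k+1}\), `values[k]` holds \(V(S_k)\)) are assumptions made for this example, not part of the original notes.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_{t:t+n} for a finished episode.

    Assumed layout: rewards[k] holds R_{k+1}, so T = len(rewards);
    values[k] holds the current estimate V(S_k) for k < T.
    The terminal state's value is taken to be 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of R_{t+1}, ..., R_{end}.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from V(S_{t+n}) only if the episode has not ended by then.
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g


rewards = [1.0, 2.0, 3.0]      # R_1, R_2, R_3
values = [10.0, 20.0, 30.0]    # V(S_0), V(S_1), V(S_2)
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
full = n_step_return(rewards, values, t=0, n=10, gamma=0.5)
```

With n = 1 this reduces to the one-step TD target, and once t + n reaches T it reduces to the complete (MC) return.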
# n-step TD for estimating V = v_pi
Input: a policy pi
Algorithm parameters: step size alpha in (0,1], a positive integer n
Initialize V(s) arbitrarily, for all s in S
All store and access operations (for S_t and R_t) can take their index mod n+1
Loop for each episode:
    Initialize and store S_0 != terminal
    T = infty
    Loop for t = 0,1,2,...:
        if t < T, then:
            Take an action according to pi(.|S_t)
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then T = t + 1
        tau = t - n + 1   (tau is the time whose state's estimate is being updated)
        if tau >= 0:
            G = sum_{i = tau+1}^{min(tau+n,T)} gamma^{i-tau-1} R_i
            if tau + n < T, then G = G + gamma^n V(S_{tau+n})
            V(S_tau) = V(S_tau) + alpha [G - V(S_tau)]
    Until tau = T - 1
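The pseudocode above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the environment is a hypothetical callable `step(state, action)` returning `(next_state, reward, done)`, values live in a dict that defaults to 0, and the whole episode is stored in lists rather than in a buffer indexed mod n+1.

```python
def n_step_td(step, policy, start_state, n, alpha, gamma, episodes, V=None):
    """n-step TD prediction; returns a dict of state-value estimates."""
    V = {} if V is None else V
    for _ in range(episodes):
        states = [start_state]    # states[k] = S_k
        rewards = [0.0]           # rewards[k] = R_k (index 0 is unused)
        T = float("inf")
        t = 0
        while True:
            if t < T:
                next_state, reward, done = step(states[t], policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1       # time whose state estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V.get(states[tau + n], 0.0)
                s = states[tau]
                V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
            if tau == T - 1:
                break
            t += 1
    return V


def chain_step(state, action):
    """Hypothetical 4-step chain: always move right, reward 1 per step."""
    next_state = state + 1
    return next_state, 1.0, next_state == 4


V = n_step_td(chain_step, lambda s: "right", start_state=0,
              n=2, alpha=0.5, gamma=1.0, episodes=100)
```

On this deterministic chain with gamma = 1, the estimate for state s should approach 4 - s, the number of remaining steps.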
n-step Sarsa
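n-step Sarsa is analogous to n-step TD, except that it works with state-action pairs instead of states:
\[ G_{t:t+n} \dot = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \quad n\ge 1, 0\le t< T-n \]
The update follows naturally:
\[ Q_{t+n}(S_t,A_t) \dot = Q_{t+n-1}(S_t,A_t) + \alpha\left[G_{t:t+n} - Q_{t+n-1}(S_t,A_t)\right], \quad 0\le t < T \]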
# n-step Sarsa for estimating Q = q* or q_pi
Initialize Q(s,a) arbitrarily, for all s in S, a in A
Initialize pi to be epsilon-greedy with respect to Q, or to a fixed given policy
Algorithm parameters: step size alpha in (0,1], small epsilon > 0, a positive integer n
All store and access operations (for S_t, A_t and R_t) can take their index mod n+1
Loop for each episode:
    Initialize and store S_0 != terminal
    Select and store an action A_0 from pi(.|S_0)
    T = infty
    Loop for t = 0,1,2,...:
        if t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T = t + 1
            else:
                Select and store an action A_{t+1} from pi(.|S_{t+1})
        tau = t - n + 1   (tau is the time whose estimate is being updated)
        if tau >= 0:
            G = sum_{i = tau+1}^{min(tau+n,T)} gamma^{i-tau-1} R_i
            if tau + n < T, then G = G + gamma^n Q(S_{tau+n}, A_{tau+n})
            Q(S_tau,A_tau) = Q(S_tau,A_tau) + alpha [G - Q(S_tau,A_tau)]
            if pi is being learned, then ensure that pi(.|S_tau) is epsilon-greedy wrt Q
    Until tau = T - 1
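A possible Python rendering of n-step Sarsa is sketched below. The names `epsilon_greedy`, `n_step_sarsa`, and the `select_action(Q, state)` callback are hypothetical, as is the environment interface `step(state, action) -> (next_state, reward, done)`; Q is a dict keyed by `(state, action)` that defaults to 0, and the episode is stored in full rather than mod n+1.

```python
import random


def epsilon_greedy(Q, state, action_set, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(action_set)
    return max(action_set, key=lambda a: Q.get((state, a), 0.0))


def n_step_sarsa(step, select_action, start_state, n, alpha, gamma,
                 episodes, Q=None):
    """n-step Sarsa; select_action(Q, state) plays the role of pi."""
    Q = {} if Q is None else Q
    for _ in range(episodes):
        states = [start_state]                     # states[k] = S_k
        taken = [select_action(Q, start_state)]    # taken[k] = A_k
        rewards = [0.0]                            # rewards[k] = R_k
        T = float("inf")
        t = 0
        while True:
            if t < T:
                next_state, reward, done = step(states[t], taken[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    taken.append(select_action(Q, next_state))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q.get((states[tau + n], taken[tau + n]), 0.0)
                key = (states[tau], taken[tau])
                Q[key] = Q.get(key, 0.0) + alpha * (G - Q.get(key, 0.0))
            if tau == T - 1:
                break
            t += 1
    return Q


def corridor_step(state, action):
    """Hypothetical corridor: states 0..4, goal at 4, reward -1 per step."""
    next_state = max(0, state + action)
    return next_state, -1.0, next_state == 4


# Evaluate the fixed policy "always move right" (action +1).
Q_fixed = n_step_sarsa(corridor_step, lambda Q, s: 1, start_state=0,
                       n=3, alpha=0.5, gamma=1.0, episodes=100)
```

For control, `select_action` could instead be `lambda Q, s: epsilon_greedy(Q, s, [-1, 1], 0.1, rng)` with `rng = random.Random(0)`; because the policy reads Q on every call, it stays epsilon-greedy with respect to the latest estimates.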
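For n-step Expected Sarsa, the return is:
\[ G_{t:t+n} \dot = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \bar V_{t+n-1}(S_{t+n}), \quad t+n < T \]
where \(\bar V_t(s)\) is the expected approximate value of state s under the target policy:
\[ \bar V_t(s) \dot = \sum_{a}\pi(a|s)Q_t(s,a) \]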
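The Expected Sarsa return can be illustrated with a short Python sketch. The helper names and data layout are assumptions for this example: `pi(a, s)` returns the target-policy probability of action a in state s, `rewards[k]` holds \(R_{k+1}\), `states[k]` holds \(S_k\), and Q is a dict keyed by `(state, action)`.

```python
def expected_value(Q, pi, state, action_set):
    """V_bar(s) = sum_a pi(a|s) Q(s, a)."""
    return sum(pi(a, state) * Q.get((state, a), 0.0) for a in action_set)


def n_step_expected_sarsa_return(rewards, states, Q, pi, action_set,
                                 t, n, gamma):
    """G_{t:t+n} with an expectation over actions at the final state."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        g += gamma ** n * expected_value(Q, pi, states[t + n], action_set)
    return g


uniform = lambda a, s: 0.5                 # two equally likely actions
Q = {(1, "a"): 2.0, (1, "b"): 4.0}
g1 = n_step_expected_sarsa_return([1.0, 1.0, 1.0], [0, 1, 2, 3], Q,
                                  uniform, ["a", "b"], t=0, n=1, gamma=0.5)
```

The only difference from the n-step Sarsa return is the last term, which averages over all actions at \(S_{t+n}\) instead of using the sampled action \(A_{t+n}\).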
n-step Off-policy Learning by Importance Sampling
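A simple off-policy version of n-step TD:
\[ V_{t+n}(S_t) \dot = V_{t+n-1}(S_t) + \alpha\rho_{t:t+n-1}\left[G_{t:t+n} - V_{t+n-1}(S_t)\right], \quad 0\le t < T \]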
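where \(\rho_{t:t+n-1}\) is the importance sampling ratio, the relative probability under the target policy \(\pi\) and the behavior policy b of taking the n actions from \(A_t\) to \(A_{t+n-1}\). In general:
\[ \rho_{t:h} \dot = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \]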
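A small Python sketch of the importance sampling ratio follows. The function name and the convention that `pi(a, s)` and `b(a, s)` return action probabilities are assumptions for this illustration.

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h}: product of pi/b over steps t .. min(h, T-1)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho


# Target policy: always action 0. Behavior policy: uniform over {0, 1}.
target = lambda a, s: 1.0 if a == 0 else 0.0
behavior = lambda a, s: 0.5
states, actions, T = [0, 1, 2], [0, 0, 1], 3

rho_01 = importance_ratio(target, behavior, states, actions, t=0, h=1, T=T)
rho_02 = importance_ratio(target, behavior, states, actions, t=0, h=2, T=T)
```

If any action in the window would never be taken by the target policy, the ratio is zero and the corresponding return is ignored; actions the target policy favors more than the behavior policy are weighted up.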
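The off-policy n-step Sarsa update takes the form:
\[ Q_{t+n}(S_t,A_t) \dot = Q_{t+n-1}(S_t,A_t) + \alpha\rho_{t+1:t+n}\left[G_{t:t+n} - Q_{t+n-1}(S_t,A_t)\right], \quad 0\le t < T \]
The ratio starts and ends one step later than in n-step TD because a state-action pair is being updated: \(A_t\) is already given, so only the subsequent actions need to be reweighted.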
# Off-policy n-step Sarsa for estimating Q = q* or q_pi
Input: an arbitrary behavior policy b such that b(a|s) > 0, for all s in S, a in A
Initialize Q(s,a) arbitrarily, for all s in S, a in A
Initialize pi to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size alpha in (0,1], a positive integer n
All store and access operations (for S_t, A_t, and R_t) can take their index mod n+1
Loop for each episode:
    Initialize and store S_0 != terminal
    Select and store an action A_0 from b(.|S_0)
    T = infty
    Loop for t = 0,1,2,...:
        if t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            if S_{t+1} is terminal, then:
                T = t + 1
            else:
                Select and store an action A_{t+1} from b(.|S_{t+1})
        tau = t - n + 1   (tau is the time whose estimate is being updated)
        if tau >= 0:
            rho = prod_{i = tau+1}^{min(tau+n, T-1)} pi(A_i|S_i)/b(A_i|S_i)
            G = sum_{i = tau+1}^{min(tau+n, T)} gamma^{i-tau-1} R_i
            if tau + n < T, then: G = G + gamma^n Q(S_{tau+n}, A_{tau+n})
            Q(S_tau,A_tau) = Q(S_tau,A_tau) + alpha rho [G - Q(S_tau,A_tau)]
            if pi is being learned, then ensure that pi(.|S_tau) is greedy wrt Q
    Until tau = T - 1
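The update step inside the loop above could look roughly like this in Python. This is a sketch of a single update at time tau over an already stored trajectory; the function name, the `pi(a, s)` / `b(a, s)` probability callbacks, and the layout (`rewards[i]` holds \(R_i\), index 0 unused) are assumptions. The ratio here runs through \(A_{\tau+n}\), that is \(\rho_{\tau+1:\tau+n}\), since that action was also drawn from b and is used for bootstrapping.

```python
def off_policy_n_step_sarsa_update(Q, pi, b, states, actions, rewards,
                                   tau, n, T, alpha, gamma):
    """Apply one importance-weighted n-step Sarsa update to Q in place."""
    # rho_{tau+1 : tau+n}, truncated at the last action A_{T-1}.
    rho = 1.0
    for i in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi(actions[i], states[i]) / b(actions[i], states[i])
    # n-step return, bootstrapping only if the episode is still running.
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q.get((states[tau + n], actions[tau + n]), 0.0)
    key = (states[tau], actions[tau])
    old = Q.get(key, 0.0)
    Q[key] = old + alpha * rho * (G - old)
    return Q[key]


target = lambda a, s: 0.8 if a == 0 else 0.2   # hypothetical target policy
behavior = lambda a, s: 0.5                    # uniform behavior policy
Q = {(2, 1): 4.0}
new_q = off_policy_n_step_sarsa_update(
    Q, target, behavior,
    states=[0, 1, 2, 3], actions=[0, 0, 1], rewards=[0.0, 1.0, 1.0, 1.0],
    tau=0, n=2, T=3, alpha=0.5, gamma=0.5)
```

In this example rho = 1.6 × 0.4 = 0.64 and G = 1 + 0.5 + 0.25 × 4 = 2.5, so the estimate for (S_0, A_0) moves from 0 to 0.5 × 0.64 × 2.5 = 0.8.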
Per-decision Off-policy Methods with Control Variates
(Not covered in these notes.)
Off-policy Learning without Importance Sampling: The n-step Tree Backup Algorithm
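The tree-backup algorithm is an off-policy n-step method that needs no importance sampling. Its update is based on the whole tree of estimated action values, or more precisely on the estimated action values of the leaf nodes (the actions that were not selected). The interior action nodes (the actions actually taken) do not take part in the update.
The one-step return is the same as in Expected Sarsa:
\[ G_{t:t+1} \dot = R_{t+1} + \gamma\sum_{a}\pi(a|S_{t+1})Q_t(S_{t+1},a), \quad t < T-1 \]
and the two-step tree-backup return is:
\[
\begin{array}{rcl}
G_{t:t+2} &\dot =& R_{t+1} + \gamma\sum_{a \ne A_{t+1}} \pi(a|S_{t+1})Q_{t+1}(S_{t+1},a)+ \gamma \pi(A_{t+1}|S_{t+1})\left(R_{t+2}+\gamma \sum_{a}\pi(a|S_{t+2})Q_{t+1}(S_{t+2},a)\right) \\
& = & R_{t+1} + \gamma\sum_{a\ne A_{t+1}}\pi(a|S_{t+1})Q_{t+1}(S_{t+1},a) + \gamma\pi(A_{t+1}|S_{t+1})G_{t+1:t+2}
\end{array}
\]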
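In general, therefore:
\[ G_{t:t+n} \dot = R_{t+1} + \gamma\sum_{a\ne A_{t+1}}\pi(a|S_{t+1})Q_{t+n-1}(S_{t+1},a) + \gamma\pi(A_{t+1}|S_{t+1})G_{t+1:t+n}, \quad t < T-1, n\ge 2 \]
with \(G_{T-1:t+n} \dot = R_T\). The update rule of the algorithm is:
\[ Q_{t+n}(S_t,A_t) \dot = Q_{t+n-1}(S_t,A_t) + \alpha\left[G_{t:t+n} - Q_{t+n-1}(S_t,A_t)\right], \quad 0\le t < T \]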
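The recursive definition of the tree-backup return translates fairly directly into code. The following Python sketch assumes a stored trajectory (`states[k]` = \(S_k\), `actions[k]` = \(A_k\), `rewards[k]` = \(R_k\) with index 0 unused), a probability callback `pi(a, s)`, and a dict Q keyed by `(state, action)`; all names are hypothetical.

```python
def tree_backup_return(Q, pi, action_set, states, actions, rewards,
                       t, h, T, gamma):
    """G_{t:h} for the tree-backup algorithm, assuming t < T and t < h."""
    if t + 1 == T:
        return rewards[T]                      # G_{T-1:h} = R_T
    s1 = states[t + 1]
    if t + 1 == h:
        # Leaf level: full expectation over all actions at S_{t+1}.
        return rewards[t + 1] + gamma * sum(
            pi(a, s1) * Q.get((s1, a), 0.0) for a in action_set)
    a1 = actions[t + 1]
    # Actions not taken contribute their current estimates ...
    off_path = sum(pi(a, s1) * Q.get((s1, a), 0.0)
                   for a in action_set if a != a1)
    # ... and the taken action contributes the deeper return, weighted by pi.
    deeper = tree_backup_return(Q, pi, action_set, states, actions, rewards,
                                t + 1, h, T, gamma)
    return rewards[t + 1] + gamma * off_path + gamma * pi(a1, s1) * deeper


uniform = lambda a, s: 0.5
Q = {(1, 0): 10.0, (1, 1): 20.0, (2, 0): 30.0, (2, 1): 40.0}
args = dict(Q=Q, pi=uniform, action_set=[0, 1], states=[0, 1, 2, 3],
            actions=[0, 1, 0], rewards=[0.0, 1.0, 2.0, 3.0], T=3, gamma=1.0)
g_one = tree_backup_return(t=0, h=1, **args)
g_two = tree_backup_return(t=0, h=2, **args)
g_three = tree_backup_return(t=0, h=3, **args)
```

No behavior-policy probabilities appear anywhere: the sampled action only decides which branch is expanded further, while its weight comes from the target policy.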
# n-step Tree Backup for estimating Q = q* or q_pi
Initialize Q(s,a) arbitrarily, for all s in S, a in A
Initialize pi to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size alpha in (0,1], a positive integer n
All store and access operations can take their index mod n+1
Loop for each episode:
    Initialize and store S_0 != terminal
    Choose an action A_0 arbitrarily as a function of S_0; Store A_0
    T = infty
    Loop for t = 0,1,2,...:
        If t < T:
            Take action A_t; observe and store the next reward and state as R_{t+1}, S_{t+1}
            if S_{t+1} is terminal:
                T = t + 1
            else:
                Choose an action A_{t+1} arbitrarily as a function of S_{t+1}; Store A_{t+1}
        tau = t + 1 - n   (tau is the time whose estimate is being updated)
        if tau >= 0:
            if t + 1 >= T:
                G = R_T
            else:
                G = R_{t+1} + gamma sum_{a} pi(a|S_{t+1}) Q(S_{t+1},a)
            Loop for k = min(t, T - 1) down through tau + 1:
                G = R_k + gamma sum_{a != A_k} pi(a|S_k) Q(S_k,a) + gamma pi(A_k|S_k) G
            Q(S_tau,A_tau) = Q(S_tau,A_tau) + alpha [G - Q(S_tau,A_tau)]
            if pi is being learned, then ensure that pi(.|S_tau) is greedy wrt Q
    Until tau = T - 1
*A Unifying Algorithm: n-step Q(\(\sigma\))
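n-step Sarsa uses sampled transitions throughout. Tree backup uses all branches of every state-to-action transition, with no sampling. n-step Expected Sarsa samples every transition except the last state-to-action one, for which it uses all branches.
To unify these three algorithms, one idea is to introduce a degree of sampling \(\sigma\in [0,1]\), chosen separately at each step: \(\sigma = 1\) means full sampling, and \(\sigma = 0\) means taking the expectation without sampling.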
Starting from the tree-backup n-step return (with h = t + n) and the expected approximate value \(\bar V_{h-1}(s) = \sum_a \pi(a|s)Q_{h-1}(s,a)\):
\[
\begin{array}{rcl}
G_{t:h} &\dot =& R_{t+1} + \gamma\sum_{a\ne A_{t+1}}\pi(a|S_{t+1})Q_{h-1}(S_{t+1},a) + \gamma\pi(A_{t+1}|S_{t+1})G_{t+1:h}\\
& = & R_{t+1} +\gamma \bar V_{h-1} (S_{t+1}) - \gamma\pi(A_{t+1}|S_{t+1})Q_{h-1}(S_{t+1},A_{t+1}) + \gamma\pi(A_{t+1}| S_{t+1})G_{t+1:h}\\
& =& R_{t+1} +\gamma\pi(A_{t+1}|S_{t+1})(G_{t+1:h} - Q_{h-1}(S_{t+1},A_{t+1})) + \gamma \bar V_{h-1}(S_{t+1})\\
\\
&& (\text{introducing } \sigma)\\
\\
& = & R_{t+1} + \gamma(\sigma_{t+1}\rho_{t+1} + (1 - \sigma_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1})) + \gamma \bar V_{h-1}(S_{t+1})
\end{array}
\]
where \(\rho_{t+1} = \pi(A_{t+1}|S_{t+1})/b(A_{t+1}|S_{t+1})\). The recursion ends with \(G_{h:h} \dot = Q_{h-1}(S_h,A_h)\) if \(h < T\), or with \(G_{T-1:T} \dot = R_T\) if \(h = T\).
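The final line of the derivation can be turned into a short recursive function. The Python sketch below is written under the same kind of assumptions as a stored-trajectory implementation: `states[k]` = \(S_k\), `actions[k]` = \(A_k\), `rewards[k]` = \(R_k\) and `sigmas[k]` = \(\sigma_k\) (index 0 unused), probability callbacks `pi(a, s)` and `b(a, s)`, and \(h \le T\). The recursion is assumed to stop at \(G_{h:h} = Q(S_h, A_h)\) when \(h < T\) and at \(G_{T-1:T} = R_T\) otherwise; all names are hypothetical.

```python
def q_sigma_return(Q, pi, b, action_set, states, actions, rewards, sigmas,
                   t, h, T, gamma):
    """G_{t:h} for n-step Q(sigma), assuming t <= h <= T."""
    if t == h:
        return Q.get((states[h], actions[h]), 0.0)   # only reached if h < T
    if t + 1 == T:
        return rewards[T]
    s1, a1 = states[t + 1], actions[t + 1]
    rho = pi(a1, s1) / b(a1, s1)
    sigma = sigmas[t + 1]
    v_bar = sum(pi(a, s1) * Q.get((s1, a), 0.0) for a in action_set)
    # Blend between sampling (sigma = 1) and expectation (sigma = 0).
    weight = sigma * rho + (1.0 - sigma) * pi(a1, s1)
    deeper = q_sigma_return(Q, pi, b, action_set, states, actions, rewards,
                            sigmas, t + 1, h, T, gamma)
    return (rewards[t + 1]
            + gamma * weight * (deeper - Q.get((s1, a1), 0.0))
            + gamma * v_bar)


uniform = lambda a, s: 0.5
Q = {(1, 0): 10.0, (1, 1): 20.0, (2, 0): 30.0, (2, 1): 40.0}
common = dict(Q=Q, pi=uniform, b=uniform, action_set=[0, 1],
              states=[0, 1, 2, 3], actions=[0, 1, 0],
              rewards=[0.0, 1.0, 2.0, 3.0], T=3, gamma=1.0)
g_tree = q_sigma_return(sigmas=[0, 0, 0, 0], t=0, h=2, **common)
g_sampled = q_sigma_return(sigmas=[1, 1, 1, 1], t=0, h=2, **common)
```

With every sigma set to 0 the behavior policy drops out and the result coincides with the tree-backup return for the same trajectory; intermediate values of sigma interpolate between the two extremes step by step.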
# Off-policy n-step Q(sigma) for estimating Q = q* or q_pi
Input: an arbitrary behavior policy b such that b(a|s) > 0, for all s in S, a in A
Initialize Q(s,a) arbitrarily, for all s in S, a in A
Initialize pi to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size alpha in (0,1], a positive integer n
All store and access operations can take their index mod n+1
Loop for each episode:
    Initialize and store S_0 != terminal
    Choose and store an action A_0 from b(.|S_0)
    T = infty
    Loop for t = 0,1,2,...:
        If t < T:
            Take action A_t; observe and store the next reward and state as R_{t+1}, S_{t+1}
            if S_{t+1} is terminal:
                T = t + 1
            else:
                Choose and store an action A_{t+1} from b(.|S_{t+1})
                Select and store sigma_{t+1}
                Store rho_{t+1} = pi(A_{t+1}|S_{t+1})/b(A_{t+1}|S_{t+1})
        tau = t + 1 - n   (tau is the time whose estimate is being updated)
        if tau >= 0:
            if t + 1 < T:
                G = Q(S_{t+1}, A_{t+1})
            Loop for k = min(t + 1, T) down through tau + 1:
                if k = T:
                    G = R_T
                else:
                    V_bar = sum_{a} pi(a|S_k) Q(S_k,a)
                    G = R_k + gamma (sigma_k rho_k + (1 - sigma_k) pi(A_k|S_k)) (G - Q(S_k,A_k)) + gamma V_bar
            Q(S_tau,A_tau) = Q(S_tau,A_tau) + alpha [G - Q(S_tau,A_tau)]
            if pi is being learned, then ensure that pi(.|S_tau) is greedy wrt Q
    Until tau = T - 1