Temporal-Difference Control: SARSA and Q-Learning
SARSA
The SARSA algorithm also estimates Action-Value functions rather than the State-Value function. The difference between SARSA and Monte Carlo is that SARSA does not need to wait for the actual return until the end of the episode; instead, it learns at every time step from an estimate of the return.
At every step, the agent takes an action A from state S, receives a reward R, and arrives at a new state S'. Following the policy π (typically ε-greedy), the algorithm then picks the next action A'. So now we have S, A, R, S', A', and the task is to estimate the Q-function of the (S, A) pair.
We borrow the idea of estimating State-Value functions and apply it to Action-Value function estimation, which gives the SARSA update:

Q(S, A) ← Q(S, A) + α[R + γ·Q(S', A') − Q(S, A)]

where α is the learning rate and γ is the discount factor.
Here is the pseudocode for SARSA:
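What follows is a minimal tabular SARSA sketch in Python rather than the original figure; the environment interface (reset() returning a state, step(action) returning next_state, reward, done), the epsilon_greedy helper, and the hyperparameter values are illustrative assumptions, not part of the original post.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: move Q(S, A) toward R + gamma * Q(S', A')."""
    Q = defaultdict(float)  # Q-table with default value 0.0
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # A' is chosen by the same policy the agent will follow next (on-policy)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            td_target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```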
On-Policy vs Off-Policy
If we look into the learning process, there are actually two steps. First, the agent takes an action A from state S based on policy π, receives the reward R, and arrives at the next state S'. Second, the update bootstraps from the Q-function of an action A' chosen from S', following the same policy π. In SARSA both steps use the same policy π, but they can in fact be different. The policy used in the first step, which selects the actions the agent actually executes, is called the Behavior Policy. The policy used in the second step to form the update target is the Target Policy; this is the policy we are updating and trying to improve. Q-Learning uses different policies for the two steps.
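To make the distinction concrete, the fragment below (reusing Q, actions, epsilon, gamma, and the epsilon_greedy helper from the SARSA sketch above; all names are illustrative) shows how the two algorithms build their TD targets from the same transition S, A, R, S':

```python
# SARSA (on-policy): the bootstrap action A' comes from the behavior policy itself.
next_action = epsilon_greedy(Q, next_state, actions, epsilon)
sarsa_target = reward + gamma * Q[(next_state, next_action)]

# Q-Learning (off-policy): the target policy is greedy over Q at S',
# even though the action actually executed next may be exploratory.
q_learning_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
```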
Q-Learning
From state S', the Q-Learning algorithm bootstraps from the action that maximizes the Q-function: it stands at state S', looks over all possible actions, and uses the best one in its update target, regardless of which action it actually takes next. The update becomes:

Q(S, A) ← Q(S, A) + α[R + γ·max_a' Q(S', a') − Q(S, A)]
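A minimal tabular Q-Learning sketch, mirroring the SARSA code above (it reuses the assumed env interface, the epsilon_greedy helper, and the defaultdict import; the hyperparameters are again illustrative):

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: move Q(S, A) toward R + gamma * max_a Q(S', a)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy, so the agent keeps exploring
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Target policy: greedy over all actions at S' (the max below)
            best_next = max(Q[(next_state, a)] for a in actions)
            td_target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```

The only change from SARSA is the update target: the maximum Q-value over all actions at S' instead of the Q-value of the action the policy actually picks next.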