Reinforcement Learning: An Introduction Reading Notes (3)--Finite MDPs

> Contents <

Agent–Environment Interface
Goals and Rewards
Returns and Episodes
Policies and Value Functions
Optimal Policies and Optimal Value Functions

> Notes <

Agent–Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0,1,2,\dots$. At each time step $t$, the agent receives some representation of the environment's state, $S_{t}\in S$, where $S$ is the set of possible states, and on that basis selects an action, $A_{t}\in A(S_{t})$, where $A(S_{t})$ is the set of actions available in state $S_{t}$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1}\in R \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_{t}$, where $\pi_{t}(a|s)$ is the probability that $A_{t}=a$ if $S_{t}=s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

In short: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices.

Figure 1. Agent–environment interaction in an MDP
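The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the two-state environment, its action names, and its reward values are made up for the example and do not come from the book, and the policy is simply uniform random.

```python
import random

# Hypothetical two-state MDP (names and numbers are illustrative assumptions).
# DYNAMICS[state][action] is a list of (probability, next_state, reward).
DYNAMICS = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(1.0, "high", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 1.0)],
        "search": [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    },
}


def policy(state, rng):
    """Uniform random policy pi(a|s) over the actions available in `state`."""
    return rng.choice(sorted(DYNAMICS[state]))


def step(state, action, rng):
    """Environment response: sample (next_state, reward) given state and action."""
    outcomes = DYNAMICS[state][action]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def run(num_steps, seed=0):
    """Run the agent-environment loop and record (S_t, A_t, R_{t+1}) triples."""
    rng = random.Random(seed)
    state = "high"
    trajectory = []
    for _ in range(num_steps):
        action = policy(state, rng)                     # agent picks A_t from S_t
        next_state, reward = step(state, action, rng)   # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Calling `run(5)` returns five `(state, action, reward)` triples; the agent only ever sees the state and the reward, never the `DYNAMICS` table itself.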

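Markov property: if the state signal has the Markov property, then what happens next depends only on the current state (and the action taken in it), because the current state already contains all the relevant information from past experience. The Markov property matters for RL because decisions and values are generally assumed to be functions of the current state only.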

Dynamics of the MDP: $p(s',r|s,a)=\Pr\left \{ S_{t}=s',R_{t}=r|S_{t-1}=s,A_{t-1}=a \right \}$,

where $\sum_{s'\in S}\sum_{r\in R}p(s',r|s,a)=1$, for all $s\in S$, $a\in A(s)$.

From the dynamics of the MDP we can readily derive the state-transition probabilities $p(s'|s,a)$, the expected rewards for state–action pairs $r(s,a)$, and the expected rewards for state–action–next-state triples $r(s,a,s')$.
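These three quantities follow from $p(s',r|s,a)$ by marginalizing: $p(s'|s,a)=\sum_{r}p(s',r|s,a)$, $r(s,a)=\sum_{r}r\sum_{s'}p(s',r|s,a)$, and $r(s,a,s')=\sum_{r}r\,p(s',r|s,a)/p(s'|s,a)$. A minimal sketch, assuming the dynamics are stored in a plain dictionary (the table layout and the numbers are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical dynamics table: p[(s, a)] maps (s_next, r) -> probability.
p = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 3.0): 0.25, ("low", 0.0): 0.25},
}


def transition_probs(p, s, a):
    """p(s'|s,a): sum the joint probability over all rewards r."""
    out = defaultdict(float)
    for (s_next, _reward), prob in p[(s, a)].items():
        out[s_next] += prob
    return dict(out)


def expected_reward(p, s, a):
    """r(s,a): expected reward for the state-action pair."""
    return sum(r * prob for (_s_next, r), prob in p[(s, a)].items())


def expected_reward_given_next(p, s, a, s_next):
    """r(s,a,s'): expected reward given that the next state is s_next."""
    total = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return total / transition_probs(p, s, a)[s_next]
```

For the table above, `transition_probs(p, "high", "search")` gives 0.75 for `"high"` and 0.25 for `"low"`, and `expected_reward(p, "high", "search")` is 1.25.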

Goals and Rewards

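The agent's goal is formalized as a reward signal passed from the environment to the agent. By choosing the values of the reward signal we communicate with the agent: the reward tells it what you want it to achieve, not how you want it achieved.

The agent's objective is to maximize total reward. What it maximizes is therefore not the immediate reward but the cumulative reward in the long run.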

Returns and Episodes

The return is the function of future rewards that the agent seeks to maximize (in expected value). It takes several forms, depending on the task itself and on whether future rewards are to be discounted.

Return (undiscounted): $G_{t}=R_{t+1}+R_{t+2}+\dots+R_{T}$, where $T$ is a final time step. Suited to episodic tasks.

Episodic tasks: each episode ends in the terminal state, followed by a reset to a standard starting state or a sample from a standard distribution of starting states.

Continuing tasks: the agent–environment interaction doesn’t break naturally into identifiable episodes, but goes on continually without limit.

Discounted return: $G_{t}=R_{t+1}+\gamma R_{t+2}+ \gamma ^{2}R_{t+3}+\dots=\sum_{k=0}^{\infty}\gamma ^{k}R_{t+k+1}$, where the discount rate $\gamma$ ($0\leq\gamma\leq 1$) determines the present value of future rewards. Suited to continuing tasks.
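Both forms of the return can be computed with the recursion $G_{t}=R_{t+1}+\gamma G_{t+1}$, working backwards from the last reward. A minimal sketch, assuming the rewards are available as a finite list (an infinite continuing task can only be approximated by truncating it):

```python
def discounted_return(rewards, gamma):
    """Return G_t for the reward sequence R_{t+1}, R_{t+2}, ... given as a list.

    gamma = 1 gives the plain undiscounted sum used for episodic tasks.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g
```

For example, `discounted_return([1, 1, 1], 0.5)` is $1 + 0.5 + 0.25 = 1.75$. With a constant reward of $+1$ and $\gamma<1$, the infinite sum equals $1/(1-\gamma)$, so a long truncated list with $\gamma=0.9$ approaches 10.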

Policies and Value Functions

Value functions: functions of states (or state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected or, to be precise, in terms of the expected return from that state (or state-action pair).

Policy: a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time t, then $\pi(a|s)$ is the probability that $A_{t} = a$ if $S_{t} = s$.

The value of a state $s$ under a policy $\pi$ (i.e. the expected return when starting in $s$ and following $\pi$ thereafter):

$v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\,S_{t}=s \right],\quad \text{for all } s\in S$

We call the function $v_{\pi}$ the state-value function for policy $\pi$.

The value of taking action $a$ in state $s$ under a policy $\pi$ (i.e. the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$):

$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s, A_{t}=a]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\,S_{t}=s, A_{t}=a\right]$

We call $q_{\pi}$ the action-value function for policy $\pi$.

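Bellman equation for $v_{\pi}$: it expresses a relationship between the value of a state and the values of its successor states.

$v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)\left[r+\gamma v_{\pi}(s')\right],\quad \text{for all } s\in S$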
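One way to see the Bellman equation at work is to use it as an update rule: replace each state's value by the policy-weighted average of reward plus discounted successor value, and repeat until nothing changes. The sketch below does this for a made-up two-state MDP. This procedure is iterative policy evaluation, which belongs to the dynamic programming chapter rather than this one; the state names, policy, and rewards are illustrative assumptions, and $\gamma<1$ is assumed so that the loop converges.

```python
def evaluate_policy(states, pi, p, gamma, tol=1e-10):
    """Apply the Bellman equation for v_pi as an update until values stop changing.

    pi[s] maps action -> probability; p[(s, a)] maps (s_next, r) -> probability.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v[s_next])
                for a, pi_a in pi[s].items()
                for (s_next, r), prob in p[(s, a)].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


# Hypothetical MDP: from A, "go" leads to B and "stay" loops on A (both reward 0);
# B loops on itself with reward 1.
states = ["A", "B"]
p = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "go"): {("B", 0.0): 1.0},
    ("B", "stay"): {("B", 1.0): 1.0},
}
pi = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 1.0}}

v = evaluate_policy(states, pi, p, gamma=0.5)
```

With $\gamma=0.5$ the fixed point can be checked by hand: $v_{\pi}(B)=1+0.5\,v_{\pi}(B)$ gives 2, and $v_{\pi}(A)=0.5\cdot 0.5\,v_{\pi}(B)+0.5\cdot 0.5\,v_{\pi}(A)$ gives $2/3$.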

Optimal Policies and Optimal Value Functions

Value functions define a partial ordering over policies: $\pi \geq \pi'$ if and only if $v_{\pi}(s) \geq v_{\pi'}(s)$ for all $s \in S$. The optimal value functions assign to each state, or state–action pair, the largest expected return achievable by any policy.

Optimal policy $\pi_{*}$: a policy whose value functions are optimal. There is always at least one policy (and there may be many) that is better than or equal to all other policies.

Optimal state-value function:

$v_{*}(s)=\max_{\pi} v_{\pi}(s),\quad \text{for all } s \in S$

Optimal action-value function:

$q_{*}(s,a)=\max_{\pi} q_{\pi}(s,a),\quad \text{for all } s \in S \text{ and } a \in A(s)$

Expressing $q_{*}$ in terms of $v_{*}$:

$q_{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s, A_{t}=a]$

Any policy that is greedy with respect to the optimal value functions must be an optimal policy. The Bellman optimality equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions.
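Both statements can be illustrated on a tiny example: solve for $v_{*}$ by repeatedly taking, in every state, the best one-step lookahead value, then read off a greedy policy. The solver below is value iteration, a method from the dynamic programming chapter that is swapped in here only for illustration; the MDP, its names, and its numbers are assumptions made for the example, and $\gamma<1$ is assumed.

```python
def q_value(v, p, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p[(s, a)].items())


def value_iteration(states, actions, p, gamma, tol=1e-10):
    """Iterate v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(v, p, s, a, gamma) for a in actions[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


def greedy_policy(states, actions, p, v, gamma):
    """Pick, in each state, an action that maximizes the one-step lookahead."""
    return {s: max(actions[s], key=lambda a: q_value(v, p, s, a, gamma)) for s in states}


# Hypothetical MDP: staying in A pays 0.4 per step; going to B pays nothing now,
# but B then pays 1 per step forever.
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}
p = {
    ("A", "stay"): {("A", 0.4): 1.0},
    ("A", "go"): {("B", 0.0): 1.0},
    ("B", "stay"): {("B", 1.0): 1.0},
}

v_star = value_iteration(states, actions, p, gamma=0.5)
pi_star = greedy_policy(states, actions, p, v_star, gamma=0.5)
```

Here $v_{*}(B)=2$ and $v_{*}(A)=1$, and the greedy policy chooses `"go"` in `A`: the immediate reward of 0.4 for staying is outweighed by the discounted value of reaching `B`, which is the long-run view the optimal value function encodes.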

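Bellman optimality equation for $v_{*}$:

$v_{*}(s)=\max_{a}\sum_{s',r}p(s',r|s,a)\left[r+\gamma v_{*}(s')\right]$

Bellman optimality equation for $q_{*}$:

$q_{*}(s,a)=\sum_{s',r}p(s',r|s,a)\left[r+\gamma \max_{a'}q_{*}(s',a')\right]$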

[Figure: backup diagrams for $v_{*}$ and $q_{*}$]
