Key Concepts in RL

Tags (space-separated): RL_learning


Original source: OpenAI Spinning Up

  • states and observations
  • action spaces
  • policies
  • trajectories
  • different formulations of return
  • the RL optimization problem
  • value functions

States and Observations

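A state \(s\) is a complete description of the world: it reflects the world as it actually is, with nothing hidden.

An observation \(o\) is a partial description of a state, and may omit some information.

In deep RL, states and observations are usually represented by a real-valued vector, matrix, or higher-order tensor. For an image, the observation is the matrix of pixel values; for a robot, it is quantities such as its velocities and joint angles.

If the agent can observe the complete state of the environment, we say the environment is fully observed. If the agent can only see part of that information, we say it is partially observed.

Note: in theory the choice of action depends on the state, but in practice it depends on the observation, because the state itself is not available to the agent.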

Action Spaces

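Different environments allow different kinds of actions. The set of all valid actions in a given environment is called the action space. Some environments, such as Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, such as controlling a robot in the physical world, have continuous action spaces, in which actions are usually real-valued vectors.

The kind of action space has a large effect on RL algorithms: some algorithms can be applied directly only in one case, and would have to be substantially reworked for the other.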

Policies

A policy is the rule an agent uses to choose actions. It can be deterministic, in which case it is usually denoted by \(\mu\):

\[a_t=\mu(s_t),
\]

or it may be stochastic, in which case it is usually denoted by \(\pi\):

\[a_t\sim\pi(\cdot\mid s_t).
\]

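In deep RL we deal with parameterized policies:

policies whose outputs are computable functions that control the agent's behavior and that depend on a set of parameters, for example the weights and biases of a neural network. We adjust these parameters with an optimization algorithm in order to change the behavior.

The parameters of such a policy are usually denoted by \(\theta\) or \(\phi\), written as a subscript on the policy symbol: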

\[\begin{aligned}
a_t &= \mu_\theta(s_t),\\
a_t &\sim \pi_\theta(\cdot\mid s_t).
\end{aligned}
\]

Deterministic Policies

Below is an example of a deterministic policy for a continuous action space, written with the TensorFlow 1.x API:

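import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.layers)
# obs_dim / act_dim: observation and action dimensions; mlp is described below
obs = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
net = mlp(obs, hidden_dims=(64,64), activation=tf.tanh)
actions = tf.layers.dense(net, units=act_dim, activation=None)  # linear output layer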

where mlp is a function that stacks multiple dense layers on top of each other with the given sizes and activation.
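For readers without TensorFlow 1.x at hand, the same idea can be sketched with the Python standard library alone. The helper names (`init_mlp`, `mlp_forward`), the layer sizes, and the random initialization are illustrative assumptions, not the Spinning Up implementation.

```python
import math
import random

def init_mlp(sizes, rng):
    """Create one (weights, biases) pair per layer for the given layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def mlp_forward(layers, obs):
    """Apply tanh on hidden layers and no activation on the output layer."""
    x = list(obs)
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

rng = random.Random(0)
obs_dim, act_dim = 3, 2                              # hypothetical dimensions
policy = init_mlp([obs_dim, 64, 64, act_dim], rng)   # theta: all weights and biases
action = mlp_forward(policy, [0.5, -1.0, 0.2])       # a = mu_theta(s)
```

Calling `mlp_forward` twice on the same observation returns the same action, which is what makes the policy deterministic.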

Stochastic Policies

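In deep RL, the two most common kinds of stochastic policies are categorical policies and diagonal Gaussian policies.

Categorical policies are suited to discrete action spaces, while diagonal Gaussian policies are suited to continuous action spaces.

Two computations are centrally important for using and training stochastic policies:

  • sampling actions from the policy,
  • and computing log likelihoods of particular actions, \(\log\pi_\theta(a|s)\).

The first is how an action is chosen according to the policy; the second gives the log-probability that the policy assigns to a given action.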

Categorical Policies

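A categorical policy is like a classifier over discrete actions, and its neural network is built the same way as a classifier: the input is the observation, followed by some number of layers (convolutional or densely-connected, depending on the kind of input), then one final linear layer that produces a logit for each action, followed by a softmax that converts the logits into probabilities.

Sampling. Given the probability of each action, an action is drawn from the corresponding categorical distribution.

Log-Likelihood. Denote the final vector of probabilities by \(P_\theta(s)\), with one entry per action. The log-likelihood of an action \(a\) is obtained by indexing into that vector: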

\[\log\pi_\theta (a|s)=\log[P_\theta(s)]_a.
\]
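The two operations for a categorical policy can be sketched as follows. This is a minimal standard-library illustration; the logits stand in for the output of a policy network, and the function names are assumptions made for this example.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an action index with probability probs[index]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def log_likelihood(probs, action):
    """log pi(a|s) = log [P(s)]_a: index into the probability vector."""
    return math.log(probs[action])

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]      # hypothetical network output for one observation
probs = softmax(logits)
a = sample_action(probs, rng)
logp = log_likelihood(probs, a)
```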

Diagonal Gaussian Policies

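A multivariate Gaussian distribution is described by a mean vector \(\mu\) and a covariance matrix \(\Sigma\). A diagonal Gaussian distribution is the special case in which the covariance matrix has entries only on the diagonal, so it can be represented by a vector.

A diagonal Gaussian policy usually has a neural network that maps observations to mean actions, \(\mu_\theta (s)\).

The standard deviation is typically represented by its logarithm: the policy stores or outputs \(\log \sigma\) rather than \(\sigma\) itself. Log standard deviations are free to take any value in \((-\infty,\infty)\), whereas standard deviations must be nonnegative. Training is much simpler without such a constraint on the parameters, and the standard deviation can be recovered at any time by exponentiating the log standard deviation.

Sampling. Given the mean action \(\mu_\theta(s)\), the standard deviation \(\sigma_\theta(s)\), and a vector \(z\) of noise from a spherical Gaussian (\(z \sim \mathcal{N}(0, I)\)), an action sample can be computed with

\[a = \mu_{\theta}(s) + \sigma_{\theta}(s) \odot z,
\]

where \(\odot\) denotes the elementwise product of two vectors.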

Log-Likelihood. The log-likelihood of a \(k\)-dimensional action \(a\), for a diagonal Gaussian with mean \(\mu = \mu_{\theta}(s)\) and standard deviation \(\sigma = \sigma_{\theta}(s)\), is given by

\[\log \pi_{\theta}(a|s) = -\frac{1}{2}\left(\sum_{i=1}^k \left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right).
\]
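Sampling and the log-likelihood formula above translate directly into code. The sketch below uses only the standard library; the mean and log standard deviation values are made up for illustration, where a real policy would obtain them from its network or parameters.

```python
import math
import random

def sample_action(mu, log_std, rng):
    """a = mu + sigma * z elementwise, with z ~ N(0, I) and sigma = exp(log_std)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]

def log_likelihood(a, mu, log_std):
    """Log-density of a diagonal Gaussian, term by term as in the formula above."""
    k = len(a)
    total = 0.0
    for a_i, mu_i, ls_i in zip(a, mu, log_std):
        sigma_i = math.exp(ls_i)
        total += ((a_i - mu_i) / sigma_i) ** 2 + 2.0 * ls_i   # 2 log sigma_i = 2 * ls_i
    return -0.5 * (total + k * math.log(2.0 * math.pi))

rng = random.Random(0)
mu = [0.0, 1.0]            # hypothetical mean action
log_std = [-0.5, -0.5]     # hypothetical log standard deviations
a = sample_action(mu, log_std, rng)
logp = log_likelihood(a, mu, log_std)
```

As a sanity check, for a two-dimensional standard Gaussian evaluated at its mean the function returns \(-\log 2\pi\).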

Trajectories

A trajectory \(\tau\) is a sequence of states and actions in the world,

\[\tau = (s_0, a_0, s_1, a_1, ...).
\]

The very first state of the world, \(s_0\), is randomly sampled from the start-state distribution, sometimes denoted by \(\rho_0\):

\[s_0 \sim \rho_0(\cdot).
\]

State transitions (what happens to the world between the state at time \(t\), \(s_t\), and the state at \(t+1\), \(s_{t+1}\)) are governed by the natural laws of the environment, and depend on only the most recent action, \(a_t\). They can be either deterministic,

\[s_{t+1} = f(s_t, a_t)
\]

or stochastic,

\[s_{t+1} \sim P(\cdot|s_t, a_t).
\]

Actions come from an agent according to its policy.
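To make the pieces concrete, here is a sketch that samples a trajectory in a toy environment. Everything about the environment is an assumption made for illustration: five positions on a line, a start-state distribution that is uniform over three of them, a policy that prefers moving right, and moves that fail 20% of the time.

```python
import random

def sample_start_state(rng):
    """s_0 ~ rho_0: uniform over a few start positions (toy assumption)."""
    return rng.choice([1, 2, 3])

def policy(state, rng):
    """a_t ~ pi(.|s_t): move right with probability 0.7, otherwise left."""
    return 1 if rng.random() < 0.7 else -1

def transition(state, action, rng):
    """s_{t+1} ~ P(.|s_t, a_t): the move succeeds with probability 0.8."""
    if rng.random() < 0.8:
        return min(4, max(0, state + action))
    return state

def rollout(num_steps, rng):
    """Return tau = [s_0, a_0, s_1, a_1, ..., s_T]."""
    state = sample_start_state(rng)
    trajectory = [state]
    for _ in range(num_steps):
        action = policy(state, rng)
        state = transition(state, action, rng)
        trajectory += [action, state]
    return trajectory

tau = rollout(num_steps=5, rng=random.Random(0))
```

Note that `transition` looks only at the current state and the most recent action, matching the description of state transitions above.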

Reward and Return

The reward function \(R\) is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:

\[r_t = R(s_t, a_t, s_{t+1})
\]

although frequently this is simplified to just a dependence on the current state, \(r_t = R(s_t)\), or state-action pair \(r_t = R(s_t,a_t)\).

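In other words, the reward is a function of the current state, the action taken in it, and the next state; more commonly it is simplified to a function of the current state alone, or of the current state and action.

The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this actually can mean a few things. We’ll notate all of these cases with \(R(\tau)\), and it will either be clear from context which case we mean, or it won’t matter (because the same equations will apply to all cases).

The agent's goal is to maximize cumulative reward, and \(R(\tau)\) denotes that cumulative reward, the return, over a trajectory.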

One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps:

\[R(\tau)=\sum_{t=0}^Tr_t.
\]

Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained. This formulation of reward includes a discount factor \(\gamma \in(0,1)\):

\[R(\tau)=\sum_{t=0}^\infty \gamma^t r_t.
\]

Why would we ever want a discount factor, though? Don’t we just want to get all rewards? We do, but the discount factor is both intuitively appealing and mathematically convenient. On an intuitive level: cash now is better than cash later. Mathematically: an infinite-horizon sum of rewards may not converge to a finite value, and is hard to deal with in equations. But with a discount factor and under reasonable conditions, the infinite sum converges.

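To summarize, there are two ways of accumulating reward: summing the undiscounted reward of each step over a finite number of steps, or summing over infinitely many steps while discounting each reward by a factor between 0 and 1.

So why use a discount factor at all, if we want all of the reward? The main reason is mathematical convenience: an undiscounted sum over infinitely many rewards can grow without bound, whereas a discounted sum of bounded rewards converges to a finite value. For example, with a constant reward of 1 per step the undiscounted sum diverges, while the discounted sum converges to \(1/(1-\gamma)\).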
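Both kinds of return are short to compute from a list of rewards. The sketch below is illustrative only: the reward values are made up, and the infinite-horizon sum is necessarily truncated to the rewards that are actually available.

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: the plain sum of rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the rewards given (a truncated infinite sum)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]                      # hypothetical rewards r_0..r_3
G_undiscounted = undiscounted_return(rewards)       # 1 + 0 + 2 + 1
G_discounted = discounted_return(rewards, 0.5)      # 1 + 0 + 2*0.25 + 1*0.125

# With a constant reward of 1, the discounted sum approaches 1 / (1 - gamma).
long_run = discounted_return([1.0] * 1000, 0.9)
```

Here `long_run` is within rounding error of 10, the limit \(1/(1-\gamma)\) for \(\gamma = 0.9\).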

The RL Problem

Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.

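Put differently: whichever form of return and whichever kind of policy we use, the aim of RL is to find a policy under which the agent's action choices maximize expected return.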

To talk about expected return, we first have to talk about probability distributions over trajectories.

Let’s suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a \(T\)-step trajectory is:

\[P(\tau|\pi)=\rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t,a_t)\pi(a_t|s_t).
\]

The expected return (for whichever measure), denoted by \(J(\pi)\), is then:

\[J(\pi)=\int_\tau P(\tau|\pi)R(\tau)=\underset {\tau \sim\pi}{E}[R(\tau)].
\]

The central optimization problem in RL can then be expressed by

\[\pi^*=\arg\max_\pi J(\pi),
\]

with \(\pi^*\) being the optimal policy.
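For a small discrete problem, the trajectory probability and the expected return can be computed exactly by enumerating every trajectory, with the integral becoming a sum. The two-state, two-action MDP below, its numbers, and the reward rule are all invented for this sketch.

```python
from itertools import product

STATES, ACTIONS = (0, 1), (0, 1)
rho0 = {0: 0.5, 1: 0.5}                                     # start-state distribution
P = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
     (1, 0): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.0, 1: 1.0}}    # P(s'|s,a)
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}             # pi(a|s)

def reward(s, a, s_next):
    """R(s_t, a_t, s_{t+1}): 1 for landing in state 1, else 0 (toy choice)."""
    return 1.0 if s_next == 1 else 0.0

def trajectory_probability(states, actions):
    """P(tau|pi) = rho0(s_0) * prod_t P(s_{t+1}|s_t,a_t) * pi(a_t|s_t)."""
    prob = rho0[states[0]]
    for t, a in enumerate(actions):
        prob *= P[(states[t], a)][states[t + 1]] * pi[states[t]][a]
    return prob

def expected_return(T):
    """J(pi) for the undiscounted T-step return, summed over all trajectories."""
    total_prob, J = 0.0, 0.0
    for states in product(STATES, repeat=T + 1):
        for actions in product(ACTIONS, repeat=T):
            p = trajectory_probability(states, actions)
            ret = sum(reward(states[t], actions[t], states[t + 1]) for t in range(T))
            total_prob += p
            J += p * ret
    return J, total_prob

J, total_prob = expected_return(T=3)
```

The trajectory probabilities sum to 1, which is a useful check on the factorization. Enumeration grows exponentially with \(T\), so in practice the expectation is estimated from sampled trajectories instead.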

Value Functions

It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.

Knowing the value of a state or of a state-action pair is useful: it is the expected return obtained by starting from that state or pair and selecting actions according to the policy from then on. Almost every RL algorithm uses one or more value functions.

There are four main value functions of note.

  1. The On-Policy Value Function, \(V^{\pi}(s)\), which gives the expected return if you start in state \(s\) and always act according to policy \(\pi\):

\[V^\pi(s)=\underset {\tau \sim \pi} {E} [R(\tau)|s_0 = s]
\]

  2. The On-Policy Action-Value Function, \(Q^{\pi}(s,a)\), which gives the expected return if you start in state \(s\), take an arbitrary action \(a\) (which may not have come from the policy), and then forever after act according to policy \(\pi\):

\[Q^\pi(s,a)=\underset{\tau \sim \pi} {E} [R(\tau)|s_0=s,a_0=a]
\]

  3. The Optimal Value Function, \(V^*(s)\), which gives the expected return if you start in state \(s\) and always act according to the optimal policy in the environment:

\[V^*(s) = \max_{\pi} \underset {\tau \sim \pi}{E} [R(\tau)|s_0 = s]
\]

  4. The Optimal Action-Value Function, \(Q^*(s,a)\), which gives the expected return if you start in state \(s\), take an arbitrary action \(a\), and then forever after act according to the optimal policy in the environment:

\[Q^*(s,a) = \max_{\pi} \underset {\tau \sim \pi} {E} [R(\tau)|s_0 = s, a_0 = a]
\]

The Optimal Q-function and the Optimal Action

There is an important connection between the optimal action-value function \(Q^*(s,a)\) and the action selected by the optimal policy. By definition, \(Q^*(s,a)\) gives the expected return for starting in state \(s\), taking (arbitrary) action \(a\), and then acting according to the optimal policy forever after.

The optimal policy in \(s\) will select whichever action maximizes the expected return from starting in \(s\). As a result, if we have \(Q^*\), we can directly obtain the optimal action, \(a^*(s)\), via

\[a^*(s) = \arg {\max_{a}{Q^*(s,a)}}.
\]

Note: there may be multiple actions which maximize \(Q^*(s,a)\), in which case, all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.
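Reading the optimal action off a table of \(Q^*\) values is a one-line argmax. In this sketch the table entries are hypothetical numbers chosen so that one state has a unique best action and the other has a tie.

```python
# Hypothetical optimal action values Q*(s, a) for two states and three actions.
Q_star = {
    "s0": {"left": 0.2, "stay": 0.5, "right": 0.9},
    "s1": {"left": 0.7, "stay": 0.7, "right": 0.1},
}

def optimal_actions(q_values, state):
    """All actions attaining max_a Q*(s, a); every one of them is optimal."""
    best = max(q_values[state].values())
    return [a for a, q in q_values[state].items() if q == best]

def optimal_action(q_values, state):
    """a*(s) = argmax_a Q*(s, a), breaking ties by taking the first maximizer."""
    return optimal_actions(q_values, state)[0]
```

In state `"s1"` both `"left"` and `"stay"` are optimal; `optimal_action` picks one of them deterministically, in line with the note above.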

Bellman Equations

All four of the value functions obey special self-consistency equations called Bellman equations. The basic idea behind the Bellman equations is this:

The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The Bellman equations for the on-policy value functions are

\[\begin{aligned}
V^\pi(s) &= \underset{a \sim \pi,\, s' \sim P}{E}[r(s,a)+\gamma V^\pi(s')],\\
Q^\pi(s,a) &= \underset{s' \sim P}{E}[r(s,a)+\gamma \underset{a'\sim \pi}{E}[Q^\pi(s',a')]],
\end{aligned}
\]

where \(s'\sim P\) is shorthand for \(s'\sim P(\cdot |s,a)\), indicating that the next state \(s'\) is sampled from the environment’s transition rules; \(a \sim \pi\) is shorthand for \(a \sim \pi(\cdot|s)\); and \(a' \sim \pi\) is shorthand for \(a' \sim \pi(\cdot|s')\).
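One way to see the self-consistency is to apply the right-hand side of the on-policy equation repeatedly until the values stop changing. The sketch below does this for a made-up two-state MDP; the transition table, policy, rewards, and discount factor are all assumptions for illustration.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
P = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
     (1, 0): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.0, 1: 1.0}}   # P(s'|s,a)
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}            # pi(a|s)
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}   # r(s,a)

def q_from_v(V, s, a):
    """r(s,a) + gamma * E_{s'~P}[V(s')]; equals Q^pi(s,a) when V = V^pi."""
    return r[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())

def bellman_backup(V):
    """E_{a~pi}[ r(s,a) + gamma * E_{s'~P}[V(s')] ] for every state."""
    return {s: sum(pi[s][a] * q_from_v(V, s, a) for a in ACTIONS) for s in STATES}

V = {s: 0.0 for s in STATES}
for _ in range(1000):                 # repeated backups converge since gamma < 1
    V = bellman_backup(V)
Q = {(s, a): q_from_v(V, s, a) for s in STATES for a in ACTIONS}
```

At convergence `V` is unchanged by a further backup, and averaging `Q` over the policy's action probabilities recovers `V`, since the inner expectation over \(a'\) in the \(Q^\pi\) equation is just \(V^\pi(s')\).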

The Bellman equations for the optimal value functions are

\[\begin{aligned}
V^*(s) &= \max_a \underset{s' \sim P}{E}[r(s,a)+\gamma V^*(s')],\\
Q^*(s,a) &= \underset{s' \sim P}{E}[r(s,a)+\gamma \max_{a'}[Q^* (s',a')]].
\end{aligned}
\]

The crucial difference between the Bellman equations for the on-policy value functions and the optimal value functions is the absence or presence of the \(\max\) over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick whichever action leads to the highest value.

Note

The term “Bellman backup” comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value.
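Repeatedly applying the Bellman backup with the \(\max\) over actions is commonly known as value iteration. The sketch below runs it on a made-up two-state MDP; the transition table, rewards, and discount factor are assumptions chosen so the answer can be checked by hand.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
P = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
     (1, 0): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.0, 1: 1.0}}   # P(s'|s,a)
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}   # r(s,a)

def backup(V, s, a):
    """Bellman backup for a state-action pair: r(s,a) + gamma * E_{s'~P}[V(s')]."""
    return r[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())

def optimal_backup(V):
    """Bellman backup for each state, taking the max over actions."""
    return {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}

V_star = {s: 0.0 for s in STATES}
for _ in range(1000):                 # converges since gamma < 1
    V_star = optimal_backup(V_star)

Q_star = {(s, a): backup(V_star, s, a) for s in STATES for a in ACTIONS}
greedy = {s: max(ACTIONS, key=lambda a: Q_star[(s, a)]) for s in STATES}
```

In this toy problem action 1 in state 1 pays 2 and stays in state 1, so \(V^*(1) = 2/(1-0.9) = 20\), and the greedy policy chooses action 1 in both states.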

Advantage Functions

Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.

The advantage function \(A^{\pi}(s,a)\) corresponding to a policy \(\pi\) describes how much better it is to take a specific action \(a\) in state \(s\), over randomly selecting an action according to \(\pi(\cdot|s)\), assuming you act according to \(\pi\) forever after. Mathematically, the advantage function is defined by

\[A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).
\]

Note

The advantage function is crucially important to policy gradient methods.
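A small numeric example of the advantage function in a single state, using made-up action probabilities and action values. It relies on the identity \(V^{\pi}(s) = E_{a \sim \pi}[Q^{\pi}(s,a)]\), which follows from the definitions of the two on-policy value functions.

```python
pi = {"left": 0.25, "right": 0.75}    # pi(a|s) in one state (hypothetical)
Q = {"left": 1.0, "right": 3.0}       # Q^pi(s,a) in that state (hypothetical)

V = sum(pi[a] * Q[a] for a in pi)     # V^pi(s) = E_{a~pi}[Q^pi(s,a)]
A = {a: Q[a] - V for a in Q}          # A^pi(s,a) = Q^pi(s,a) - V^pi(s)

# Averaged under the policy itself, the advantage is zero.
mean_advantage = sum(pi[a] * A[a] for a in pi)
```

Here `"right"` has a positive advantage and `"left"` a negative one: the advantage measures each action relative to the policy's average behavior rather than in absolute terms.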
