Policy Improvement and Policy Iteration
From the last post, we know how to evaluate a policy. But that is not enough: the purpose of policy evaluation is to improve the policy until we finally reach the optimal one. So in this post, we discuss how to improve a given policy, and how to get from a given policy to the optimal policy.
First, once a policy has been evaluated, its State-Value function is known for every state, and the Action-Value of each action follows from it by looking one step ahead. That is, at a given state s, we know which action gives the largest expected return.
In the puzzle wandering example, we evaluated the random policy. The resulting State-Value function can also be used for policy improvement. After one step of calculation, we can conclude that at the circled location, moving left is better than picking a direction at random, because the left side has a higher value.
After three steps, we have a much better picture of the map, and we can replace the random policy with a new, better one.
The way to improve the current policy is to greedily pick the best action in every state. It is worth noting that picking actions greedily does not mean the algorithm only considers one step ahead (too greedy to consider multiple steps). The values it compares already summarize several steps: when k=3, the State-Value function has looked three steps ahead, so the greedy choice selects the best action with respect to those k steps.
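A minimal sketch of greedy improvement, assuming a hypothetical 4x4 grid with terminal cells in two opposite corners, a reward of -1 per move, and no discounting. The grid layout and every name here (`step`, `evaluate_random_policy`, `greedy_actions`) are illustrative assumptions, not taken from the original example.

```python
# Hypothetical grid world; layout, rewards and names are assumptions.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}


def step(state, action):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return (nr, nc), -1.0


def evaluate_random_policy(k, gamma=1.0):
    """Run k sweeps of iterative policy evaluation for the uniform random policy."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(k):
        new_V = {}
        for s in V:
            if s in TERMINALS:
                new_V[s] = 0.0
                continue
            total = 0.0
            for a in ACTIONS:
                s2, reward = step(s, a)
                total += 0.25 * (reward + gamma * V[s2])
            new_V[s] = total
        V = new_V
    return V


def greedy_actions(V, state, gamma=1.0):
    """Return every action whose one-step lookahead value is maximal in `state`."""
    q = {}
    for a in ACTIONS:
        s2, reward = step(state, a)
        q[a] = reward + gamma * V[s2]
    best = max(q.values())
    return [a for a, value in q.items() if abs(value - best) < 1e-9]


if __name__ == "__main__":
    for k in (1, 3):
        V = evaluate_random_policy(k)
        print(k, greedy_actions(V, (0, 1)), greedy_actions(V, (1, 1)))
```

With k=1 the cell next to the terminal corner already prefers moving toward it, while a cell further away still sees all four actions as equally good; with k=3 that cell has enough information to rule out two of them.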
The Policy Iteration algorithm keeps alternating policy evaluation and policy improvement until the policy becomes stable.
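In each improvement step, the new policy picks, in every state, the action with the highest Action-Value under the current policy. Taking that single best action and then following the old policy is therefore at least as good as following the old policy all the way.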
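The equation that accompanied this statement is not preserved here. In the standard notation (an assumption: π is the current deterministic policy, π' the greedy policy, q_π and v_π its Action-Value and State-Value functions), the improvement step can be written as:

```latex
\pi'(s) = \arg\max_{a} q_\pi(s, a),
\qquad
q_\pi\bigl(s, \pi'(s)\bigr) = \max_{a} q_\pi(s, a) \;\ge\; q_\pi\bigl(s, \pi(s)\bigr) = v_\pi(s)
```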
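In outline, the algorithm is: start from an arbitrary policy, evaluate it, improve it greedily, and repeat the last two steps until the improvement step no longer changes the policy.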
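The loop can be sketched as follows for a small tabular problem. This is an illustration under stated assumptions rather than the exact algorithm from the original figure: the model format `P[s][a] = [(probability, next_state, reward), ...]`, the discount factor 0.9, the threshold `theta`, and the five-cell `corridor` example are all hypothetical.

```python
def policy_evaluation(P, policy, gamma, theta=1e-8):
    """Iteratively compute the State-Value function of a deterministic policy."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def policy_improvement(P, V, gamma):
    """Pick, in every state, the action with the best one-step lookahead value."""
    new_policy = []
    for s in range(len(P)):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        # Ties are broken in favour of the lowest action index.
        new_policy.append(max(range(len(q)), key=q.__getitem__))
    return new_policy


def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)  # arbitrary starting policy: always action 0
    while True:
        V = policy_evaluation(P, policy, gamma)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy


def corridor(n=5):
    """Hypothetical corridor: action 0 moves left, action 1 moves right,
    each move costs -1, and the rightmost cell is an absorbing goal."""
    P = []
    for s in range(n):
        if s == n - 1:
            P.append([[(1.0, s, 0.0)], [(1.0, s, 0.0)]])
        else:
            P.append([[(1.0, max(s - 1, 0), -1.0)], [(1.0, s + 1, -1.0)]])
    return P


if __name__ == "__main__":
    policy, V = policy_iteration(corridor())
    print(policy)
    print([round(v, 3) for v in V])
```

Starting from the "always move left" policy, the loop should settle on moving right in every non-goal cell. A discount factor below 1 is used so that evaluating the initial policy, which never reaches the goal, still converges.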