Optimal Value Functions and Optimal Policy

Optimal Value Function is how much reward the best policy can get from a state s, which is the best senario given state s. It can be defined as:

Value Function and Optimal State-Value Function

Let's see firstly compare Value Function with Optimal Value Function. For example, in the student study case, the value function for the blue circle state under 50:50 policy is 7.4.

However, when we consider the Optimal State-Value function, 'branches' that may prevent us from getting the best scores are proned. For instance, the optimal senario for the blue circle state is having 100% probability to continue his study rather than going to pub.

Optimal Action-Value Function

Then we move to Action-Value Function, and the following equation also reveals the Optimal Action-Value Function is from the policy who gives the best Action Returns.

The Optimal Action-Value Function is strongly related to Optimal State-Value Function by:

The equation means when action a is taken at state s, what the best return is. At this condition, the probability of reaching each state and the immediate reward is determined, so the only variable is the State-Value function . Therefore it is obvious that obtaining the Optimal State-Value function is equivalent to holding the Optimal Action-Value Function.

Conversely, the Optimal State-Value function is the best combination of Action and the following states with Optimal State-value Functions:

Still in the student example, when we know the Optimal State-Value Function, the Optimal Action-Value Function can be calculated as:

Finally we can derive the best policy from the Optimal Action-Value Function:

This means the policy only picks up the best action at every state rather than having a probability distribution. This deterministic policy is the goal of Reinforcement Learning, as it will guide the action to complete the task.

Optimal Value Functions and Optimal Policy的更多相关文章

Reinforcement Learning: An Introduction读书笔记(3)--finite MDPs
> 目录 < Agent–Environment Interface Goals and Rewards Returns and Episodes Policies and Val ...
Machine Learning——吴恩达机器学习笔记（酷
[1] ML Introduction a. supervised learning & unsupervised learning 监督学习:从给定的训练数据集中学习出一个函数(模型参数), ...
RL_Learning
Key Concepts in RL 标签(空格分隔): RL_learning OpenAI Spinning Up原址 states and observations (状态和观测) action ...
Massively parallel supercomputer
A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...
Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
factoextra is an R package making easy to extract and visualize the output of exploratory multivaria ...
深度学习课程笔记（七）：模仿学习（imitation learning）
深度学习课程笔记(七):模仿学习(imitation learning) 2017.12.10 本文所涉及到的模仿学习,则是从给定的展示中进行学习.机器在这个过程中,也和环境进行交互,但是,并没有显 ...
DP Intro - OBST
http://radford.edu/~nokie/classes/360/dp-opt-bst.html Overview Optimal Binary Search Trees - Problem ...
[C5] Andrew Ng - Structuring Machine Learning Projects
About this Course You will learn how to build a successful machine learning project. If you aspire t ...
Reinforcement Learning Index Page
Reinforcement Learning Posts Step-by-step from Markov Property to Markov Decision Process Markov Dec ...

随机推荐

Excel VBA在生成副本的工作表中插入本工作簿中的VBA模块代码
即在工作簿中添加一个工作表,然后移出并存为新的工作簿,在移出前将本工作簿的一个模块的代码拷贝至新的工作簿.下面是关键代码: '===================================== ...
OSI模型——传输层
OSI模型——传输层运输层运输层概述运输层提供应用层端到端通信服务,通俗的讲,两个主机通讯,也就是应用层上的进程之间的通信,也就是转换为进程和进程之间的通信了,我们之前学到网络层,IP协议能将分 ...
C# 图片与Base64的相互转化
public ActionResult UploadSignature2(string src_data) { Class1.Base64StrToImage(src_data, "C:\\ ...
bootstrap复习
菜单 <div class="row">下拉菜单/分裂菜单</div> <div class="dropdown btn-group&quo ...
从0构建webpack开发环境(一) 一个简单webpack.config.js
本文基于webpack4.X,使用的包管理工具是yarn 概念相关就不搬运了,直接开始首先项目初始化 mkdir webpack-demo && cd webpack-demo ya ...
MongoDB的使用学习之（三）安装MongoDB以及一些基础操作
原文链接:http://www.cnblogs.com/huangxincheng/archive/2012/02/18/2356595.html 此博主的 8天学通MongoDB 系列还是不错的,本 ...
脚本_查看所有虚拟机磁盘以及 CPU 的使用量
#!bin/bash#作者:liusingbon#功能:查看所有虚拟机磁盘使用量以及 CPU 使用量信息read -p "按任意键进入查看页面.比如按下Enter键" keyvir ...
Petrozavodsk Winter-2018. AtCoder Contest. Problem I. ADD, DIV, MAX 吉司机线段树
题意:给你一个序列,需要支持以下操作:1:区间内的所有数加上某个值.2:区间内的所有数除以某个数(向下取整).3:询问某个区间内的最大值. 思路(从未见过的套路):维护区间最大值和区间最小值,执行2操 ...
hdu 4651 Partition(整数拆分+五边形数)
Partition Time Limit: 6000/3000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) Total ...
java中的Excel导出功能
public void exportExcel(Long activityId, HttpServletResponse response) throws IOException { // 获取统计报 ...

Optimal Value Functions and Optimal Policy

Optimal Value Functions and Optimal Policy的更多相关文章

随机推荐

热门专题