From the last post about MDPs, we know the environment consists of five basic elements:

S: the state space of the environment;

A: the action space that the environment allows;

{P^a_{s,s'}}: the transition matrices, i.e. the probabilities of the environment transitioning from one state to another when an action is taken. The number of matrices equals the number of actions.

R: the reward, i.e. how much reward an agent receives from the environment when the system transitions from state s to s' due to action a. (Other definitions of the reward are also in use, e.g. as a function of the state and action only.)

γ: the discount factor, which determines how rewards are discounted over time.
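Together, these five elements define an MDP; in the common notation, the tuple is

$$\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$$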

How an MDP Differs from an MRP:

Keyword: Action

The five elements of an MDP can be illustrated by the chart below, in which the green circles are states, the orange circles are actions, and there are two rewards. In an MRP or a plain Markov Process, we know the transition matrix directly. In an MDP, however, the transition path from one state to another is interrupted by actions. It is also worth noting that when the environment is in a certain state, the environment itself assigns no probabilities to the actions. The reason is quite understandable: we all live in the same world (environment), but different people behave differently.

Agent and Policy

The agent is the person or robot that interacts with the environment in Reinforcement Learning. Like human beings, every agent may behave differently under the same conditions. The probability distribution over behaviors in each state is the Policy. There may be many possible actions in an environment, but a specific agent (person) may take only one or a few of them in a certain state. Given a state s, the policy is defined by:
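$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$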

An example of policy is shown below:
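For instance (a hypothetical policy, chosen to match the numbers used later in this post): in state s0 the agent takes action a0 with probability π(a0 | s0) = 0.4 and action a1 with probability π(a1 | s0) = 0.6, the probabilities in each state summing to 1.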

From MDP to MRP: MDP + Policy

Transition Matrix: Without a policy, we do not know exactly the probability of transitioning from state s to state s', because different agents may take actions with different probabilities. Once we have π, we can calculate the state transition matrix:
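$$\mathcal{P}^\pi_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{s,s'}$$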

In the chart above, for example, if an agent takes actions a0 and a1 with probabilities 0.4 and 0.6, the transition probability from s0 to s1 is 0.4 × 0.5 + 0.6 × 1 = 0.8.
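The same computation as a minimal NumPy sketch; the s0 rows match the example above, while everything for s1 is made up purely for illustration:

```python
import numpy as np

states = ["s0", "s1"]
actions = ["a0", "a1"]

# P[a][i, j]: probability of moving from states[i] to states[j] under action a.
P = {
    "a0": np.array([[0.5, 0.5],    # s0 --a0--> s1 with probability 0.5
                    [0.0, 1.0]]),  # s1 row: hypothetical
    "a1": np.array([[0.0, 1.0],    # s0 --a1--> s1 with probability 1.0
                    [0.0, 1.0]]),  # s1 row: hypothetical
}

# pi[i, k]: probability that the agent takes actions[k] in states[i].
pi = np.array([[0.4, 0.6],
               [0.4, 0.6]])

# P^pi_{s,s'} = sum_a pi(a|s) * P^a_{s,s'}
P_pi = sum(pi[:, [k]] * P[a] for k, a in enumerate(actions))
print(P_pi[0, 1])  # 0.4*0.5 + 0.6*1.0 = 0.8
```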

Reward:

In an MDP the reward function is associated with actions, and it averages out the uncertainty in the outcomes of an action:
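$$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$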

Once we have the policy π, we know the action distribution of a specific agent, so we can average out the uncertainty over actions and measure how much immediate reward the agent can expect to receive in state s under policy π:
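$$\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$$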

So now we have gone back from an MDP to an MRP: given the policy π, the induced Markov Reward Process is defined by the tuple ⟨S, P^π, R^π, γ⟩.

Two Value Functions:

State Value Function:

The State Value Function is the same as the value function in an MRP. It is used to evaluate how good it is to be in a state s (by immediate and future reward); the only difference is that it averages out the uncertainty of actions under policy π. It is of the form:
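$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ is the discounted return.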

Action Value Function:

To average out the uncertainty of actions, it is necessary to know the expected return of each possible action. So in an MDP we also have the Action Value Function, which reveals whether an action is good or bad when the agent takes a particular action a in state s:
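$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$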

If we take the expectation of the Action Value Function over the actions available in state s, we end up with the State Value Function v:
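$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$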

Similarly, when an action is taken, the system may end up in different states. When we average out the uncertainty of the state transition, we can go back from the State Value Function to the Action Value Function:
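$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s,s'}\, v_\pi(s')$$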

If we put the two together, we get the Bellman expectation equation for the State Value Function:
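$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s,s'}\, v_\pi(s') \right)$$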

Another way: substituting in the opposite direction gives the Bellman expectation equation for the Action Value Function:
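$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s,s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$

Since an MDP under a fixed policy π is just an MRP, v_π can also be computed in closed form from v_π = R^π + γP^π v_π. Below is a minimal NumPy sketch, reusing the hypothetical numbers from the earlier snippet; it is an illustration under those assumptions, not a general implementation:

```python
import numpy as np

# Policy evaluation in closed form: v = (I - gamma * P^pi)^{-1} R^pi.
# P_pi is the policy-averaged matrix from the previous snippet; R_pi
# uses made-up rewards (1 in s0, 0 in s1), purely for illustration.
gamma = 0.9
P_pi = np.array([[0.2, 0.8],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 0.0])

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(v_pi)  # [1.2195..., 0.0]

# q_pi then follows from q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s').
```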
