Code: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On

Chapter 1 What is Reinforcement Learning

Learning - supervised, unsupervised, and reinforcement

Unlike an unsupervised learning setup, RL is not completely blind: we have a reward system.

(1) Observations depend on the agent's own behavior, so the agent can reach conclusions (such as "life is suffering") that could be totally wrong. In machine learning terms, this can be rephrased as having non-i.i.d. data.

(2) The exploration/exploitation dilemma is one of the open fundamental questions in RL.

(3) The third complication lies in the fact that reward can be seriously delayed from the actions that caused it.

RL formalisms and relations

RL entities and their communications

  • Agent and Environment are the two nodes of the graph
  • Actions are edges directed from Agent to Environment
  • Rewards and Observations are edges directed from Environment to Agent
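
A minimal sketch of this loop in code, assuming the classic Gym API the book uses (CartPole and the random policy are just placeholders):

```python
import gym

# A minimal agent-environment loop: actions flow from agent to environment,
# observations and rewards flow back (classic pre-0.26 Gym API).
env = gym.make("CartPole-v0")
obs = env.reset()
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()          # placeholder agent: random policy
    obs, reward, done, info = env.step(action)  # environment replies with observation and reward
    total_reward += reward

print("Episode reward:", total_reward)
```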

Reward

We don't define how frequently the agent receives this reward. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

The agent

The environment

Action

Two types of actions: discrete or continuous.
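
In Gym terms (Chapter 2), the two kinds of action spaces look like this; the bounds and sizes are arbitrary examples:

```python
import numpy as np
from gym.spaces import Box, Discrete

# Discrete: a fixed, finite set of mutually exclusive actions, e.g. push left / push right.
push_direction = Discrete(2)  # valid actions are 0 and 1

# Continuous: an action is a real value (or vector) within bounds, e.g. a steering angle.
steering = Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(push_direction.sample())  # e.g. 1
print(steering.sample())        # e.g. [0.42]
```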

Observations

Markov decision process

It is the theoretical foundation of RL, which makes it possible to start moving toward the methods used to solve the RL problem.

We start from the simplest case of a Markov process (also known as a Markov chain), then extend it with rewards, which turns it into a Markov reward process. Then we wrap this idea in one more envelope by adding actions, which leads us to a Markov decision process.

Markov process

You can always make your model more complex by extending the state space, which allows you to capture more dependencies in the model at the cost of a larger state space.

You can capture transition probabilities with a transition matrix, which is a square matrix of size N×N, where N is the number of states in your model.

The transition matrix can be estimated from observed episodes.
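
A sketch of that estimation for a toy three-state chain (the episodes are made up): count the observed transitions, then normalize each row into a probability distribution.

```python
import numpy as np

# Made-up observed episodes over states 0..2.
episodes = [
    [0, 1, 1, 2],
    [0, 0, 1, 2],
    [1, 2, 2, 0],
]

N = 3
counts = np.zeros((N, N))
for episode in episodes:
    for s, s_next in zip(episode, episode[1:]):
        counts[s, s_next] += 1

# Normalize each row so it sums to 1: P[i, j] = Pr(next = j | current = i).
transition_matrix = counts / counts.sum(axis=1, keepdims=True)
print(transition_matrix)
```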

Markov reward process

The first thing is to add reward to the Markov process model.

Representation: either a full reward transition matrix, or a more compact representation that attaches one reward value to each state; the compact form is applicable only if the reward depends only on the target state, which is not always the case.

The second thing is to add a discount factor gamma (from 0 to 1).
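
With gamma in place, the value of a sequence of rewards is its discounted return; a small sketch (the reward list is arbitrary):

```python
# Discounted return: G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
# Gamma near 0 makes the agent short-sighted; gamma near 1 makes
# distant rewards almost as important as immediate ones.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # fold from the end: g = r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```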

Markov decision process

Add an 'action' dimension to the transition matrix.
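
In code, the square N×N matrix then becomes an N_actions × N × N array; a sketch with made-up numbers:

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 2

# MDP transitions: P[a, s, s_next] = Pr(next = s_next | state = s, action = a).
# The values below are arbitrary, just to show the shape.
P = np.zeros((N_ACTIONS, N_STATES, N_STATES))
P[0] = [[0.8, 0.2, 0.0],
        [0.0, 0.9, 0.1],
        [0.1, 0.0, 0.9]]
P[1] = [[0.1, 0.9, 0.0],
        [0.0, 0.2, 0.8],
        [0.5, 0.0, 0.5]]

# For each action, every row must still be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```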

Chapter 2 OpenAI Gym

Chapter 3 Deep Learning with PyTorch

Chapter 4 The Cross-Entropy Method

Taxonomy of RL methods

  • Model-free or model-based
  • Value-based or policy-based
  • On-policy or off-policy

Practical cross-entropy
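
A condensed sketch of the practical cross-entropy loop on CartPole, assuming a one-hidden-layer policy network; the batch size, percentile, and learning rate are arbitrary choices, not necessarily the book's exact code:

```python
import gym
import numpy as np
import torch
import torch.nn as nn

HIDDEN, BATCH, PERCENTILE = 128, 16, 70  # arbitrary hyperparameters

env = gym.make("CartPole-v0")
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = nn.Sequential(nn.Linear(obs_size, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, n_actions))
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def play_episode():
    """Play one episode with the current policy, recording observations and actions."""
    obs, ep_obs, ep_acts, total = env.reset(), [], [], 0.0
    while True:
        probs = torch.softmax(net(torch.as_tensor(obs, dtype=torch.float32)), dim=0)
        action = np.random.choice(n_actions, p=probs.detach().numpy())
        ep_obs.append(obs)
        ep_acts.append(action)
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            return total, ep_obs, ep_acts

for epoch in range(50):
    batch = [play_episode() for _ in range(BATCH)]
    rewards = [r for r, _, _ in batch]
    bound = np.percentile(rewards, PERCENTILE)
    # Keep only the "elite" episodes and train the policy to imitate their actions.
    train_obs = [o for r, obs_l, act_l in batch if r >= bound for o in obs_l]
    train_act = [a for r, obs_l, act_l in batch if r >= bound for a in act_l]
    optimizer.zero_grad()
    logits = net(torch.as_tensor(np.array(train_obs), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(train_act))
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: mean reward {np.mean(rewards):.1f}")
```

The percentile filter is the heart of the method: the network is trained, with a cross-entropy loss, to imitate only the actions taken in the best-scoring episodes, so the policy gradually drifts toward behavior that earns higher reward.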
