https://www.zhihu.com/question/277325426

https://github.com/jinglescode/reinforcement-learning-tic-tac-toe/blob/master/README.md

Intuition

After a long day at work, you are deciding between 2 choices: to head home and write a Medium article or hang out with friends at a bar. If you choose to hang out with friends, your friends will make you feel happy; whereas heading home to write an article, you’ll end up feeling tired after a long day at work. In this example, enjoying yourself is a reward and feeling tired is viewed as a negative reward, so why write articles?

Because in life, we don’t just think about immediate rewards; we plan a course of actions to determine the possible future rewards that may follow. Perhaps writing an article may brush up your understanding of a particular topic really well, get recognised and ultimately lands you that dream job you’ve always wanted. In this scenario, getting your dream job is a delayed reward from a list of actions you took, then we want to assign some value for being at those states (for example “going home and write an article”). In order to determine the value of a state, we call this the “value function”.

So how do we learn from our past? Let’s say you made some great decisions and are in the best state of your life. Now look back at the various decisions you’ve made to reach this stage: what do you attribute your success to? What are the previous states that led you to this success? What are the actions you did in the past that led you to this state of receiving this reward? How is the action you are doing now related to the potential reward you may receive in the future?

Reinforcement Learning — Implement TicTacToe

How to use reinforcement learning to play tic-tac-toe

https://github.com/MJeremy2017/reinforcement-learning-implementation/tree/master/TicTacToe

直接看这个「井字棋」的代码,结合反复阅读这几篇文章,慢慢理解 Q-Learning 是个什么东西,每个参数的意义又是什么。

https://github.com/ZuzooVn/machine-learning-for-software-engineers

https://machinelearningmastery.com/machine-learning-for-programmers/#comment-358985

https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56

What’s ‘Q’?

The ‘q’ in q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.

How Does Learning Rate Decay Help Modern Neural Networks?

https://smartlabai.medium.com/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc

RL 分为 Model-free 和 Model-based 两类

Q-Learning 就属于 Model-free

http://incompleteideas.net/book/the-book-2nd.html

http://incompleteideas.net/book/code/code2nd.html

Reinforcement Learning 强化学习入门的更多相关文章

  1. [Reinforcement Learning] 强化学习介绍

    随着AlphaGo和AlphaZero的出现,强化学习相关算法在这几年引起了学术界和工业界的重视.最近也翻了很多强化学习的资料,有时间了还是得自己动脑筋整理一下. 强化学习定义 先借用维基百科上对强化 ...

  2. The categories of Reinforcement Learning 强化学习分类

    RL分为三大类: (1)通过行为的价值来选取特定行为的方法,具体 包括使用表格学习的 q learning, sarsa, 使用神经网络学习的 deep q network: (2)直接输出行为的 p ...

  3. 【论文研读】强化学习入门之DQN

    最近在学习斯坦福2017年秋季学期的<强化学习>课程,感兴趣的同学可以follow一下,Sergey大神的,有英文字幕,语速有点快,适合有一些基础的入门生. 今天主要总结上午看的有关DQN ...

  4. 强化学习入门基础-马尔可夫决策过程(MDP)

    作者:YJLAugus 博客: https://www.cnblogs.com/yjlaugus 项目地址:https://github.com/YJLAugus/Reinforcement-Lear ...

  5. gym强化学习入门demo——随机选取动作 其实有了这些动作和反馈值以后就可以用来训练DNN网络了

    # -*- coding: utf-8 -*- import gym import time env = gym.make('CartPole-v0') observation = env.reset ...

  6. DQN(Deep Q-learning)入门教程(一)之强化学习介绍

    什么是强化学习? 强化学习(Reinforcement learning,简称RL)是和监督学习,非监督学习并列的第三种机器学习方法,如下图示: 首先让我们举一个小时候的例子: 你现在在家,有两个动作 ...

  7. <Machine Learning - 李宏毅> 学习笔记

    <Machine Learning - 李宏毅> 学习笔记 b站视频地址:李宏毅2019国语 第一章 机器学习介绍 Hand crafted rules Machine learning ...

  8. 【强化学习】MOVE37-Introduction(导论)/马尔科夫链/马尔科夫决策过程

    写在前面的话:从今日起,我会边跟着硅谷大牛Siraj的MOVE 37系列课程学习Reinforcement Learning(强化学习算法),边更新这个系列.课程包含视频和文字,课堂笔记会按视频为单位 ...

  9. 强化学习 reinforcement learning: An Introduction 第一章, tic-and-toc 代码示例 (结构重建版,注释版)

    强化学习入门最经典的数据估计就是那个大名鼎鼎的  reinforcement learning: An Introduction 了,  最近在看这本书,第一章中给出了一个例子用来说明什么是强化学习, ...

随机推荐

  1. 初学MyBatis(踩坑)Error querying database. Cause: java.sql.SQLException: java.lang.ClassCastException: java.math.BigInteger cannot be cast to java.lang.Long

    最近在学习Mybatis,代码全部根据教程写好了,一运行结果报了一个错误,主要错误内容: Caused by: org.apache.ibatis.exceptions.PersistenceExce ...

  2. Android内存溢出、内存泄漏常见案例及最佳实践总结

    内存溢出是Android开发中一个老大难的问题,相关的知识点比较繁杂,绝大部分的开发者都零零星星知道一些,但难以全面.本篇文档会尽量从广度和深度两个方面进行整理,帮助大家梳理这方面的知识点(基于Jav ...

  3. Apache ActiveMQ(CVE-2016-3088)

    影响版本 Apache ActiveMQ 5.0.0-5.13.x 路径地址 http://192.168.49.2:8161/admin/test/systemProperties.jsp 该漏洞允 ...

  4. Python小白的数学建模课-10.微分方程边值问题

    小白往往听到微分方程就觉得害怕,其实数学建模中的微分方程模型不仅没那么复杂,而且很容易写出高水平的数模论文. 本文介绍微分方程模型边值问题的建模与求解,不涉及算法推导和编程,只探讨如何使用 Pytho ...

  5. 使用vue实现简单的待办事项

    待办事项 效果图 目录结构 详细代码 AddNew.vue <template> <div> <input v-model="content"/> ...

  6. OpenGL学习笔记(四)纹理

    目录 要完成的纹理效果 纹理环绕方式 纹理过滤 多级渐远纹理 加载与创建纹理 stb_image库的使用方法 生成纹理对象 应用纹理 纹理单元 参考资料:OpenGL中文翻译 要完成的纹理效果 纹理是 ...

  7. noip模拟34[惨败]

    noip模拟34 solutions 我从来不为失败找借口,因为败了就是败了,没人听你诉说任何事情 今天很伤感,以来考试没考好,二来改题改半天也改不出来 这次算是炸出来了我经常范的一些错误,比如除以0 ...

  8. 攻防世界pwn高手区——pwn1

    攻防世界 -- pwn1 攻防世界的一道pwn题,也有一段时间没有做pwn了,找了一道栈题热身,发现还是有些生疏了. 题目流程 拖入IDA中,题目流程如图所示,当v0为1时,存在栈溢出漏洞.在gdb中 ...

  9. openstack June all-in-one 安装手册

    by lt,hyc 1.安全规范 表1:openstack用户和密码值设置 用户名 含义  本文的设置值 Admin openstack管理员用户 ADMIN_PASS Keystone openst ...

  10. [开源]C++实现控制台随机迷宫

    我全程使用TCHAR系列函数,亲测可以不改动代码兼容Unicode/ANSI开发环境,功能正常.大概有100行代码是来自网络的,我也做了改动,侵权请联系删除.本文作者szx0427,只发布于CSDN与 ...