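Reinforcement Learning in Practice: Policy Gradient on the CartPole Game
Abstract: An agent learns inside an environment. Based on the environment's state (or on an observation of it), the agent executes an action, and the reward returned by the environment guides it toward better actions.
This article is shared from the Huawei Cloud Community post "Reinforcement Learning from Basics to Advanced - Cases and Practice [5.1]: Policy Gradient - CartPole Game Demo", author: 汀丶.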
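- Reinforcement learning (RL) is a field of machine learning that, unlike supervised and unsupervised learning, focuses on how to act in an environment so as to maximize the expected cumulative reward.
- Basic procedure: the agent learns inside the environment; based on the environment's state (or an observation of it), it executes an action, and the reward fed back by the environment guides it toward better actions.
For example, in this project's CartPole game, the agent controls the cart that carries the pole in the animation, and it has two actions: push the cart left or push it right.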
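The interaction loop described above can be sketched without any RL library. The snippet below is only an illustration: `ToyBalanceEnv`, `run_episode`, and `push_toward_center` are hypothetical names invented here, and the toy dynamics are far simpler than CartPole.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in environment: keep a 1-D position inside [-3, 3]."""

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):
        # action 0 pushes left, action 1 pushes right; a random gust adds noise
        push = -1 if action == 0 else 1
        self.position += push + random.choice([-2, 0, 2])
        done = abs(self.position) > 3
        reward = 0.0 if done else 1.0  # +1 for every step survived
        return self.position, reward, done


def run_episode(env, policy, max_steps=200):
    """Run one episode and return the list of rewards collected."""
    obs = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(obs)                  # agent picks an action from the observation
        obs, reward, done = env.step(action)  # environment answers with feedback
        rewards.append(reward)
        if done:
            break
    return rewards


def push_toward_center(obs):
    # Hand-written policy: push right when left of center, otherwise push left
    return 1 if obs < 0 else 0


random.seed(0)
rewards = run_episode(ToyBalanceEnv(), push_toward_center)
print("steps:", len(rewards), "return:", sum(rewards))
```

A learning agent would replace the hand-written policy with one whose parameters are adjusted from the collected rewards, which is what the rest of this article does for CartPole.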
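1. Introduction to Policy Gradient
- Reinforcement learning methods fall into two broad families: value-based and policy-based.
- Typical value-based algorithms are Q-learning and SARSA: they optimize the Q function toward the optimum and then derive the policy from it.
- The typical policy-based algorithm is Policy Gradient, which optimizes the policy function directly.
- The policy function is fitted with a neural network, so the policy gradient must be computed in order to optimize the policy network.
- The optimization objective is the expected return under the policy π(s,a): the sum of the returns R of all trajectories, weighted by each trajectory's probability p. When N is large enough, it can be approximated by sampling N episodes and averaging their returns.
- Differentiating the objective with respect to the parameters θ yields the policy gradient.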
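For reference, the objective and gradient described in the list above are commonly written as follows (the REINFORCE estimator). The symbols τ^n (the n-th sampled trajectory), T_n (its length), and p_θ (the probability assigned by the policy with parameters θ) are notation assumed here, not taken from the original.

```latex
\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau)
\approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})

\nabla_\theta \bar{R}_\theta
= \sum_{\tau} R(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)
\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n})\, \nabla_\theta \log p_\theta(a_t^{n} \mid s_t^{n})
```

The second line uses the identity ∇p = p ∇log p, and the transition probabilities of the environment drop out because they do not depend on θ. The code in this article weights each step by the reward-to-go G_t rather than by the whole-trajectory return R(τ^n), which is a common lower-variance variant.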
## Install dependencies
!pip install pygame
!pip install gym
!pip install atari_py
!pip install parl
import gym
import os
import random
import collections
import paddle
import paddle.nn as nn
import numpy as np
import paddle.nn.functional as F
2. Model
The model can be built from whatever neural network suits your needs.
PolicyGradient defines the forward network; its structure can be customized freely.
class PolicyGradient(nn.Layer):
    def __init__(self, act_dim):
        super(PolicyGradient, self).__init__()
        hid1_size = act_dim * 10
        # CartPole observations have 4 features
        self.linear1 = nn.Linear(in_features=4, out_features=hid1_size)
        self.linear2 = nn.Linear(in_features=hid1_size, out_features=act_dim)

    def forward(self, obs):
        out = self.linear1(obs)
        out = paddle.tanh(out)
        out = self.linear2(out)
        # Probabilities over actions (softmax along the last axis)
        out = F.softmax(out, axis=-1)
        return out
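To make the shapes in this network concrete, here is a library-free sketch of the same forward pass (linear, tanh, linear, softmax) for a 4-feature observation and 2 actions. The helper names and the uniform initializer are assumptions made for illustration; they are not Paddle's defaults.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map: y[j] = sum_i x[i] * weight[i][j] + bias[j]."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(obs, params):
    hidden = [math.tanh(v) for v in linear(obs, params["w1"], params["b1"])]
    return softmax(linear(hidden, params["w2"], params["b2"]))


def init_params(obs_dim, hid_dim, act_dim, rng):
    """Small random weights (hypothetical initializer, for illustration only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(obs_dim, hid_dim), "b1": [0.0] * hid_dim,
            "w2": mat(hid_dim, act_dim), "b2": [0.0] * act_dim}


rng = random.Random(0)
params = init_params(obs_dim=4, hid_dim=20, act_dim=2, rng=rng)  # hid_dim = act_dim * 10
probs = forward([0.01, -0.02, 0.03, 0.04], params)
print(probs)  # two probabilities that sum to 1
```

With small initial weights the two probabilities start close to 0.5 each, so an untrained policy behaves almost randomly.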
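3. Learning functions of the Agent
This part covers two pieces: exploration with the model (sample) and training the model (learn).
The Agent handles the interaction between the algorithm and the environment. The data generated during that interaction is passed to the Algorithm to update the Model, and data preprocessing is usually defined here as well.
def sample(obs, MODEL):
    global ACTION_DIM
    # Add a batch dimension: (obs_dim,) -> (1, obs_dim)
    obs = np.expand_dims(obs, axis=0)
    obs = paddle.to_tensor(obs, dtype='float32')
    act = MODEL(obs)
    # Drop the batch dimension and convert to a NumPy probability vector
    act_prob = paddle.squeeze(act, axis=0).numpy()
    # Sample an action according to the policy's probabilities (exploration)
    act = np.random.choice(range(ACTION_DIM), p=act_prob)
    return act

def learn(obs, action, reward, MODEL):
    obs = np.array(obs).astype('float32')
    obs = paddle.to_tensor(obs)
    act_prob = MODEL(obs)
    action = paddle.to_tensor(action.astype('int32'))
    # -log pi(a_t | s_t): the one-hot mask keeps only the chosen action's log-probability
    log_prob = paddle.sum(-1.0 * paddle.log(act_prob) * F.one_hot(action, act_prob.shape[1]), axis=1)
    reward = paddle.to_tensor(reward.astype('float32'))
    # Weight each step by its return G_t and sum over the episode
    cost = log_prob * reward
    cost = paddle.sum(cost)
    # Optimizer (dynamic graph). Note: it is re-created on every call, so Adam's
    # moment estimates do not carry over between updates.
    opt = paddle.optimizer.Adam(learning_rate=LEARNING_RATE,
                                parameters=MODEL.parameters())
    cost.backward()
    opt.step()
    opt.clear_grad()
    return cost.numpy()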
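The quantity that learn minimizes is the sum over the episode of -log π(a_t | s_t) · G_t. A small worked example in plain Python may help; the probabilities, actions, and returns below are made-up numbers, and `policy_gradient_loss` is a hypothetical helper rather than part of the original code.

```python
import math


def policy_gradient_loss(act_probs, actions, returns):
    """Sum over steps of -log pi(a_t | s_t) * G_t."""
    loss = 0.0
    for probs, action, g in zip(act_probs, actions, returns):
        # Indexing by the chosen action plays the role of the one-hot mask
        loss += -math.log(probs[action]) * g
    return loss


actions = [0, 0, 1]        # actions actually taken at each step
returns = [3.0, 2.0, 1.0]  # reward-to-go G_t for each step

act_probs = [[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]]
loss = policy_gradient_loss(act_probs, actions, returns)

# Same episode, but the policy assigns more probability to the first chosen action
better_probs = [[0.7, 0.3], [0.8, 0.2], [0.1, 0.9]]
better_loss = policy_gradient_loss(better_probs, actions, returns)

print(round(loss, 4), round(better_loss, 4))
```

Because the returns here are positive, raising the probability of a chosen action lowers the loss, and steps with a larger G_t pull harder on the parameters.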
4. Model gradient update algorithm
def run_train(env, MODEL):
    MODEL.train()
    obs_list, action_list, total_reward = [], [], []
    # Classic gym API (gym < 0.26): reset() returns only the observation
    obs = env.reset()
    while True:
        # Sample an action and step the game
        obs_list.append(obs)
        action = sample(obs, MODEL)  # sample an action from the policy
        action_list.append(action)
        # Classic gym API: step() returns (obs, reward, done, info)
        obs, reward, isOver, info = env.step(action)
        total_reward.append(reward)
        # End of the game
        if isOver:
            break
    return obs_list, action_list, total_reward

def evaluate(model, env, render=False):
    model.eval()
    eval_reward = []
    # Average the episode reward over 5 evaluation episodes
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            obs = np.expand_dims(obs, axis=0)
            obs = paddle.to_tensor(obs, dtype='float32')
            action = model(obs)
            # Greedy action: take the most probable one instead of sampling
            action = np.argmax(action.numpy())
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)
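The two functions above choose actions differently: run_train samples from the policy's probabilities (through sample), while evaluate always takes the most probable action. The difference can be sketched with the standard library alone; `sample_action` and `greedy_action` are hypothetical helpers and the probability vector is made up.

```python
import random


def sample_action(act_prob, rng):
    """Draw an action index with probability given by act_prob (exploration)."""
    return rng.choices(range(len(act_prob)), weights=act_prob, k=1)[0]


def greedy_action(act_prob):
    """Pick the most probable action (used for evaluation)."""
    return max(range(len(act_prob)), key=lambda i: act_prob[i])


rng = random.Random(0)
act_prob = [0.7, 0.3]  # e.g. P(push left) = 0.7, P(push right) = 0.3

draws = [sample_action(act_prob, rng) for _ in range(10000)]
freq_right = sum(draws) / len(draws)

print("greedy action:", greedy_action(act_prob))
print("fraction of sampled 'right' actions:", freq_right)
```

Sampling keeps trying the less likely action roughly 30% of the time here, which is what lets training discover better behavior. Greedy selection removes that randomness, which is one reason the test rewards in the log can differ noticeably from the training episode rewards.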
5. Training and evaluation functions
Set the hyperparameters
LEARNING_RATE = 0.001  # learning rate
OBS_DIM = None
ACTION_DIM = None

# Given the list of per-step rewards of one episode, compute G_t for every step
def calc_reward_to_go(reward_list, gamma=1.0):
    for i in range(len(reward_list) - 2, -1, -1):
        # G_t = r_t + γ·r_t+1 + ... = r_t + γ·G_t+1
        reward_list[i] += gamma * reward_list[i + 1]  # G_t (reward_list is updated in place)
    return np.array(reward_list)
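To see what the recursion G_t = r_t + γ·G_t+1 produces, here is a small variant that returns a new list instead of overwriting its input. The name `reward_to_go` and the sample rewards are chosen for illustration only.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return a new list of G_t values without modifying the input list."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that G_{t+1} is already known when computing G_t
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0, 1.0]
undiscounted = reward_to_go(rewards)
discounted = reward_to_go(rewards, gamma=0.5)

print(undiscounted)  # [4.0, 3.0, 2.0, 1.0]
print(discounted)    # [1.875, 1.75, 1.5, 1.0]
print(rewards)       # [1.0, 1.0, 1.0, 1.0] (input unchanged)
```

In CartPole every surviving step earns a reward of 1, so with γ = 1.0 the return of a step is simply the number of steps remaining in the episode: early actions in a long episode get the largest weight.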
def main():
    global OBS_DIM
    global ACTION_DIM
    # Initialize the game
    env = gym.make('CartPole-v0')
    # Observation dimension and number of actions
    action_dim = env.action_space.n
    obs_dim = env.observation_space.shape[0]
    OBS_DIM = obs_dim
    ACTION_DIM = action_dim
    # Create the policy network
    MODEL = PolicyGradient(ACTION_DIM)
    # Start training
    print("start training...")
    # Train for 1000 episodes; evaluation episodes are not counted
    for i in range(1000):
        obs_list, action_list, reward_list = run_train(env, MODEL)
        if i % 10 == 0:
            print("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))
        batch_obs = np.array(obs_list)
        batch_action = np.array(action_list)
        batch_reward = calc_reward_to_go(reward_list)
        cost = learn(batch_obs, batch_action, batch_reward, MODEL)
        if (i + 1) % 100 == 0:
            # render=True shows the rendered game; it must run locally, AI Studio cannot display it
            total_reward = evaluate(MODEL, env, render=False)
            print("Test reward: {}".format(total_reward))

if __name__ == '__main__':
    main()
W0630 11:26:18.969960 322 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0630 11:26:18.974581 322 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
start training...
Episode 0, Reward Sum 37.0.
Episode 10, Reward Sum 27.0.
Episode 20, Reward Sum 32.0.
Episode 30, Reward Sum 20.0.
Episode 40, Reward Sum 18.0.
Episode 50, Reward Sum 38.0.
Episode 60, Reward Sum 52.0.
Episode 70, Reward Sum 19.0.
Episode 80, Reward Sum 27.0.
Episode 90, Reward Sum 13.0.
Test reward: 42.8
Episode 100, Reward Sum 28.0.
Episode 110, Reward Sum 44.0.
Episode 120, Reward Sum 30.0.
Episode 130, Reward Sum 28.0.
Episode 140, Reward Sum 27.0.
Episode 150, Reward Sum 47.0.
Episode 160, Reward Sum 55.0.
Episode 170, Reward Sum 26.0.
Episode 180, Reward Sum 47.0.
Episode 190, Reward Sum 17.0.
Test reward: 42.8
Episode 200, Reward Sum 23.0.
Episode 210, Reward Sum 19.0.
Episode 220, Reward Sum 15.0.
Episode 230, Reward Sum 59.0.
Episode 240, Reward Sum 59.0.
Episode 250, Reward Sum 32.0.
Episode 260, Reward Sum 58.0.
Episode 270, Reward Sum 18.0.
Episode 280, Reward Sum 24.0.
Episode 290, Reward Sum 64.0.
Test reward: 116.8
Episode 300, Reward Sum 54.0.
Episode 310, Reward Sum 28.0.
Episode 320, Reward Sum 44.0.
Episode 330, Reward Sum 18.0.
Episode 340, Reward Sum 89.0.
Episode 350, Reward Sum 26.0.
Episode 360, Reward Sum 57.0.
Episode 370, Reward Sum 54.0.
Episode 380, Reward Sum 105.0.
Episode 390, Reward Sum 56.0.
Test reward: 94.0
Episode 400, Reward Sum 70.0.
Episode 410, Reward Sum 35.0.
Episode 420, Reward Sum 45.0.
Episode 430, Reward Sum 117.0.
Episode 440, Reward Sum 50.0.
Episode 450, Reward Sum 35.0.
Episode 460, Reward Sum 41.0.
Episode 470, Reward Sum 43.0.
Episode 480, Reward Sum 75.0.
Episode 490, Reward Sum 37.0.
Test reward: 57.6
Episode 500, Reward Sum 40.0.
Episode 510, Reward Sum 85.0.
Episode 520, Reward Sum 86.0.
Episode 530, Reward Sum 30.0.
Episode 540, Reward Sum 68.0.
Episode 550, Reward Sum 25.0.
Episode 560, Reward Sum 82.0.
Episode 570, Reward Sum 54.0.
Episode 580, Reward Sum 53.0.
Episode 590, Reward Sum 58.0.
Test reward: 147.2
Episode 600, Reward Sum 24.0.
Episode 610, Reward Sum 78.0.
Episode 620, Reward Sum 62.0.
Episode 630, Reward Sum 58.0.
Episode 640, Reward Sum 50.0.
Episode 650, Reward Sum 67.0.
Episode 660, Reward Sum 68.0.
Episode 670, Reward Sum 51.0.
Episode 680, Reward Sum 36.0.
Episode 690, Reward Sum 69.0.
Test reward: 84.2
Episode 700, Reward Sum 34.0.
Episode 710, Reward Sum 59.0.
Episode 720, Reward Sum 56.0.
Episode 730, Reward Sum 72.0.
Episode 740, Reward Sum 28.0.
Episode 750, Reward Sum 35.0.
Episode 760, Reward Sum 54.0.
Episode 770, Reward Sum 61.0.
Episode 780, Reward Sum 32.0.
Episode 790, Reward Sum 147.0.
Test reward: 123.0
Episode 800, Reward Sum 129.0.
Episode 810, Reward Sum 65.0.
Episode 820, Reward Sum 73.0.
Episode 830, Reward Sum 54.0.
Episode 840, Reward Sum 60.0.
Episode 850, Reward Sum 71.0.
Episode 860, Reward Sum 54.0.
Episode 870, Reward Sum 74.0.
Episode 880, Reward Sum 34.0.
Episode 890, Reward Sum 55.0.
Test reward: 104.8
Episode 900, Reward Sum 41.0.
Episode 910, Reward Sum 111.0.
Episode 920, Reward Sum 33.0.
Episode 930, Reward Sum 49.0.
Episode 940, Reward Sum 62.0.
Episode 950, Reward Sum 114.0.
Episode 960, Reward Sum 52.0.
Episode 970, Reward Sum 64.0.
Episode 980, Reward Sum 94.0.
Episode 990, Reward Sum 90.0.
Test reward: 72.2
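The test rewards in the log fluctuate from one evaluation to the next, so a single value says little about progress. One simple way to read the trend, offered here only as a suggestion, is to compare the average of the first five evaluations with the average of the last five. The numbers are copied from the log above.

```python
# "Test reward" values copied from the training log, in order of appearance
test_rewards = [42.8, 42.8, 116.8, 94.0, 57.6, 147.2, 84.2, 123.0, 104.8, 72.2]


def mean(values):
    return sum(values) / len(values)


half = len(test_rewards) // 2
first_half = mean(test_rewards[:half])
second_half = mean(test_rewards[half:])

print("evaluations 1-5 mean:  {:.2f}".format(first_half))
print("evaluations 6-10 mean: {:.2f}".format(second_half))
print("best evaluation:       {}".format(max(test_rewards)))
```

The average rises from about 70.8 to about 106.3, so the policy is improving, but it stays well below the 200-step cap of CartPole-v0, which suggests this run has not yet converged.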
Fork the project from the project link to run it directly.