强化学习_Deep Q Learning(DQN)_代码解析
Deep Q Learning
使用gym的CartPole作为环境,使用QDN解决离散动作空间的问题。
一、导入需要的包和定义超参数
import tensorflow as tf
import numpy as np
import gym
import time
import random
from collections import deque ##################### hyper parameters #################### # Hyper Parameters for DQN
GAMMA = 0.9 # discount factor for target Q
INITIAL_EPSILON = 0.5 # starting value of epsilon
FINAL_EPSILON = 0.01 # final value of epsilon
REPLAY_SIZE = 10000 # experience replay buffer size
BATCH_SIZE = 32 # size of minibatch
二、DQN构造函数
1、初始化经验重放buffer;
2、设置问题的状态空间维度,动作空间维度;
3、设置e-greedy的epsilon;
4、创建用于估计q值的Q网络,创建训练方法。
5、初始化tensorflow的session
def __init__(self, env):
# init experience replay
self.replay_buffer = deque()
# init some parameters
self.time_step = 0
self.epsilon = INITIAL_EPSILON
self.state_dim = env.observation_space.shape[0]
self.action_dim = env.action_space.n self.create_Q_network()
self.create_training_method() # Init session
self.session = tf.InteractiveSession()
self.session.run(tf.global_variables_initializer())
三、创建神经网络
创建一个3层全连接的神经网络,hidden layer有20个神经元。
def create_Q_network(self):
# network weights
W1 = self.weight_variable([self.state_dim,20])
b1 = self.bias_variable([20])
W2 = self.weight_variable([20,self.action_dim])
b2 = self.bias_variable([self.action_dim])
# input layer
self.state_input = tf.placeholder("float",[None,self.state_dim])
# hidden layers
h_layer = tf.nn.relu(tf.matmul(self.state_input,W1) + b1)
# Q Value layer
self.Q_value = tf.matmul(h_layer,W2) + b2 def weight_variable(self,shape):
initial = tf.truncated_normal(shape)
return tf.Variable(initial) def bias_variable(self,shape):
initial = tf.constant(0.01, shape = shape)
return tf.Variable(initial)
定义cost function和优化的方法,使“实际”q值(y)与当前网络估计的q值的差值尽可能小,即使当前网络尽可能接近真实的q值。
def create_training_method(self):
self.action_input = tf.placeholder("float",[None,self.action_dim]) # one hot presentation
self.y_input = tf.placeholder("float",[None])
Q_action = tf.reduce_sum(tf.multiply(self.Q_value,self.action_input),reduction_indices = 1)
self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)
从buffer中随机取样BATCH_SIZE大小的样本,计算y(batch中(s,a)在当前网络下的实际q值)
if done: y_batch.append(reward_batch[i])
else : y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
def train_Q_network(self):
self.time_step += 1
# Step 1: obtain random minibatch from replay memory
minibatch = random.sample(self.replay_buffer,BATCH_SIZE)
state_batch = [data[0] for data in minibatch]
action_batch = [data[1] for data in minibatch]
reward_batch = [data[2] for data in minibatch]
next_state_batch = [data[3] for data in minibatch] # Step 2: calculate y
y_batch = []
Q_value_batch = self.Q_value.eval(feed_dict={self.state_input:next_state_batch})
for i in range(0,BATCH_SIZE):
done = minibatch[i][4]
if done:
y_batch.append(reward_batch[i])
else :
y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i])) self.optimizer.run(feed_dict={
self.y_input:y_batch,
self.action_input:action_batch,
self.state_input:state_batch
})
四、Agent感知环境的接口
每次决策采取的动作,得到环境的反馈,将(s, a, r, s_, done)存入经验重放buffer。当buffer中经验数量大于batch_size时开始训练。
def perceive(self,state,action,reward,next_state,done):
one_hot_action = np.zeros(self.action_dim)
one_hot_action[action] = 1
self.replay_buffer.append((state,one_hot_action,reward,next_state,done))
if len(self.replay_buffer) > REPLAY_SIZE:
self.replay_buffer.popleft() if len(self.replay_buffer) > BATCH_SIZE:
self.train_Q_network()
五、决策(选取action)
两种选取方式greedy和e-greedy。
def egreedy_action(self,state):
Q_value = self.Q_value.eval(feed_dict = {
self.state_input:[state]
})[0]
if random.random() <= self.epsilon:
self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
return random.randint(0,self.action_dim - 1)
else:
self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
return np.argmax(Q_value) def action(self,state):
return np.argmax(self.Q_value.eval(feed_dict = {
self.state_input:[state]
})[0])
Agent完整代码:
import tensorflow as tf
import numpy as np
import gym
import time
import random
from collections import deque ##################### hyper parameters #################### # Hyper Parameters for DQN
GAMMA = 0.9 # discount factor for target Q
INITIAL_EPSILON = 0.5 # starting value of epsilon
FINAL_EPSILON = 0.01 # final value of epsilon
REPLAY_SIZE = 10000 # experience replay buffer size
BATCH_SIZE = 32 # size of minibatch ############################### DQN #################################### class DQN():
# DQN Agent
def __init__(self, env):
# init experience replay
self.replay_buffer = deque()
# init some parameters
self.time_step = 0
self.epsilon = INITIAL_EPSILON
self.state_dim = env.observation_space.shape[0]
self.action_dim = env.action_space.n self.create_Q_network()
self.create_training_method() # Init session
self.session = tf.InteractiveSession()
self.session.run(tf.global_variables_initializer()) def create_Q_network(self):
# network weights
W1 = self.weight_variable([self.state_dim,20])
b1 = self.bias_variable([20])
W2 = self.weight_variable([20,self.action_dim])
b2 = self.bias_variable([self.action_dim])
# input layer
self.state_input = tf.placeholder("float",[None,self.state_dim])
# hidden layers
h_layer = tf.nn.relu(tf.matmul(self.state_input,W1) + b1)
# Q Value layer
self.Q_value = tf.matmul(h_layer,W2) + b2 def create_training_method(self):
self.action_input = tf.placeholder("float",[None,self.action_dim]) # one hot presentation
self.y_input = tf.placeholder("float",[None])
Q_action = tf.reduce_sum(tf.multiply(self.Q_value,self.action_input),reduction_indices = 1)
self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost) def perceive(self,state,action,reward,next_state,done):
one_hot_action = np.zeros(self.action_dim)
one_hot_action[action] = 1
self.replay_buffer.append((state,one_hot_action,reward,next_state,done))
if len(self.replay_buffer) > REPLAY_SIZE:
self.replay_buffer.popleft() if len(self.replay_buffer) > BATCH_SIZE:
self.train_Q_network() def train_Q_network(self):
self.time_step += 1
# Step 1: obtain random minibatch from replay memory
minibatch = random.sample(self.replay_buffer,BATCH_SIZE)
state_batch = [data[0] for data in minibatch]
action_batch = [data[1] for data in minibatch]
reward_batch = [data[2] for data in minibatch]
next_state_batch = [data[3] for data in minibatch] # Step 2: calculate y
y_batch = []
Q_value_batch = self.Q_value.eval(feed_dict={self.state_input:next_state_batch})
for i in range(0,BATCH_SIZE):
done = minibatch[i][4]
if done:
y_batch.append(reward_batch[i])
else :
y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i])) self.optimizer.run(feed_dict={
self.y_input:y_batch,
self.action_input:action_batch,
self.state_input:state_batch
}) def egreedy_action(self,state):
Q_value = self.Q_value.eval(feed_dict = {
self.state_input:[state]
})[0]
if random.random() <= self.epsilon:
self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
return random.randint(0,self.action_dim - 1)
else:
self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
return np.argmax(Q_value) def action(self,state):
return np.argmax(self.Q_value.eval(feed_dict = {
self.state_input:[state]
})[0]) def weight_variable(self,shape):
initial = tf.truncated_normal(shape)
return tf.Variable(initial) def bias_variable(self,shape):
initial = tf.constant(0.01, shape = shape)
return tf.Variable(initial)
训练agent:
from DQN import DQN
import gym
import numpy as np
import time ENV_NAME = 'CartPole-v1'
EPISODE = 3000 # Episode limitation
STEP = 300 # Step limitation in an episode
TEST = 10 # The number of experiment test every 100 episode def main():
# initialize OpenAI Gym env and dqn agent
env = gym.make(ENV_NAME)
agent = DQN(env) for episode in range(EPISODE):
# initialize task
state = env.reset()
# Train
ep_reward = 0
for step in range(STEP):
action = agent.egreedy_action(state) # e-greedy action for train
next_state,reward,done,_ = env.step(action)
# Define reward for agent
reward = -10 if done else 1
ep_reward += reward
agent.perceive(state,action,reward,next_state,done)
state = next_state
if done:
#print('episode complete, reward: ', ep_reward)
break
# Test every 100 episodes
if episode % 100 == 0:
total_reward = 0
for i in range(TEST):
state = env.reset()
for j in range(STEP):
#env.render()
action = agent.action(state) # direct action for test
state,reward,done,_ = env.step(action)
total_reward += reward
if done:
break
ave_reward = total_reward/TEST
print ('episode: ',episode,'Evaluation Average Reward:',ave_reward) if __name__ == '__main__':
main()
reference:
https://www.cnblogs.com/pinard/p/9714655.html
https://github.com/ljpzzz/machinelearning/blob/master/reinforcement-learning/dqn.py
强化学习_Deep Q Learning(DQN)_代码解析的更多相关文章
- 强化学习9-Deep Q Learning
之前讲到Sarsa和Q Learning都不太适合解决大规模问题,为什么呢? 因为传统的强化学习都有一张Q表,这张Q表记录了每个状态下,每个动作的q值,但是现实问题往往极其复杂,其状态非常多,甚至是连 ...
- 强化学习(十二) Dueling DQN
在强化学习(十一) Prioritized Replay DQN中,我们讨论了对DQN的经验回放池按权重采样来优化DQN算法的方法,本文讨论另一种优化方法,Dueling DQN.本章内容主要参考了I ...
- 【转载】 强化学习(十一) Prioritized Replay DQN
原文地址: https://www.cnblogs.com/pinard/p/9797695.html ------------------------------------------------ ...
- 强化学习(十一) Prioritized Replay DQN
在强化学习(十)Double DQN (DDQN)中,我们讲到了DDQN使用两个Q网络,用当前Q网络计算最大Q值对应的动作,用目标Q网络计算这个最大动作对应的目标Q值,进而消除贪婪法带来的偏差.今天我 ...
- 强化学习(四)—— DQN系列(DQN, Nature DQN, DDQN, Dueling DQN等)
1 概述 在之前介绍的几种方法,我们对值函数一直有一个很大的限制,那就是它们需要用表格的形式表示.虽说表格形式对于求解有很大的帮助,但它也有自己的缺点.如果问题的状态和行动的空间非常大,使用表格表示难 ...
- 强化学习10-Deep Q Learning-fix target
针对 Deep Q Learning 可能无法收敛的问题,这里提出了一种 fix target 的方法,就是冻结现实神经网络,延时更新参数. 这个方法的初衷是这样的: 1. 之前我们每个(批)记忆都 ...
- 强化学习(3)-----DQN
看这篇https://blog.csdn.net/qq_16234613/article/details/80268564 1.DQN 原因:在普通的Q-learning中,当状态和动作空间是离散且维 ...
- 强化学习 - Q-learning Sarsa 和 DQN 的理解
本文用于基本入门理解. 强化学习的基本理论 : R, S, A 这些就不说了. 先设想两个场景: 一. 1个 5x5 的 格子图, 里面有一个目标点, 2个死亡点二. 一个迷宫, 一个出发点, ...
- 转:强化学习(Reinforcement Learning)
机器学习算法大致可以分为三种: 1. 监督学习(如回归,分类) 2. 非监督学习(如聚类,降维) 3. 增强学习 什么是增强学习呢? 增强学习(reinforcementlearning, RL)又叫 ...
随机推荐
- sql之外键变种
多对一 : 只需设个外键 外键变种之一对一:普通外键关联的表是一对多关系,如果外键上再加上唯一索引,表就会变成一对一关系. 外键变种之多对多:
- 3 手写Java HashMap核心源码
手写Java HashMap核心源码 上一章手写LinkedList核心源码,本章我们来手写Java HashMap的核心源码. 我们来先了解一下HashMap的原理.HashMap 字面意思 has ...
- 洛谷 - P1891 - 疯狂LCM - 线性筛
另一道数据范围不一样的题:https://www.cnblogs.com/Yinku/p/10987912.html $F(n)=\sum\limits_{i=1}^{n} lcm(i,n) $ $\ ...
- C#中var关键字用法分析
原文连接 本文实例分析了C#中var关键字用法.分享给大家供大家参考.具体方法如下: C#关键字是伴随着.NET 3.5以后,伴随着匿名函数.LINQ而来, 由编译器帮我们推断具体的类型.总体来说,当 ...
- Server.MapPath()相关
Server.MapPath()相关 1. Server.MapPath()介绍 Server.MapPath(string path)作用是返回与Web服务器上的指定虚拟路径相对应的物理文 ...
- Metabolic and gut microbial characterization of obesity-prone mice under high-fat diet (文献分享一组-赵容丽)
题目:高脂饮食下易肥胖小鼠的代谢和肠道微生物特性研究 Metabolic and gut microbial characterization of obesity-prone mice under ...
- 【TeamViewer】v13.2.26558版本 修改ID
TeamViewer是一款远程协作软件,可以让你在一台机器上操作另一台机器.比如我最近就经常在家里连接公司的电脑进行远程工作.可以说是对于程序员很好用的一个软件. TeamViewer 使用频繁后会被 ...
- hibernate错误总结2
- foreach循环报NPE空指针异常
前言 最近debug时忽然发现,如果一个集合赋值为null,那么对该集合进行foreach循环(也叫增强for循环)时,会报NPE(即空指针异常NullPointerException). 代码如下: ...
- nginx 一些配置
worker_processes 4; #工作进程数 events { #epoll是多路复用IO(I/O Multiplexing)中的一种方式, #仅用于linux2.6以上内核,可以大大提高ng ...