Deep Q Learning

We use gym's CartPole as the environment and use DQN to solve this discrete-action-space problem.

1. Import the required packages and define the hyperparameters

    import tensorflow as tf
    import numpy as np
    import gym
    import time
    import random
    from collections import deque

    ##################### hyper parameters ####################

    # Hyper Parameters for DQN
    GAMMA = 0.9            # discount factor for target Q
    INITIAL_EPSILON = 0.5  # starting value of epsilon
    FINAL_EPSILON = 0.01   # final value of epsilon
    REPLAY_SIZE = 10000    # experience replay buffer size
    BATCH_SIZE = 32        # size of minibatch
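The code in this post uses TensorFlow 1.x APIs (placeholders, sessions). If TensorFlow 2.x is installed, a minimal compatibility shim (an assumption, not part of the original post) is to replace the plain `import tensorflow as tf` above with the v1 compatibility module:

    # Assumption: TensorFlow 2.x is installed but the TF 1.x-style code below is kept unchanged.
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()  # re-enables placeholders, sessions, and other TF 1.x behavior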

2. The DQN constructor

The constructor does the following:

1. initializes the experience replay buffer;
2. sets the dimensions of the problem's state space and action space;
3. sets epsilon for the ε-greedy policy;
4. creates the Q network used to estimate Q values, and creates the training method;
5. initializes the TensorFlow session.

    def __init__(self, env):
        # init experience replay
        self.replay_buffer = deque()
        # init some parameters
        self.time_step = 0
        self.epsilon = INITIAL_EPSILON
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n

        self.create_Q_network()
        self.create_training_method()

        # Init session
        self.session = tf.InteractiveSession()
        self.session.run(tf.global_variables_initializer())

3. Building the neural network

Build a fully connected network with three layers (input, hidden, output); the single hidden layer has 20 neurons.

    def create_Q_network(self):
        # network weights
        W1 = self.weight_variable([self.state_dim, 20])
        b1 = self.bias_variable([20])
        W2 = self.weight_variable([20, self.action_dim])
        b2 = self.bias_variable([self.action_dim])
        # input layer
        self.state_input = tf.placeholder("float", [None, self.state_dim])
        # hidden layers
        h_layer = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
        # Q Value layer
        self.Q_value = tf.matmul(h_layer, W2) + b2

    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.01, shape=shape)
        return tf.Variable(initial)
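One thing to note: weight_variable calls tf.truncated_normal(shape) with its default standard deviation of 1.0. Many DQN implementations use a smaller initialization so the initial Q estimates stay small; a possible variant (an assumption, not part of the original code) would be:

    def weight_variable(self, shape):
        # assumption: stddev=0.1 instead of the default 1.0, for smaller initial Q estimates
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)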

Define the cost function and the optimizer: minimize the difference between the target Q value (y) and the Q value estimated by the current network, so that the current network's estimates move as close as possible to the true Q values.

    def create_training_method(self):
        self.action_input = tf.placeholder("float", [None, self.action_dim])  # one-hot representation
        self.y_input = tf.placeholder("float", [None])
        Q_action = tf.reduce_sum(tf.multiply(self.Q_value, self.action_input), reduction_indices=1)
        self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
        self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)
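Because action_input is a one-hot vector, multiplying it element-wise with Q_value and summing along the action axis picks out the predicted Q(s, a) of the action that was actually taken. A minimal NumPy sketch of this masking trick (the numbers are made up for illustration):

    import numpy as np

    q_values = np.array([[1.2, 3.4],        # Q(s0, a0), Q(s0, a1)
                         [0.5, 0.1]])       # Q(s1, a0), Q(s1, a1)
    one_hot_actions = np.array([[0., 1.],   # action 1 was taken in s0
                                [1., 0.]])  # action 0 was taken in s1
    q_taken = np.sum(q_values * one_hot_actions, axis=1)
    print(q_taken)  # [3.4 0.5] -- same idea as tf.reduce_sum(Q_value * action_input, 1)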

Sample a random minibatch of BATCH_SIZE transitions from the buffer and compute y, the target Q value of each (s, a) in the batch under the current network:

    if done: y_batch.append(reward_batch[i])
    else:    y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
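Written as a formula, this is the standard Q-learning target (θ denotes the current network's parameters; note that this basic version does not use a separate target network):

    y_i =
    \begin{cases}
      r_i, & \text{if } s'_i \text{ is terminal} \\
      r_i + \gamma \max_{a'} Q(s'_i, a'; \theta), & \text{otherwise}
    \end{cases}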

    def train_Q_network(self):
        self.time_step += 1
        # Step 1: obtain random minibatch from replay memory
        minibatch = random.sample(self.replay_buffer, BATCH_SIZE)
        state_batch = [data[0] for data in minibatch]
        action_batch = [data[1] for data in minibatch]
        reward_batch = [data[2] for data in minibatch]
        next_state_batch = [data[3] for data in minibatch]

        # Step 2: calculate y
        y_batch = []
        Q_value_batch = self.Q_value.eval(feed_dict={self.state_input: next_state_batch})
        for i in range(0, BATCH_SIZE):
            done = minibatch[i][4]
            if done:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))

        self.optimizer.run(feed_dict={
            self.y_input: y_batch,
            self.action_input: action_batch,
            self.state_input: state_batch
        })

4. The interface through which the agent perceives the environment

Each time the agent takes an action and receives the environment's feedback, the transition (s, a, r, s_, done) is stored in the experience replay buffer. Once the buffer holds more than BATCH_SIZE transitions, training begins.

    def perceive(self, state, action, reward, next_state, done):
        one_hot_action = np.zeros(self.action_dim)
        one_hot_action[action] = 1
        self.replay_buffer.append((state, one_hot_action, reward, next_state, done))
        if len(self.replay_buffer) > REPLAY_SIZE:
            self.replay_buffer.popleft()

        if len(self.replay_buffer) > BATCH_SIZE:
            self.train_Q_network()

5. Decision making (action selection)

Two selection strategies are provided: the ε-greedy strategy (egreedy_action) is used during training for exploration, while the greedy strategy (action) is used for evaluation.

    def egreedy_action(self, state):
        Q_value = self.Q_value.eval(feed_dict={
            self.state_input: [state]
        })[0]
        # anneal epsilon linearly, but never let it drop below FINAL_EPSILON
        self.epsilon = max(FINAL_EPSILON,
                           self.epsilon - (INITIAL_EPSILON - FINAL_EPSILON) / 10000)
        if random.random() <= self.epsilon:
            return random.randint(0, self.action_dim - 1)
        else:
            return np.argmax(Q_value)

    def action(self, state):
        return np.argmax(self.Q_value.eval(feed_dict={
            self.state_input: [state]
        })[0])
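With the hyperparameters above, each call to egreedy_action shrinks epsilon by (INITIAL_EPSILON - FINAL_EPSILON) / 10000 = 4.9e-5, so exploration anneals from 0.5 down to 0.01 over roughly 10,000 action selections and then stays at FINAL_EPSILON.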

The complete agent code:

DQN.py

    import tensorflow as tf
    import numpy as np
    import gym
    import time
    import random
    from collections import deque

    ##################### hyper parameters ####################

    # Hyper Parameters for DQN
    GAMMA = 0.9            # discount factor for target Q
    INITIAL_EPSILON = 0.5  # starting value of epsilon
    FINAL_EPSILON = 0.01   # final value of epsilon
    REPLAY_SIZE = 10000    # experience replay buffer size
    BATCH_SIZE = 32        # size of minibatch

    ############################### DQN ####################################

    class DQN():
        # DQN Agent
        def __init__(self, env):
            # init experience replay
            self.replay_buffer = deque()
            # init some parameters
            self.time_step = 0
            self.epsilon = INITIAL_EPSILON
            self.state_dim = env.observation_space.shape[0]
            self.action_dim = env.action_space.n

            self.create_Q_network()
            self.create_training_method()

            # Init session
            self.session = tf.InteractiveSession()
            self.session.run(tf.global_variables_initializer())

        def create_Q_network(self):
            # network weights
            W1 = self.weight_variable([self.state_dim, 20])
            b1 = self.bias_variable([20])
            W2 = self.weight_variable([20, self.action_dim])
            b2 = self.bias_variable([self.action_dim])
            # input layer
            self.state_input = tf.placeholder("float", [None, self.state_dim])
            # hidden layers
            h_layer = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
            # Q Value layer
            self.Q_value = tf.matmul(h_layer, W2) + b2

        def create_training_method(self):
            self.action_input = tf.placeholder("float", [None, self.action_dim])  # one-hot representation
            self.y_input = tf.placeholder("float", [None])
            Q_action = tf.reduce_sum(tf.multiply(self.Q_value, self.action_input), reduction_indices=1)
            self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
            self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)

        def perceive(self, state, action, reward, next_state, done):
            one_hot_action = np.zeros(self.action_dim)
            one_hot_action[action] = 1
            self.replay_buffer.append((state, one_hot_action, reward, next_state, done))
            if len(self.replay_buffer) > REPLAY_SIZE:
                self.replay_buffer.popleft()

            if len(self.replay_buffer) > BATCH_SIZE:
                self.train_Q_network()

        def train_Q_network(self):
            self.time_step += 1
            # Step 1: obtain random minibatch from replay memory
            minibatch = random.sample(self.replay_buffer, BATCH_SIZE)
            state_batch = [data[0] for data in minibatch]
            action_batch = [data[1] for data in minibatch]
            reward_batch = [data[2] for data in minibatch]
            next_state_batch = [data[3] for data in minibatch]

            # Step 2: calculate y
            y_batch = []
            Q_value_batch = self.Q_value.eval(feed_dict={self.state_input: next_state_batch})
            for i in range(0, BATCH_SIZE):
                done = minibatch[i][4]
                if done:
                    y_batch.append(reward_batch[i])
                else:
                    y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))

            self.optimizer.run(feed_dict={
                self.y_input: y_batch,
                self.action_input: action_batch,
                self.state_input: state_batch
            })

        def egreedy_action(self, state):
            Q_value = self.Q_value.eval(feed_dict={
                self.state_input: [state]
            })[0]
            # anneal epsilon linearly, but never let it drop below FINAL_EPSILON
            self.epsilon = max(FINAL_EPSILON,
                               self.epsilon - (INITIAL_EPSILON - FINAL_EPSILON) / 10000)
            if random.random() <= self.epsilon:
                return random.randint(0, self.action_dim - 1)
            else:
                return np.argmax(Q_value)

        def action(self, state):
            return np.argmax(self.Q_value.eval(feed_dict={
                self.state_input: [state]
            })[0])

        def weight_variable(self, shape):
            initial = tf.truncated_normal(shape)
            return tf.Variable(initial)

        def bias_variable(self, shape):
            initial = tf.constant(0.01, shape=shape)
            return tf.Variable(initial)

Training the agent:

train.py

    from DQN import DQN
    import gym
    import numpy as np
    import time

    ENV_NAME = 'CartPole-v1'
    EPISODE = 3000  # Episode limitation
    STEP = 300      # Step limitation in an episode
    TEST = 10       # number of test episodes run every 100 training episodes

    def main():
        # initialize OpenAI Gym env and dqn agent
        env = gym.make(ENV_NAME)
        agent = DQN(env)

        for episode in range(EPISODE):
            # initialize task
            state = env.reset()
            # Train
            ep_reward = 0
            for step in range(STEP):
                action = agent.egreedy_action(state)  # e-greedy action for train
                next_state, reward, done, _ = env.step(action)
                # Define reward for agent
                reward = -10 if done else 1
                ep_reward += reward
                agent.perceive(state, action, reward, next_state, done)
                state = next_state
                if done:
                    # print('episode complete, reward: ', ep_reward)
                    break
            # Test every 100 episodes
            if episode % 100 == 0:
                total_reward = 0
                for i in range(TEST):
                    state = env.reset()
                    for j in range(STEP):
                        # env.render()
                        action = agent.action(state)  # direct action for test
                        state, reward, done, _ = env.step(action)
                        total_reward += reward
                        if done:
                            break
                ave_reward = total_reward / TEST
                print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)

    if __name__ == '__main__':
        main()
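Note that train.py is written against the classic gym interface (roughly gym < 0.26), where env.reset() returns just the observation and env.step() returns a 4-tuple. On newer gym / gymnasium releases those signatures changed; a rough sketch of the adapted interaction loop, assuming gym >= 0.26, would be:

    state, _ = env.reset()                       # reset() now returns (observation, info)
    for step in range(STEP):
        action = agent.egreedy_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)  # step() returns a 5-tuple
        done = terminated or truncated           # merge the two termination flags
        agent.perceive(state, action, -10 if done else 1, next_state, done)
        state = next_state
        if done:
            break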

References:

https://www.cnblogs.com/pinard/p/9714655.html

https://github.com/ljpzzz/machinelearning/blob/master/reinforcement-learning/dqn.py
