Full code: https://github.com/zle1992/Reinforcement_Learning_Game

Paper: "Continuous control with deep reinforcement learning", https://arxiv.org/pdf/1509.02971.pdf

Deep_Deterministic_Policy_Gradient

Differences between DDPG and AC (Actor-Critic):

AC:

  Actor: updates its parameters with the td_error, which comes from the Critic

  Critic: computes its gradient from the Bellman equation of the state-value function V(s)

DDPG:

  Actor: maximizes q and outputs the action

  Critic: computes its gradient from the Bellman equation of the action-value function Q(s, a), and outputs the q value

DDPG only handles continuous action outputs.
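
For intuition, the deterministic actor head typically squashes its output with tanh and rescales it to the environment's action range, mirroring the _build_a_net defined later in game.py. A minimal sketch (the hidden-layer size of 30 is just illustrative):

import tensorflow as tf

def actor_head(s, n_actions, a_bound, trainable=True):
    # one hidden layer feeding a tanh output in (-1, 1)
    h = tf.layers.dense(s, 30, activation=tf.nn.relu, trainable=trainable)
    a = tf.layers.dense(h, n_actions, activation=tf.nn.tanh, trainable=trainable)
    # scale by a_bound so the action covers the environment's range
    return tf.multiply(a, a_bound, name='scaled_a')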

Logic walkthrough:

1. DDPG is an AC-style model; its inputs include (S, R, S_, A).

2. Actor

input: S

output: a

loss: maximize q

q comes from the Critic

3. Critic

input: S, A

output: q

loss: R + GAMMA * q_ - q

So how do we get q_? ----> The Critic network could take (S_, a_) and produce q_, but we cannot use the same trainable network for this, so, by offsetting in time, we use Critic2 (non-trainable).

Critic2 in turn needs a_. How do we get that? -----> The Actor network can take S_ and output a_, so likewise we use Actor2 (non-trainable) to get a_.
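
These non-trainable copies are the target networks: the optimizer never touches their weights; they just slowly track the trainable (eval) networks through a soft update controlled by TAU (0.01 below). A minimal sketch of that update rule, which the DDPG class builds as its soft_replace op:

import tensorflow as tf

def soft_update_ops(target_vars, eval_vars, tau=0.01):
    # target <- (1 - tau) * target + tau * eval, run once per learning step
    return [tf.assign(t, (1.0 - tau) * t + tau * e)
            for t, e in zip(target_vars, eval_vars)]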

Flow

a  = actor(s, train)

a_ = actor(s_, not_train)

q  = critic(s, a, train)

q_ = critic(s_, a_, not_train)

a_loss = max(q)

c_loss = R + GAMMA * q_ - q
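
Expressed as TensorFlow ops, the two losses look as follows. This is only a sketch: q stands for the eval critic's Q(s, actor(s)), q_ for the target critic's Q(s_, actor_target(s_)), and r, gamma for the reward placeholder and discount; the DDPG class below builds the same graph.

import tensorflow as tf

def ddpg_losses(q, q_, r, gamma):
    a_loss = -tf.reduce_mean(q)          # minimizing -Q is maximizing Q
    q_target = r + gamma * q_            # Bellman target from the target networks
    c_loss = tf.losses.mean_squared_error(labels=q_target, predictions=q)
    return a_loss, c_loss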

Code:

DDPG.py

import os
import numpy as np
import tensorflow as tf
from abc import ABCMeta, abstractmethod

np.random.seed(1)
tf.set_random_seed(1)

import logging
# logging.basicConfig sets the log format and output; with the level at DEBUG,
# everything below is printed to the console
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')

tfconfig = tf.ConfigProto()
tfconfig.gpu_options.allow_growth = True
session = tf.Session(config=tfconfig)


class DDPG(object):
    __metaclass__ = ABCMeta
    """docstring for ACNetwork"""

    def __init__(self,
                 n_actions,
                 n_features,
                 reward_decay,
                 lr_a,
                 lr_c,
                 memory_size,
                 output_graph,
                 log_dir,
                 model_dir,
                 TAU,
                 a_bound,
                 ):
        super(DDPG, self).__init__()

        self.n_actions = n_actions
        self.n_features = n_features
        self.gamma = reward_decay
        self.memory_size = memory_size
        self.output_graph = output_graph
        self.lr_a = lr_a
        self.lr_c = lr_c
        self.log_dir = log_dir
        self.model_dir = model_dir
        # total learning step
        self.learn_step_counter = 0
        self.TAU = TAU  # soft replacement rate
        self.a_bound = a_bound

        self.s = tf.placeholder(tf.float32, [None] + self.n_features, name='s')
        self.s_next = tf.placeholder(tf.float32, [None] + self.n_features, name='s_next')
        self.r = tf.placeholder(tf.float32, [None, ], name='r')
        # self.a = tf.placeholder(tf.int32, [None, 1], name='a')

        with tf.variable_scope('Actor'):
            self.a = self._build_a_net(self.s, scope='eval', trainable=True)
            a_ = self._build_a_net(self.s_next, scope='target', trainable=False)

        with tf.variable_scope('Critic'):
            q = self._build_c_net(self.s, self.a, scope='eval', trainable=True)
            q_ = self._build_c_net(self.s_next, a_, scope='target', trainable=False)

        # network parameters
        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')

        with tf.variable_scope('train_op_actor'):
            # maximize Q(s, actor(s)) by minimizing its negative mean
            self.loss_actor = -tf.reduce_mean(q)
            self.train_op_actor = tf.train.AdamOptimizer(self.lr_a).minimize(
                self.loss_actor, var_list=self.ae_params)

        with tf.variable_scope('train_op_critic'):
            # Bellman target built from the target networks
            q_target = self.r + self.gamma * q_
            self.loss_critic = tf.losses.mean_squared_error(labels=q_target, predictions=q)
            self.train_op_critic = tf.train.AdamOptimizer(self.lr_c).minimize(
                self.loss_critic, var_list=self.ce_params)

        # target net replacement: target <- (1 - TAU) * target + TAU * eval
        self.soft_replace = [tf.assign(t, (1 - self.TAU) * t + self.TAU * e)
                             for t, e in zip(self.at_params + self.ct_params,
                                             self.ae_params + self.ce_params)]

        self.sess = tf.Session()
        if self.output_graph:
            tf.summary.FileWriter(self.log_dir, self.sess.graph)
        self.sess.run(tf.global_variables_initializer())

        self.cost_his = [0]
        self.cost = 0

        self.saver = tf.train.Saver()
        if not os.path.exists(self.model_dir):
            os.mkdir(self.model_dir)

        checkpoint = tf.train.get_checkpoint_state(self.model_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            self.saver.restore(self.sess, checkpoint.model_checkpoint_path)
            print("Loading Successfully")
            self.learn_step_counter = int(checkpoint.model_checkpoint_path.split('-')[-1]) + 1

    @abstractmethod
    def _build_a_net(self, x, scope, trainable):
        raise NotImplementedError

    @abstractmethod
    def _build_c_net(self, x, a, scope, trainable):
        raise NotImplementedError

    def learn(self, data):
        # soft target replacement
        self.sess.run(self.soft_replace)

        batch_memory_s = data['s']
        batch_memory_a = data['a']
        batch_memory_r = data['r']
        batch_memory_s_ = data['s_']

        _, cost = self.sess.run(
            [self.train_op_actor, self.loss_actor],
            feed_dict={
                self.s: batch_memory_s,
            })

        _, cost = self.sess.run(
            [self.train_op_critic, self.loss_critic],
            feed_dict={
                self.s: batch_memory_s,
                self.a: batch_memory_a,
                self.r: batch_memory_r,
                self.s_next: batch_memory_s_,
            })

        self.cost_his.append(cost)
        self.cost = cost
        self.learn_step_counter += 1

        # save the network every 10000 learning steps
        if self.learn_step_counter % 10000 == 0:
            self.saver.save(self.sess, self.model_dir, global_step=self.learn_step_counter)

    def choose_action(self, s):
        return self.sess.run(self.a, {self.s: s[np.newaxis, :]})[0]
        # s = s[np.newaxis, :]
        # probs = self.sess.run(self.acts_prob, feed_dict={self.s: s})
        # return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())

game.py

import sys
import os
import gym
import numpy as np
import tensorflow as tf

sys.path.append('./')
sys.path.append('model')

from util import Memory, StateProcessor
from DDPG import DDPG
from ACNetwork import ACNetwork

np.random.seed(1)
tf.set_random_seed(1)

import logging
# logging.basicConfig sets the log format and output; with the level at DEBUG,
# everything below is printed to the console
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')

os.environ["CUDA_VISIBLE_DEVICES"] = ""
tfconfig = tf.ConfigProto()
tfconfig.gpu_options.allow_growth = True
session = tf.Session(config=tfconfig)


class DDPG4Pendulum(DDPG):
    """docstring for ClassName"""

    def __init__(self, **kwargs):
        super(DDPG4Pendulum, self).__init__(**kwargs)

    def _build_a_net(self, s, scope, trainable):
        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
        # w_initializer, b_initializer = None, None
        with tf.variable_scope(scope):
            e1 = tf.layers.dense(inputs=s,
                                 units=30,
                                 bias_initializer=b_initializer,
                                 kernel_initializer=w_initializer,
                                 activation=tf.nn.relu,
                                 trainable=trainable)
            a = tf.layers.dense(inputs=e1,
                                units=self.n_actions,
                                bias_initializer=b_initializer,
                                kernel_initializer=w_initializer,
                                activation=tf.nn.tanh,
                                trainable=trainable)
        # tanh output in (-1, 1), scaled to the action bound
        return tf.multiply(a, self.a_bound, name='scaled_a')

    def _build_c_net(self, s, a, scope, trainable):
        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
        with tf.variable_scope(scope):
            n_l1 = 30
            # first layer mixes the state and action inputs
            w1_s = tf.get_variable('w1_s', self.n_features + [n_l1], trainable=trainable)
            w1_a = tf.get_variable('w1_a', [self.n_actions, n_l1], trainable=trainable)
            b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)
            net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)
            q = tf.layers.dense(inputs=net,
                                units=1,
                                bias_initializer=b_initializer,
                                kernel_initializer=w_initializer,
                                activation=None,
                                trainable=trainable)
        return q


batch_size = 32
memory_size = 10000
env = gym.make('Pendulum-v0')  # continuous action space
n_features = [env.observation_space.shape[0]]
n_actions = env.action_space.shape[0]
a_bound = env.action_space.high
env = env.unwrapped
MAX_EP_STEPS = 200


def run():
    RL = DDPG4Pendulum(
        n_actions=n_actions,
        n_features=n_features,
        reward_decay=0.9,
        lr_a=0.001,
        lr_c=0.002,
        memory_size=memory_size,
        TAU=0.01,
        output_graph=False,
        log_dir='Pendulum/log/DDPG4Pendulum/',
        a_bound=a_bound,
        model_dir='Pendulum/model_dir/DDPG4Pendulum/'
        )

    memory = Memory(n_actions, n_features, memory_size=memory_size)

    var = 3  # control exploration
    step = 0
    for episode in range(2000):
        # initial observation
        observation = env.reset()
        ep_r = 0
        for j in range(MAX_EP_STEPS):
            # RL choose action based on observation
            action = RL.choose_action(observation)
            action = np.clip(np.random.normal(action, var), -2, 2)  # add randomness to action selection for exploration

            # RL take action and get next observation and reward
            observation_, reward, done, info = env.step(action)
            # print('step:%d---episode:%d----reward:%f---action:%f' % (step, episode, reward, action))
            memory.store_transition(observation, action, reward / 10, observation_)

            if step > memory_size:
                # env.render()
                var *= .9995  # decay the action randomness
                data = memory.sample(batch_size)
                RL.learn(data)

            # swap observation
            observation = observation_
            ep_r += reward

            if episode > 200:
                env.render()  # render on the screen

            # break the loop at the end of this episode
            if j == MAX_EP_STEPS - 1:
                print('step: ', step,
                      'episode: ', episode,
                      'ep_r: ', round(ep_r, 2),
                      'var:', var,
                      # 'loss: ', RL.cost
                      )
                break
            step += 1

    # end of game
    print('game over')
    env.close()


def main():
    run()


if __name__ == '__main__':
    main()
    # run2()
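
After training, the learned policy can be checked without exploration noise by calling choose_action directly. A small evaluation loop as a sketch, assuming the same RL and env objects created in run() (evaluate is not part of the original script):

def evaluate(RL, env, episodes=10, max_steps=200):
    for ep in range(episodes):
        observation = env.reset()
        ep_r = 0.0
        for _ in range(max_steps):
            action = RL.choose_action(observation)  # deterministic action, no noise added
            observation, reward, done, info = env.step(action)
            ep_r += reward
            env.render()
        print('eval episode %d, return %.2f' % (ep, ep_r))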
