Reinforcement Learning: DDPG, a TensorFlow Implementation

Full code: https://github.com/zle1992/Reinforcement_Learning_Game

Paper: "Continuous control with deep reinforcement learning" https://arxiv.org/pdf/1509.02971.pdf

Deep_Deterministic_Policy_Gradient

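How DDPG differs from AC (Actor-Critic):

AC:

    Actor: updates its parameters using td_error, which comes from the Critic

    Critic: updates its parameters via the Bellman equation for the state-value function V(s)

DDPG:

    Actor: maximizes q and outputs the action

    Critic: updates its parameters via the Bellman equation for the action-value function Q(s,a) and outputs the q value

DDPG can only output continuous actions.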
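As a small illustration of what a continuous action output looks like: the actor in game.py squashes its last layer with tanh and multiplies the result by a_bound, so every action lands inside [-a_bound, a_bound]. The sketch below reproduces only that scaling step in plain Python. The helper name `scale_action` and the sample pre-activation values are hypothetical, and the bound of 2.0 is assumed from the clip range used in game.py.

```python
import math


def scale_action(pre_activation, a_bound):
    """Map an unbounded network output to the range [-a_bound, a_bound]."""
    return math.tanh(pre_activation) * a_bound


# Assumed bound: game.py clips Pendulum actions to [-2, 2]
a_bound = 2.0

# Hypothetical pre-activation values from the actor's last layer
actions = [scale_action(x, a_bound) for x in (-10.0, -0.5, 0.0, 0.5, 10.0)]
print([round(a, 3) for a in actions])
```

However large the pre-activation gets, the action saturates at the bound instead of exceeding it.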

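Logic walkthrough:

1. DDPG is an Actor-Critic (AC) model; its inputs are (S, R, S_, A)

2. Actor

input: S

output: a

loss: -q (i.e. maximize q)

q comes from the Critic

3. Critic

input: S, A

output: q

loss: (R + GAMMA * q_ - q)^2, the squared TD error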
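To make the critic loss concrete, here is a minimal sketch that computes the mean squared TD error for a tiny batch in plain Python. It treats q_ as a given number; the helper name `critic_loss` and all batch values are hypothetical, and GAMMA = 0.9 is taken from the reward_decay passed in game.py.

```python
GAMMA = 0.9  # discount factor (reward_decay in game.py)


def critic_loss(rewards, q_next, q_pred, gamma=GAMMA):
    """Mean squared TD error over a batch of equal-length float lists."""
    td_errors = [r + gamma * qn - q for r, qn, q in zip(rewards, q_next, q_pred)]
    return sum(e * e for e in td_errors) / len(td_errors)


# Hypothetical batch of two transitions
rewards = [1.0, 0.0]
q_next = [2.0, 1.0]  # q_: value estimate for (S_, a_)
q_pred = [2.5, 1.0]  # q: value estimate for (S, A)

loss = critic_loss(rewards, q_next, q_pred)
print(loss)
```

The two TD errors here are 0.3 and -0.1, so the loss is the mean of their squares.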

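This raises a question: how do we get q_? The Critic network can take (S_, a_) as input and output q_. But we should not use the same network for both q and q_, because the target would then move with every update. Instead we use a time-lagged copy, Critic2, which is not trainable.

Critic2 needs a_. How do we get it? The Actor network can take S_ as input and output a_. For the same reason, we use a non-trainable Actor2 to produce a_.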
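The non-trainable copies still have to follow the trainable networks somehow. The code further down does this with a soft replacement, `(1 - TAU) * t + TAU * e`, applied to every target parameter. A minimal plain-Python sketch of that update follows; the helper name `soft_update` and the weight values are hypothetical, and TAU = 0.01 is the value used in game.py.

```python
TAU = 0.01  # soft replacement rate, the value passed in game.py


def soft_update(target_params, eval_params, tau=TAU):
    """Move each target parameter a fraction tau toward its eval counterpart."""
    return [(1 - tau) * t + tau * e for t, e in zip(target_params, eval_params)]


target = [0.0, 0.0]
evaluated = [1.0, -2.0]  # hypothetical weights of the trainable network

# With the eval weights held fixed, the target drifts toward them slowly
for _ in range(100):
    target = soft_update(target, evaluated)
print(target)
```

After n updates against fixed eval weights, the target has covered a fraction 1 - (1 - TAU)^n of the gap, which is what makes it a slowly moving, time-lagged copy.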

Flow

a = actor(s, train)

a_ = actor(s_, not_train)

q = critic(s, a, train)

q_ = critic(s_, a_, not_train)

a_loss = -q  (i.e. maximize q)

c_loss = (R + GAMMA * q_ - q)^2
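The actor step in this flow works because q is differentiable with respect to the action, so the gradient of q can be pushed back through the actor by the chain rule. The toy below shows this with a one-parameter actor and a fixed, hand-written critic. Everything in it (the linear actor, the quadratic critic, the states and the learning rate) is a hypothetical stand-in, not the networks from this post.

```python
def actor(theta, s):
    """Deterministic policy: a = theta * s."""
    return theta * s


def critic(s, a):
    """Hypothetical fixed critic whose best action is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def actor_gradient(theta, s):
    """dq/dtheta = (dq/da) * (da/dtheta), by the chain rule."""
    a = actor(theta, s)
    dq_da = -2.0 * (a - 2.0 * s)
    da_dtheta = s
    return dq_da * da_dtheta


theta, lr = 0.0, 0.1
states = [1.0, -1.0, 0.5]
for _ in range(200):
    grad = sum(actor_gradient(theta, s) for s in states) / len(states)
    theta += lr * grad  # gradient ascent on q, i.e. descent on a_loss = -q
print(round(theta, 4))
```

In the TensorFlow code the same effect comes from minimizing -mean(q) with var_list restricted to the actor's eval parameters, so only the actor moves in that step.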

Code:

DDPG.py

 import os

 import numpy as np

 import tensorflow as tf

 from abc import ABCMeta, abstractmethod

 np.random.seed(1)

 tf.set_random_seed(1)

 import logging  # import the logging module

 logging.basicConfig(level=logging.DEBUG,

                     format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')  # configure the log output format and destination

 # The log level is set to DEBUG, so all of the messages below are printed to the console

 tfconfig = tf.ConfigProto()

 tfconfig.gpu_options.allow_growth = True

 session = tf.Session(config=tfconfig)

 class DDPG(object):

     __metaclass__ = ABCMeta

     """docstring for ACNetwork"""

     def __init__(self,

             n_actions,

             n_features,

             reward_decay,

             lr_a,

             lr_c,

             memory_size,

             output_graph,

             log_dir,

             model_dir,

             TAU,

             a_bound,

             ):

         super(DDPG, self).__init__()

         self.n_actions = n_actions

         self.n_features = n_features

         self.gamma=reward_decay

         self.memory_size =memory_size

         self.output_graph=output_graph

         self.lr_a =lr_a

         self.lr_c = lr_c

         self.log_dir = log_dir

         self.model_dir = model_dir

         # total learning step

         self.learn_step_counter = 0

         self.TAU = TAU     # soft replacement

         self.a_bound = a_bound

         self.s = tf.placeholder(tf.float32,[None]+self.n_features,name='s')

         self.s_next = tf.placeholder(tf.float32,[None]+self.n_features,name='s_next')

         self.r = tf.placeholder(tf.float32,[None,],name='r')

         #self.a = tf.placeholder(tf.int32,[None,1],name='a')

         with tf.variable_scope('Actor'):

             self.a = self._build_a_net(self.s, scope='eval', trainable=True)

             a_ = self._build_a_net(self.s_next, scope='target', trainable=False)

         with tf.variable_scope('Critic'):

             q  = self._build_c_net(self.s, self.a,scope='eval', trainable=True)

             q_  = self._build_c_net(self.s_next, a_,scope='target', trainable=False)

         # networks parameters

         self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')

         self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')

         self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')

         self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')

         with tf.variable_scope('train_op_actor'):

             self.loss_actor = -tf.reduce_mean(q)

             self.train_op_actor = tf.train.AdamOptimizer(self.lr_a).minimize(self.loss_actor,var_list=self.ae_params)  

         with tf.variable_scope('train_op_critic'):

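             # r has shape [batch] and q_ has shape [batch, 1]; reshape r so the sum does not broadcast to [batch, batch]

             q_target = tf.reshape(self.r, [-1, 1]) + self.gamma * q_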

             self.loss_critic =tf.losses.mean_squared_error(labels=q_target, predictions=q)

             self.train_op_critic = tf.train.AdamOptimizer(self.lr_c).minimize(self.loss_critic,var_list=self.ce_params)

         # target net soft replacement

         self.soft_replace = [tf.assign(t, (1 - self.TAU) * t + self.TAU * e)

                                for t, e in zip(self.at_params + self.ct_params, self.ae_params + self.ce_params)]

         self.sess = tf.Session()

         if self.output_graph:

             tf.summary.FileWriter(self.log_dir,self.sess.graph)

         self.sess.run(tf.global_variables_initializer())

         self.cost_his =[0]

         self.cost =0 

         self.saver = tf.train.Saver()

         if not os.path.exists(self.model_dir):

             os.makedirs(self.model_dir)  # model_dir is nested, so create parent directories too

         checkpoint = tf.train.get_checkpoint_state(self.model_dir)

         if checkpoint and checkpoint.model_checkpoint_path:

             self.saver.restore(self.sess, checkpoint.model_checkpoint_path)

             print ("Loading Successfully")

             self.learn_step_counter = int(checkpoint.model_checkpoint_path.split('-')[-1]) + 1

     @abstractmethod

     def _build_a_net(self,x,scope,trainable):

         raise NotImplementedError

     @abstractmethod

     def _build_c_net(self,s,a,scope,trainable):

         raise NotImplementedError

     def learn(self,data):

         # soft target replacement

         self.sess.run(self.soft_replace)

         batch_memory_s = data['s']

         batch_memory_a =  data['a']

         batch_memory_r = data['r']

         batch_memory_s_ = data['s_']

         _, cost = self.sess.run(

             [self.train_op_actor, self.loss_actor],

             feed_dict={

                 self.s: batch_memory_s,

             })

         _, cost = self.sess.run(

             [self.train_op_critic, self.loss_critic],

             feed_dict={

                 self.s: batch_memory_s,

                 self.a: batch_memory_a,

                 self.r: batch_memory_r,

                 self.s_next: batch_memory_s_,

             })

         self.cost_his.append(cost)

         self.cost =cost

         self.learn_step_counter += 1

         # save network every 10000 iterations

         if self.learn_step_counter % 10000 == 0:

             self.saver.save(self.sess,self.model_dir,global_step=self.learn_step_counter)

     def choose_action(self,s): 

         return self.sess.run(self.a, {self.s: s[np.newaxis,:]})[0]

         # s = s[np.newaxis,:]

         # probs = self.sess.run(self.acts_prob,feed_dict={self.s:s})

         # return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())
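One shape detail in the critic target is worth checking. When a reward vector of shape [batch] is added to a critic output of shape [batch, 1], broadcasting produces a [batch, batch] matrix rather than one target per sample, and a mean squared error over it still runs without complaint. The NumPy sketch below shows the same shape arithmetic with hypothetical values; TensorFlow's `+` follows the same broadcasting rules.

```python
import numpy as np

gamma = 0.9
r = np.array([1.0, 0.0, -1.0])            # shape (3,), like a [None] reward placeholder
q_next = np.array([[2.0], [1.0], [0.5]])  # shape (3, 1), like a dense layer with units=1

# Adding the rank-1 reward directly broadcasts to a 3x3 matrix
broadcast_target = r + gamma * q_next

# Reshaping the reward to a column keeps one target per sample
column_target = r.reshape(-1, 1) + gamma * q_next

print(broadcast_target.shape, column_target.shape)
```

Reshaping the reward to a column (or declaring it with shape [None, 1] and feeding it that way) keeps the target aligned with q row by row.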

game.py

 import sys

 import gym

 import numpy as np

 import tensorflow as tf

 sys.path.append('./')

 sys.path.append('model')

 from util import Memory ,StateProcessor

 from DDPG import DDPG

 from ACNetwork import ACNetwork

 np.random.seed(1)

 tf.set_random_seed(1)

 import logging  # import the logging module

 logging.basicConfig(level=logging.DEBUG,

                     format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')  # configure the log output format and destination

 # The log level is set to DEBUG, so all of the messages below are printed to the console

 import os

 os.environ["CUDA_VISIBLE_DEVICES"] = ""

 tfconfig = tf.ConfigProto()

 tfconfig.gpu_options.allow_growth = True

 session = tf.Session(config=tfconfig)

 class DDPG4Pendulum(DDPG):

     """docstring for ClassName"""

     def __init__(self, **kwargs):

         super(DDPG4Pendulum, self).__init__(**kwargs)

     def _build_a_net(self,s,scope,trainable):

         w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)

         #w_initializer, b_initializer = None,None

         with tf.variable_scope(scope):

             e1 = tf.layers.dense(inputs=s,

                     units=30,

                     bias_initializer = b_initializer,

                     kernel_initializer=w_initializer,

                     activation = tf.nn.relu,

                     trainable=trainable)

             a = tf.layers.dense(inputs=e1,

                     units=self.n_actions,

                     bias_initializer = b_initializer,

                     kernel_initializer=w_initializer,

                     activation = tf.nn.tanh,

                     trainable=trainable) 

         return tf.multiply(a, self.a_bound, name='scaled_a')  

     def _build_c_net(self,s,a,scope,trainable):

         w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)

         with tf.variable_scope(scope):

             n_l1 = 30

             w1_s = tf.get_variable('w1_s',self.n_features+[n_l1],trainable=trainable)

             w1_a = tf.get_variable('w1_a',[self.n_actions,n_l1],trainable=trainable)

             b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)

             net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)

             q = tf.layers.dense(inputs=net,

                     units=1,

                     bias_initializer = b_initializer,

                     kernel_initializer=w_initializer,

                     activation =None,

                     trainable=trainable) 

         return q   

 batch_size = 32

 memory_size  =10000

 env = gym.make('Pendulum-v0') # continuous action space

 n_features= [env.observation_space.shape[0]]

 n_actions= env.action_space.shape[0]

 a_bound = env.action_space.high

 env = env.unwrapped

 MAX_EP_STEPS =200

 def run():

     RL = DDPG4Pendulum(

         n_actions=n_actions,

         n_features=n_features,

         reward_decay=0.9,

         lr_a = 0.001,

         lr_c = 0.002,

         memory_size=memory_size,

         TAU = 0.01,

         output_graph=False,

         log_dir = 'Pendulum/log/DDPG4Pendulum/',

         a_bound =a_bound,

         model_dir = 'Pendulum/model_dir/DDPG4Pendulum/'

         )

     memory = Memory(n_actions,n_features,memory_size=memory_size)

     var = 3  # control exploration

     step = 0

     for episode in range(2000):

         # initial observation

         observation = env.reset()

         ep_r = 0

         for j in range(MAX_EP_STEPS):

             # RL choose action based on observation

             action = RL.choose_action(observation)

             action = np.clip(np.random.normal(action, var), -2, 2)    # add randomness to action selection for exploration

             # RL takes the action and gets the next observation and reward

             observation_, reward, done, info=env.step(action) # apply the noisy action

             #print('step:%d---episode:%d----reward:%f---action:%f'%(step,episode,reward,action))

             memory.store_transition(observation, action, reward/10, observation_)

             if step > memory_size:

                 #env.render()

                 var *= .9995    # decay the action randomness

                 data = memory.sample(batch_size)

                 RL.learn(data)

             # swap observation

             observation = observation_

             ep_r += reward

             # render once more than 200 episodes have been played

             if(episode>200):

                 env.render()  # render on the screen

             if j == MAX_EP_STEPS-1:

                 print('step: ',step,

                     'episode: ', episode,

                       'ep_r: ', round(ep_r, 2),

                       'var:',var,

                       #loss: ',RL.cost

                       )

                 break

             step += 1

     # end of game

     print('game over')

     env.close()  # gym environments expose close(), not destroy()

 def main():

     run()

 if __name__ == '__main__':

     main()

     #run2()
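The exploration schedule in run() is easy to reason about in closed form: var starts at 3 and is multiplied by .9995 on every learning step, so after n learning steps the noise standard deviation is 3 * 0.9995^n. A minimal sketch, where the helper name `decayed_var` and the sampled step counts are hypothetical:

```python
def decayed_var(var0, decay, n_updates):
    """Noise standard deviation after n_updates multiplicative decays."""
    return var0 * decay ** n_updates


# Values from run(): var starts at 3 and decays by .9995 per learning step
for n in (0, 1000, 5000, 10000):
    print(n, round(decayed_var(3.0, 0.9995, n), 4))
```

Since learning only starts once step exceeds memory_size, the noise stays at its initial value while the replay memory fills and shrinks afterwards.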
