Deep Q-Network Study Notes (3): Improvement ①: Nature DQN

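When one network produces both the current Q value and the next-state Q value used in the target, the target shifts with every weight update. The network is learning towards a value that it is changing at the same time, which makes training unstable.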

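Nature DQN therefore splits the neural network into two: a Q network, which is updated at every training step, and a target network, which is used only to compute the target Q value. Every fixed number of training steps, the latest Q-network weights are copied into the target network.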
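In symbols, the two-network update can be sketched as follows. The notation is an assumption introduced here and does not appear in the code: θ stands for the weights of the Q network (eval_net), θ⁻ for the weights of the target network (target_net), C for the sync interval (replace_target_stepper), and (s_j, a_j, r_j, s'_j) for one of the N transitions in a sampled minibatch.

```latex
\begin{aligned}
y_j &= r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-) \\
L(\theta) &= \frac{1}{N} \sum_{j=1}^{N} \bigl( y_j - Q(s_j, a_j; \theta) \bigr)^2 \\
\theta^- &\leftarrow \theta \quad \text{every } C \text{ training steps}
\end{aligned}
```

Only θ receives gradient updates. θ⁻ stays fixed between syncs, so the target y_j no longer moves with every training step.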

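Concretely, only the following changes to the code from the previous post are needed:

1. Add a target network.

2. During experience replay, take the max Q value from the target network instead of the Q network.

3. After a fixed number of training steps, copy the Q-network weights to the target network.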
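The three changes can be sketched without TensorFlow. The block below is a minimal illustration and not the code of this post: a linear table over one-hot states stands in for the network, and the names TwoNetworkQ, q_weights, target_weights and sync_every are hypothetical (sync_every plays the role of replace_target_stepper).

```python
import numpy as np

STATE_NUM, ACTION_NUM = 6, 6


class TwoNetworkQ:
    """Hypothetical stand-in: a linear Q function with an online and a target copy."""

    def __init__(self, gamma=0.9, learning_rate=0.01, sync_every=300, seed=0):
        rng = np.random.default_rng(seed)
        self.gamma = gamma
        self.learning_rate = learning_rate
        self.sync_every = sync_every
        self.learn_step = 0
        # Change 1: keep two parameter sets, the online Q network and the target network.
        self.q_weights = rng.normal(0.0, 0.3, size=(STATE_NUM, ACTION_NUM))
        self.target_weights = self.q_weights.copy()

    def train_step(self, states, actions, rewards, next_states):
        # Change 3: every sync_every training steps, copy the online weights to the target.
        if self.learn_step % self.sync_every == 0:
            self.target_weights = self.q_weights.copy()
        # Change 2: the max over next-state Q values comes from the target weights.
        next_q = next_states @ self.target_weights
        targets = rewards + self.gamma * next_q.max(axis=1)
        # Squared-error step on the chosen actions; only the online weights change.
        predicted = (states @ self.q_weights)[np.arange(len(actions)), actions]
        td_error = targets - predicted
        grad = np.zeros_like(self.q_weights)
        np.add.at(grad, (states.argmax(axis=1), actions), td_error)
        # The constant factor 2 of the squared-error gradient is folded into the learning rate.
        self.q_weights = self.q_weights + self.learning_rate * grad / len(actions)
        self.learn_step += 1
        return float(np.mean(td_error ** 2))


# Usage: three one-hot transitions taken from the room example.
agent = TwoNetworkQ(sync_every=3)
states = np.eye(STATE_NUM)[[0, 1, 2]]
actions = np.array([4, 5, 3])
rewards = np.array([0.0, 100.0, 0.0])
next_states = np.eye(STATE_NUM)[[4, 5, 3]]
loss = agent.train_step(states, actions, rewards, next_states)
```

Between syncs the target weights stay frozen while the online weights move, which is the whole point of the change.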

The full source code follows:

import tensorflow as tf

import numpy as np

from collections import deque

import random

class DeepQNetwork:

    r = np.array([[-1, -1, -1, -1, 0, -1],

                  [-1, -1, -1, 0, -1, 100.0],

                  [-1, -1, -1, 0, -1, -1],

                  [-1, 0, 0, -1, 0, -1],

                  [0, -1, -1, 1, -1, 100],

                  [-1, 0, -1, -1, 0, 100],

                  ])

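    # Number of environment steps executed so far.

    step_index = 0

    # Number of states.

    STATE_NUM = 6

    # Number of actions.

    ACTION_NUM = 6

    # Number of steps to observe before training starts.

    OBSERVE = 1000.

    # Size of the minibatch sampled for each training step.

    BATCH = 20

    # Minimum value of epsilon; epsilon stops decaying once it reaches this value.

    FINAL_EPSILON = 0.0001

    # Initial value of epsilon; it is decayed gradually during training.

    INITIAL_EPSILON = 0.1

    # Total number of steps over which epsilon is decayed.

    EXPLORE = 3000000.

    # Current exploration rate (probability of taking a random action).

    epsilon = 0

    # Number of training steps performed.

    learn_step_counter = 0

    # Learning rate.

    learning_rate = 0.001

    # Discount factor γ.

    gamma = 0.9

    # Capacity of the replay memory.

    memory_size = 5000

    # Number of transitions saved so far.

    memory_counter = 0

    # Replay memory holding the transitions experienced so far.

    replay_memory_store = deque()

    # State matrix (6 x 6); each row is the one-hot encoding of one state.

    state_list = None

    # Action matrix; each row is the one-hot encoding of one action.

    action_list = None

    # State input of the q_eval network.

    q_eval_input = None

    # Action input (one-hot) of the q_eval network.

    q_action_input = None

    # Target Q value input used in the q_eval loss.

    q_eval_target = None

    # Output of the q_eval network (one Q value per action).

    q_eval_output = None

    # Index of the action with the highest Q value in the q_eval output.

    q_predict = None

    # Q value that q_eval assigns to the action actually taken.

    reward_action = None

    # Loss of the q_eval network.

    loss = None

    # Training op of the q_eval network.

    train_op = None

    # State input of the q_target network.

    q_target_input = None

    # Output of the q_target network.

    q_target_output = None

    # Number of training steps between target_net updates.

    replace_target_stepper = None

    # History of loss values.

    cost_list = None

    # Running Q values, recorded for plotting.

    q_list = None

    running_q = 0

    # TensorFlow session.

    session = None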

    def __init__(self, learning_rate=0.001, gamma=0.9, memory_size=5000, replace_target_stepper=300):

        self.learning_rate = learning_rate

        self.gamma = gamma

        self.memory_size = memory_size

        self.replace_target_stepper = replace_target_stepper

        # Initialize a 6 x 6 one-hot state matrix.

        self.state_list = np.identity(self.STATE_NUM)

        # Initialize a 6 x 6 one-hot action matrix.

        self.action_list = np.identity(self.ACTION_NUM)

        # Build the neural networks.

        self.create_network()

        # Start the TensorFlow session.

        self.session = tf.InteractiveSession()

        # Initialize all TensorFlow variables (tf.initialize_all_variables is deprecated).

        self.session.run(tf.global_variables_initializer())

        # History of loss values.

        self.cost_list = []

        # History of the running Q value.

        self.q_list = []

    def create_network(self):

        """

        Build the neural networks.

        :return:

        """

        neuro_layer_1 = 3

        w_init = tf.random_normal_initializer(0, 0.3)

        b_init = tf.constant_initializer(0.1)

        # -------------- Build the eval network; its parameters are updated at every training step -------------- #

        self.q_eval_input = tf.placeholder(shape=[None, self.STATE_NUM], dtype=tf.float32, name="q_eval_input")

        self.q_action_input = tf.placeholder(shape=[None, self.ACTION_NUM], dtype=tf.float32)

        self.q_eval_target = tf.placeholder(shape=[None], dtype=tf.float32, name="q_target")

        with tf.variable_scope("eval_net"):

            q_name = ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            with tf.variable_scope('l1'):

                w1 = tf.get_variable('w1', [self.STATE_NUM, neuro_layer_1], initializer=w_init, collections=q_name)

                b1 = tf.get_variable('b1', [1, neuro_layer_1], initializer=b_init, collections=q_name)

                l1 = tf.nn.relu(tf.matmul(self.q_eval_input, w1) + b1)

            with tf.variable_scope('l2'):

                w2 = tf.get_variable('w2', [neuro_layer_1, self.ACTION_NUM], initializer=w_init, collections=q_name)

                b2 = tf.get_variable('b2', [1, self.ACTION_NUM], initializer=b_init, collections=q_name)

                self.q_eval_output = tf.matmul(l1, w2) + b2

                self.q_predict = tf.argmax(self.q_eval_output, 1)

        with tf.variable_scope('loss'):

            # Pick out the Q value of the action that was actually taken.

            self.reward_action = tf.reduce_sum(tf.multiply(self.q_eval_output, self.q_action_input), axis=1)

            self.loss = tf.reduce_mean(tf.square((self.q_eval_target - self.reward_action)))

        with tf.variable_scope('train'):

            self.train_op = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.loss)

        # -------------- Build the target network; its parameters are only refreshed periodically from eval_net -------------- #

        self.q_target_input = tf.placeholder(shape=[None, self.STATE_NUM], dtype=tf.float32, name="q_target_input")

        with tf.variable_scope("target_net"):

            t_name = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            with tf.variable_scope('l1'):

                w1 = tf.get_variable('w1', [self.STATE_NUM, neuro_layer_1], initializer=w_init, collections=t_name)

                b1 = tf.get_variable('b1', [1, neuro_layer_1], initializer=b_init, collections=t_name)

                l1 = tf.nn.relu(tf.matmul(self.q_target_input, w1) + b1)

            with tf.variable_scope('l2'):

                w2 = tf.get_variable('w2', [neuro_layer_1, self.ACTION_NUM], initializer=w_init, collections=t_name)

                b2 = tf.get_variable('b2', [1, self.ACTION_NUM], initializer=b_init, collections=t_name)

                self.q_target_output = tf.matmul(l1, w2) + b2

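    def _replace_target_params(self):

        # Copy every eval_net parameter into target_net with tf.assign.

        # The assign ops are built once and cached; building them on every call would keep adding nodes to the graph.

        if getattr(self, '_replace_target_ops', None) is None:

            t_params = tf.get_collection('target_net_params')                       # target_net parameters

            e_params = tf.get_collection('eval_net_params')                         # eval_net parameters

            self._replace_target_ops = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.session.run(self._replace_target_ops)                                  # update target_net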

    def select_action(self, state_index):

        """

        Choose an action with the epsilon-greedy policy.

        :param state_index: index of the current state.

        :return: index of the chosen action.

        """

        current_state = self.state_list[state_index:state_index + 1]

        actions_value = self.session.run(self.q_eval_output, feed_dict={self.q_eval_input: current_state})

        action = np.argmax(actions_value)

        current_action_index = action

        # Track a running average of the max Q value for plotting.

        self.running_q = self.running_q * 0.99 + 0.01 * np.max(actions_value)

        self.q_list.append(self.running_q)

        if np.random.uniform() < self.epsilon:

            current_action_index = np.random.randint(0, self.ACTION_NUM)

        # Once training has started, decay epsilon step by step until it reaches FINAL_EPSILON.

        if self.step_index > self.OBSERVE and self.epsilon > self.FINAL_EPSILON:

            self.epsilon -= (self.INITIAL_EPSILON - self.FINAL_EPSILON) / self.EXPLORE

        return current_action_index

    def save_store(self, current_state_index, current_action_index, current_reward, next_state_index, done):

        """

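        Store a transition in the replay memory.

        :param current_state_index: index of the current state.

        :param current_action_index: index of the action taken.

        :param current_reward: reward received.

        :param next_state_index: index of the next state.

        :param done: whether the episode has ended.

        :return: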

        """

        current_state = self.state_list[current_state_index:current_state_index + 1]

        current_action = self.action_list[current_action_index:current_action_index + 1]

        next_state = self.state_list[next_state_index:next_state_index + 1]

        # Store the transition (current state, action taken, reward received, next state, done flag).

        self.replay_memory_store.append((

            current_state,

            current_action,

            current_reward,

            next_state,

            done))

        # If the memory exceeds its capacity, drop the oldest transition.

        if len(self.replay_memory_store) > self.memory_size:

            self.replay_memory_store.popleft()

        self.memory_counter += 1

    def step(self, state, action):

        """

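        Execute an action in the environment.

        :param state: current state.

        :param action: action to execute.

        :return: next state, reward, and whether the episode has ended.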

        """

        reward = self.r[state][action]

        next_state = action

        done = False

        if action == 5:

            done = True

        return next_state, reward, done

    def experience_replay(self):

        """

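        Experience replay.

        :return:

        """

        # Sync the target_net parameters every replace_target_stepper training steps.

        if self.learn_step_counter % self.replace_target_stepper == 0:

            self._replace_target_params()

        # Randomly sample a minibatch of stored transitions.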

        batch = self.BATCH if self.memory_counter > self.BATCH else self.memory_counter

        minibatch = random.sample(self.replay_memory_store, batch)

        # Stack each field of the sampled transitions into one array per field.

        batch_state = np.vstack([transition[0] for transition in minibatch])

        batch_action = np.vstack([transition[1] for transition in minibatch])

        batch_reward = np.vstack([transition[2] for transition in minibatch])

        batch_next_state = np.vstack([transition[3] for transition in minibatch])

        batch_done = np.vstack([transition[4] for transition in minibatch])

        # -------------- Improved part -------------- #

        # q_next is computed by the separate target network.

        # q_next: Q values of the next state.

        q_next = self.session.run([self.q_target_output], feed_dict={self.q_target_input: batch_next_state})

        # -------------- Improved part -------------- #

        q_target = []

        for i in range(len(minibatch)):

            # Immediate reward of this transition.

            current_reward = batch_reward[i][0]

            # # Whether the episode has ended.

            # current_done = batch_done[i][0]

            # Q-learning target.

            q_value = current_reward + self.gamma * np.max(q_next[0][i])

            # A reward of -1 or less marks an invalid move, so its target is just the reward.

            if current_reward <= -1:

                q_target.append(current_reward)

            else:

                q_target.append(q_value)

        _, cost, reward = self.session.run([self.train_op, self.loss, self.reward_action],

                                           feed_dict={self.q_eval_input: batch_state,

                                                      self.q_action_input: batch_action,

                                                      self.q_eval_target: q_target})

        self.cost_list.append(cost)

        # if self.step_index % 1000 == 0:

        #     print("loss:", cost)

        self.learn_step_counter += 1

    def train(self):

        """

        Train the agent.

        :return:

        """

        # Pick a random non-goal start state (0 to 4).

        current_state = np.random.randint(0, self.ACTION_NUM - 1)

        self.epsilon = self.INITIAL_EPSILON

        while True:

            # Choose an action.

            action = self.select_action(current_state)

            # Execute it to get the next state, the reward, and whether the episode ended.

            next_state, reward, done = self.step(current_state, action)

            # Store the transition.

            self.save_store(current_state, action, reward, next_state, done)

            # Observe for a while to fill the replay memory before training starts.

            if self.step_index > self.OBSERVE:

                self.experience_replay()

            if self.step_index - self.OBSERVE > 15000:

                break

            if done:

                current_state = np.random.randint(0, self.ACTION_NUM - 1)

            else:

                current_state = next_state

            self.step_index += 1

    def pay(self):

        """

        Train, then run a test from each start room.

        :return:

        """

        self.train()

        # Print the R matrix.

        print(self.r)

        for index in range(5):

            start_room = index

            print("#############################", "Agent starts in room", start_room, "#############################")

            current_state = start_room

            step = 0

            target_state = 5

            while current_state != target_state:

                out_result = self.session.run(self.q_eval_output, feed_dict={

                    self.q_eval_input: self.state_list[current_state:current_state + 1]})

                next_state = np.argmax(out_result[0])

                print("Agent moved from room", current_state, "to room", next_state)

                current_state = next_state

                step += 1

            print("Agent started in room", start_room, "and took", step, "steps to reach target room 5")

            print("#############################", "Agent finished in room", 5, "#############################")

    def show_plt(self):

        import matplotlib.pyplot as plt

        plt.plot(np.array(self.q_list), c='r', label='natural')

        # plt.plot(np.array(q_double), c='b', label='double')

        plt.legend(loc='best')

        plt.ylabel('Q eval')

        plt.xlabel('training steps')

        plt.grid()

        plt.show()

if __name__ == "__main__":

    q_network = DeepQNetwork()

    q_network.pay()

    q_network.show_plt()
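The replay memory above stores a done flag for every transition, but the target computation leaves it commented out and bootstraps from the next state even when the episode has ended. A common alternative, shown here only as a hedged sketch and not as this post's method, is to drop the bootstrap term for terminal transitions. The function name td_targets and its argument layout are assumptions for illustration.

```python
import numpy as np


def td_targets(rewards, next_q, dones, gamma=0.9):
    """Vectorized Q-learning targets (hypothetical helper).

    rewards: shape (n,), immediate rewards.
    next_q:  shape (n, actions), next-state Q values from the target network.
    dones:   shape (n,), True where the transition ended the episode.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    bootstrap = gamma * np.max(np.asarray(next_q, dtype=np.float64), axis=1)
    # Terminal transitions keep only the immediate reward; nothing follows them.
    return rewards + np.where(dones, 0.0, bootstrap)


# Usage: the second transition is terminal, so its target is the reward alone.
targets = td_targets([0.0, 100.0], [[1.0, 2.0], [3.0, 4.0]], [False, True])
```

This differs from the listing, which instead skips the bootstrap term whenever the reward is -1 or less.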
