A decent set of Assignment 3 notes (with answers)

Understanding the RNN cell

Behavior of a single RNN cell

Dimensions are given in parentheses.

Forward pass

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function. The input data has dimension D, the hidden state has
    dimension H, and we use a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    ##############################################################################
    #                              END OF YOUR CODE                              #
    ##############################################################################
    return next_h, cache
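
A quick shape check of the step function (a standalone toy sketch; assumes numpy imported as np and the function above defined, with made-up sizes):

import numpy as np

N, D, H = 3, 10, 4                   # toy sizes (assumptions)
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx, Wh = np.random.randn(D, H), np.random.randn(H, H)
b = np.zeros(H)

next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)                  # (3, 4); entries lie in (-1, 1) because of tanh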

Backward pass

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    x, Wx, Wh, prev_h, next_h = cache
    dtanh = 1 - next_h**2            # local derivative of tanh in terms of its output
    dx = (dnext_h * dtanh).dot(Wx.T)
    dWx = x.T.dot(dnext_h * dtanh)
    dprev_h = (dnext_h * dtanh).dot(Wh.T)
    dWh = prev_h.T.dot(dnext_h * dtanh)
    db = np.sum(dnext_h * dtanh, axis=0)
    ##############################################################################
    #                              END OF YOUR CODE                              #
    ##############################################################################
    return dx, dprev_h, dWx, dWh, db
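
A minimal numeric check of dWx via centered differences (a sketch, not the course's eval_numerical_gradient helper; toy sizes are assumptions):

import numpy as np

N, D, H = 3, 10, 4
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
_, _, dWx, _, _ = rnn_step_backward(dnext_h, cache)

num_dWx = np.zeros_like(Wx)
eps = 1e-6
for i in range(D):
    for j in range(H):
        Wx[i, j] += eps
        fp = np.sum(rnn_step_forward(x, prev_h, Wx, Wh, b)[0] * dnext_h)
        Wx[i, j] -= 2 * eps
        fm = np.sum(rnn_step_forward(x, prev_h, Wx, Wh, b)[0] * dnext_h)
        Wx[i, j] += eps                       # restore the original value
        num_dWx[i, j] = (fp - fm) / (2 * eps)

print(np.max(np.abs(num_dWx - dWx)))          # should be tiny, around 1e-9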

Behavior of a single-layer RNN

x (N, T, D) denotes a batch of N samples, each containing T word vectors, with each vector of dimension D.

An RNN output flows in two directions: one goes up to the next layer (the output layer), and the other goes to the next timestep within the same layer, so during backpropagation the two gradients must be summed. The output-layer gradient can be computed directly (or recursively from the layer above), so it is stored ahead of time in dh (N, T, H), while the within-layer temporal gradient has to be computed recursively inside the layer itself.

Forward pass

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above.                                                                     #
    ##############################################################################
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    h_next = h0
    cache = []
    for i in range(T):
        h[:, i, :], cache_next = rnn_step_forward(x[:, i, :], h_next, Wx, Wh, b)
        h_next = h[:, i, :]
        cache.append(cache_next)
    ##############################################################################
    #                              END OF YOUR CODE                              #
    ##############################################################################
    return h, cache
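
A quick check that rnn_forward matches unrolling rnn_step_forward by hand (toy sizes are assumptions; assumes numpy as np and the functions above):

import numpy as np

N, T, D, H = 2, 3, 4, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)

h, _ = rnn_forward(x, h0, Wx, Wh, b)

ht = h0
for t in range(T):
    ht, _ = rnn_step_forward(x[:, t, :], ht, Wx, Wh, b)
    assert np.allclose(h[:, t, :], ht)        # hidden states agree at every step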

Single-layer RNN backward pass

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above.                                                             #
    ##############################################################################
    x, Wx, Wh, prev_h, next_h = cache[-1]   # peek at the last step's cache to read D
    _, D = x.shape
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    for i in range(T - 1, -1, -1):
        # upstream gradient at step i = gradient from the output layer (dh)
        # plus the temporal gradient flowing back from step i+1 (dprev_h_)
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h_, cache.pop())
        dx[:, i, :] = dx_
        dh0 = dprev_h_
        dWx += dWx_
        dWh += dWh_
        db += db_
    ##############################################################################
    #                              END OF YOUR CODE                              #
    ##############################################################################
    return dx, dh0, dWx, dWh, db

Understanding the image-captioning pipeline

The forward pass proceeds as follows (diagram from the original post omitted).

A few interesting points

Mapping between words and vectors

Two mappings are involved:

  • One maps caption_in word indices to vectors of the output-node dimension; the mapping matrix is a parameter to be learned
  • The other maps an output-node vector back to a word; this uses a dedicated mapping function, while the output node itself changes (is learned)

The first mapping:

captions_in and captions_out are the input and the target (captions_in = captions[:, :-1], captions_out = captions[:, 1:]). Ignoring the batch dimension, each is a 1-D array of word indices, which the matrix We maps into the word-vector space.
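
A quick sanity check of the one-position offset (a standalone toy sketch; the real code slices along axis 1 of the (N, T) caption array):

import numpy as np

captions = np.array([[0, 4, 7, 2, 1]])   # toy indices: <START> w4 w7 w2 <END>
captions_in = captions[:, :-1]           # [[0, 4, 7, 2]] -- fed to the RNN
captions_out = captions[:, 1:]           # [[4, 7, 2, 1]] -- what it should emit
print(captions_in, captions_out)

The embedding transform and its backward pass: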

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out = W[x, :]
    cache = (W, x)
    return out, cache

Note for the backward pass: this is not a standard chain-rule gate. Following the logic of the mapping, gradients for repeated word indices must accumulate; note the usage of np.ufunc.at (here, np.add.at).

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    W, x = cache
    dW = np.zeros_like(W)
    # dW[x] += dout  # this will not work, see the doc of np.add.at
    np.add.at(dW, x, dout)
    return dW
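
Why the commented-out line fails: fancy-index assignment is buffered, so when an index appears more than once only one contribution survives; np.add.at applies the addition unbuffered, once per occurrence. A tiny demonstration:

import numpy as np

dW = np.zeros(3)
idx = np.array([0, 0, 2])            # index 0 appears twice
dout = np.array([1.0, 1.0, 1.0])

bad = dW.copy()
bad[idx] += dout                     # buffered: bad becomes [1., 0., 1.]

good = dW.copy()
np.add.at(good, idx, dout)           # unbuffered: good becomes [2., 0., 1.]
print(bad, good)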

The second mapping:

The usual multi-dimensional y = xW computation:

y = x.reshape(x.shape[0], -1).dot(w) + b                         # keep N; flatten everything after it

The y = xW computation here:

y = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b              # y = x.dot(w) + b gives the same result: np.dot contracts the last axis of x against w

The forward passes above are straightforward either way; the real difference shows up in the gradients, which the two cases handle slightly differently. Once you notice this, it suffices to check the dimensions carefully while deriving (make sure the two factors of each product have compatible shapes, and that the resulting shape matches the formula). For example, in the standard case dout is (N, M) and x.reshape(N, -1) is (N, D), so dw = x.reshape(N, -1).T.dot(dout) comes out (D, M) as required; in the temporal case dout is (N, T, M), so both factors must first be flattened to (N*T, ·).

Code for the first case:

def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) where x[i] is the ith input.
    We multiply this against a weight matrix of shape (D, M) where
    D = \prod_i d_i.

    Inputs:
    - x: Input data, of shape (N, d_1, ..., d_k)
    - w: Weights, of shape (D, M)
    - b: Biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = x.reshape(x.shape[0], -1).dot(w) + b
    cache = (x, w, b)
    return out, cache

def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ..., d_k)
      - w: Weights, of shape (D, M)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d_1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db

Code for the second case:

def temporal_affine_forward(x, w, b):
    """
    Forward pass for a temporal affine layer. The input is a set of D-dimensional
    vectors arranged into a minibatch of N timeseries, each of length T. We use
    an affine function to transform each of those vectors into a new vector of
    dimension M.

    Inputs:
    - x: Input data of shape (N, T, D)
    - w: Weights of shape (D, M)
    - b: Biases of shape (M,)

    Returns a tuple of:
    - out: Output data of shape (N, T, M)
    - cache: Values needed for the backward pass
    """
    N, T, D = x.shape
    M = b.shape[0]
    # out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    out = x.dot(w) + b
    cache = x, w, b, out
    return out, cache

def temporal_affine_backward(dout, cache):
    """
    Backward pass for temporal affine layer.

    Input:
    - dout: Upstream gradients of shape (N, T, M)
    - cache: Values from forward pass

    Returns a tuple of:
    - dx: Gradient of input, of shape (N, T, D)
    - dw: Gradient of weights, of shape (D, M)
    - db: Gradient of biases, of shape (M,)
    """
    x, w, b, out = cache
    N, T, D = x.shape
    M = b.shape[0]
    # dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    # dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
    dx = dout.dot(w.T)
    dw = x.reshape(N * T, D).T.dot(dout.reshape(N * T, M))
    db = dout.sum(axis=(0, 1))
    return dx, dw, db
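
A quick check that the broadcast form and the explicit reshape agree (toy sizes are assumptions):

import numpy as np

N, T, D, M = 2, 3, 4, 5
x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

a = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
c = x.dot(w) + b                  # np.dot contracts the last axis of x against w
print(np.allclose(a, c))          # True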

A function that runs one forward & backward pass; used for training

def loss(self, features, captions):
    """
    Compute training-time loss for the RNN. We input image features and
    ground-truth captions for those images, and use an RNN (or LSTM) to compute
    loss and gradients on all parameters.

    Inputs:
    - features: Input image features, of shape (N, D)
    - captions: Ground-truth captions; an integer array of shape (N, T) where
      each element is in the range 0 <= y[i, t] < V

    Returns a tuple of:
    - loss: Scalar loss
    - grads: Dictionary of gradients parallel to self.params
    """
    # Cut captions into two pieces: captions_in has everything but the last word
    # and will be input to the RNN; captions_out has everything but the first
    # word and this is what we will expect the RNN to generate. These are offset
    # by one relative to each other because the RNN should produce word (t+1)
    # after receiving word t. The first element of captions_in will be the START
    # token, and the first element of captions_out will be the first word.
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]

    # You'll need this
    mask = (captions_out != self._null)

    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

    # Word embedding matrix
    W_embed = self.params['W_embed']

    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
    # In the forward pass you will need to do the following:                   #
    # (1) Use an affine transformation to compute the initial hidden state     #
    #     from the image features. This should produce an array of shape (N, H)#
    # (2) Use a word embedding layer to transform the words in captions_in     #
    #     from indices to vectors, giving an array of shape (N, T, W).         #
    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
    #     process the sequence of input word vectors and produce hidden state  #
    #     vectors for all timesteps, producing an array of shape (N, T, H).    #
    # (4) Use a (temporal) affine transformation to compute scores over the    #
    #     vocabulary at every timestep using the hidden states, giving an      #
    #     array of shape (N, T, V).                                            #
    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
    #     the points where the output word is <NULL> using the mask above.     #
    #                                                                          #
    # In the backward pass you will need to compute the gradient of the loss   #
    # with respect to all model parameters. Use the loss and grads variables   #
    # defined above to store loss and gradients; grads[k] should give the      #
    # gradients for self.params[k].                                            #
    ############################################################################
    captions_in_emb, emb_cache = word_embedding_forward(captions_in, W_embed)
    h_0, feature_cache = affine_forward(features, W_proj, b_proj)
    h, rnn_cache = rnn_forward(captions_in_emb, h_0, Wx, Wh, b)
    temporal_out, temporal_cache = temporal_affine_forward(h, W_vocab, b_vocab)
    loss, dout = temporal_softmax_loss(temporal_out, captions_out, mask)
    dtemp, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dout, temporal_cache)
    drnn, dh0, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dtemp, rnn_cache)
    dfeatures, grads['W_proj'], grads['b_proj'] = affine_backward(dh0, feature_cache)
    grads['W_embed'] = word_embedding_backward(drnn, emb_cache)
    return loss, grads

The function used for inference

A note on how an RNN generates a sentence: RNN training (train) and prediction (inference/test) generally work differently. During training we have labels, so the input at every timestep is the ground-truth word; during inference we can only feed the previous timestep's own output (the word with the highest predicted probability) back in as the next input, or at most take the top few and sample among them. The comparison (see the sketch after this list):

    • Train

      • Feed the ground-truth word as the RNN input; this is also called teacher forcing
      • The more common approach
    • Inference
      • Feed the RNN's output at the previous timestep as the input at the next timestep.
      • If you did this during training as well, training would become considerably harder
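
A runnable toy sketch of the two modes, assuming only numpy (all sizes and weights here are made-up stand-ins, not the model's parameters):

import numpy as np

np.random.seed(0)
N, D, H, V, T = 2, 4, 5, 7, 3                 # toy sizes (assumptions)
W_e = np.random.randn(V, D)                   # embedding matrix
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)
W_v, b_v = np.random.randn(H, V), np.zeros(V)

def step(x, h):                               # one vanilla RNN step
    return np.tanh(x.dot(Wx) + h.dot(Wh) + b)

gt = np.random.randint(0, V, (N, T + 1))      # pretend ground-truth captions

# Teacher forcing (training): the input at step t is ground-truth word t.
h = np.zeros((N, H))
for t in range(T):
    h = step(W_e[gt[:, t]], h)

# Autoregressive (inference): the input at step t is the previous prediction.
h = np.zeros((N, H))
word = gt[:, 0]                               # stands in for <START>
for t in range(T):
    h = step(W_e[word], h)
    word = (h.dot(W_v) + b_v).argmax(axis=1)  # greedy pick of the next word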

def sample(self, features, max_length=30):
    """
    Run a test-time forward pass for the model, sampling captions for input
    feature vectors.

    At each timestep, we embed the current word, pass it and the previous hidden
    state to the RNN to get the next hidden state, use the hidden state to get
    scores for all vocab words, and choose the word with the highest score as
    the next word. The initial hidden state is computed by applying an affine
    transform to the input image features, and the initial word is the <START>
    token.

    For LSTMs you will also have to keep track of the cell state; in that case
    the initial cell state should be zero.

    Inputs:
    - features: Array of input image features of shape (N, D).
    - max_length: Maximum length T of generated captions.

    Returns:
    - captions: Array of shape (N, max_length) giving sampled captions,
      where each element is an integer in the range [0, V). The first element
      of captions should be the first sampled word, not the <START> token.
    """
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    ###########################################################################
    # TODO: Implement test-time sampling for the model. You will need to      #
    # initialize the hidden state of the RNN by applying the learned affine   #
    # transform to the input image features. The first word that you feed to  #
    # the RNN should be the <START> token; its value is stored in the         #
    # variable self._start. At each timestep you will need to do the          #
    # following:                                                              #
    # (1) Embed the previous word using the learned word embeddings           #
    # (2) Make an RNN step using the previous hidden state and the embedded   #
    #     current word to get the next hidden state.                          #
    # (3) Apply the learned affine transformation to the next hidden state to #
    #     get scores for all words in the vocabulary                          #
    # (4) Select the word with the highest score as the next word, writing it #
    #     to the appropriate slot in the captions variable                    #
    #                                                                         #
    # For simplicity, you do not need to stop generating after an <END> token #
    # is sampled, but you can if you want to.                                 #
    #                                                                         #
    # HINT: You will not be able to use the rnn_forward or lstm_forward       #
    # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
    # a loop.                                                                 #
    ###########################################################################
    # Initialize the hidden state of the RNN by applying the learned affine
    # transform to the input image features.
    prev_h, _ = affine_forward(features, W_proj, b_proj)
    # The first word fed to the RNN is the <START> token (N copies of self._start)
    x = np.array([self._start for i in range(N)])
    for t in range(max_length):
        x_emb, _ = word_embedding_forward(x, W_embed)
        next_h, cache = rnn_step_forward(x_emb, prev_h, Wx, Wh, b)
        prev_h = next_h
        vocab_out, vocab_cache = affine_forward(next_h, W_vocab, b_vocab)
        x = vocab_out.argmax(1)
        # Per the docstring, captions[:, 0] holds the first sampled word,
        # not the <START> token.
        captions[:, t] = x
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return captions
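
The note above mentions that instead of always taking the argmax you could sample from the distribution; here is a minimal sketch of replacing the greedy pick (the vocab_out.argmax(1) line) with softmax sampling. The pick_next helper is an illustration, not part of the assignment code:

import numpy as np

def pick_next(vocab_out, greedy=True):
    # vocab_out: (N, V) scores for each sequence in the batch
    if greedy:
        return vocab_out.argmax(axis=1)
    # softmax over scores; subtract the row max for numerical stability
    p = np.exp(vocab_out - vocab_out.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([np.random.choice(len(row), p=row) for row in p])

scores = np.random.randn(2, 5)                # toy (N=2, V=5) scores
print(pick_next(scores), pick_next(scores, greedy=False))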
