Analyzing LSTM cell behavior

The LSTM equations can be written as follows:

$$
\begin{aligned}
i_t &= \mathrm{sigmoid}(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\
f_t &= \mathrm{sigmoid}(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\
o_t &= \mathrm{sigmoid}(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\
g_t &= \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g) \\
c_t &= f_t \circ c_{t-1} + i_t \circ g_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

What feels novel here is the use of element-wise multiplication by "gates" to control which parts of the data flow are kept or discarded, which is slightly reminiscent of how activations work in convolutional networks.
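As a tiny illustration (the values here are made up), an element-wise product with sigmoid-like gate values passes some components through almost unchanged and suppresses others:

import numpy as np

gate = np.array([0.95, 0.05, 0.5])    # outputs of a sigmoid acting as a gate
signal = np.array([2.0, -3.0, 4.0])   # candidate values trying to flow through
print(gate * signal)                  # [ 1.9  -0.15  2. ] -- kept, suppressed, halved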

For the backward pass, it is clearest to apply the chain rule and work backwards one variable at a time.

During the backward pass, pay attention to the c_t node: it is both an output of this cell and an input to the cell's other output h_t, so its gradient is the sum of two parts — the gradient passed back from the next timestep and the gradient backpropagated through h_t.
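Concretely, in the notation of the formulas above, the total gradient reaching c_t is

$$
\frac{\partial L}{\partial c_t} \;=\; \delta c_t^{\text{next}} \;+\; \frac{\partial L}{\partial h_t} \circ o_t \circ \big(1 - \tanh^2(c_t)\big)
$$

where $\delta c_t^{\text{next}}$ is the cell-state gradient handed back from timestep $t+1$. This is exactly the dnext_c update in the backward code further below.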

Forward pass

Forward pass for a single LSTM cell

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: Previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    next_h, next_c, cache = None, None, None
    #############################################################################
    # TODO: Implement the forward pass for a single timestep of an LSTM.        #
    # You may want to use the numerically stable sigmoid implementation above.  #
    #############################################################################
    _, H = prev_h.shape
    # one big affine transform, then split into the four gate pre-activations
    a = x.dot(Wx) + prev_h.dot(Wh) + b
    i, f, o, g = sigmoid(a[:, :H]), sigmoid(a[:, H:2*H]), sigmoid(a[:, 2*H:3*H]), np.tanh(a[:, 3*H:])
    next_c = f * prev_c + i * g       # forget part of the old cell state, write the new candidate
    next_h = o * np.tanh(next_c)      # output gate reads from the squashed cell state
    cache = [i, f, o, g, x, prev_h, prev_c, Wx, Wh, b, next_c]
    return next_h, next_c, cache
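A quick shape check of the step function on random data (a plain sigmoid is defined here as a stand-in for the assignment's numerically stable version; all numbers are arbitrary):

import numpy as np

def sigmoid(x):                      # simple stand-in for the assignment's stable sigmoid
    return 1.0 / (1.0 + np.exp(-x))

N, D, H = 2, 3, 4
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
prev_c = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.zeros(4 * H)

next_h, next_c, _ = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
print(next_h.shape, next_c.shape)    # (2, 4) (2, 4)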

Forward pass for an LSTM layer (full sequence)

def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data.

    We assume an input sequence composed of T vectors, each of dimension D. The
    LSTM uses a hidden size of H, and we work over a minibatch containing N
    sequences. After running the LSTM forward, we return the hidden states for
    all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    #############################################################################
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
    # You should use the lstm_step_forward function that you just defined.      #
    #############################################################################
    N, T, D = x.shape
    next_c = np.zeros_like(h0)   # initial cell state is all zeros
    next_h = h0
    h, cache = [], []
    for i in range(T):
        next_h, next_c, cache_step = lstm_step_forward(x[:, i, :], next_h, next_c, Wx, Wh, b)
        h.append(next_h)
        cache.append(cache_step)
    # note: after stacking, h has shape (T, N, H), so it must be transposed to (N, T, H)
    h = np.array(h).transpose(1, 0, 2)
    return h, cache
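Running the whole-sequence forward pass on random data (again assuming sigmoid and lstm_step_forward from above are in scope; the shapes are the only point here):

import numpy as np

N, T, D, H = 2, 5, 3, 4
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, 4 * H), np.random.randn(H, 4 * H), np.zeros(4 * H)

h, cache = lstm_forward(x, h0, Wx, Wh, b)
print(h.shape)       # (2, 5, 4): one hidden state per timestep
print(len(cache))    # 5: one cache entry per timestep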

Backward pass

Note that in the actual backward pass, the initial gradient for the cell state c is something we initialize ourselves (to zeros), whereas the gradient for h is inherited from the layer above (the classification layer, or the layer that maps h to the vocabulary; the h path here works the same way as in an RNN).

Backward pass for a single LSTM cell

def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for a single timestep of an LSTM.       #
    #                                                                           #
    # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
    # the output value from the nonlinearity.                                   #
    #############################################################################
    i, f, o, g, x, prev_h, prev_c, Wx, Wh, b, next_c = cache
    do = dnext_h * np.tanh(next_c)
    # the cell-state gradient has two parts, as analyzed above: the gradient arriving
    # from the next timestep plus the gradient flowing back through h
    # (written without += so the caller's array is not modified in place)
    dnext_c = dnext_c + dnext_h * o * (1 - np.tanh(next_c)**2)
    di, df, dg, dprev_c = g * dnext_c, prev_c * dnext_c, i * dnext_c, f * dnext_c
    # backprop through the gate nonlinearities into the pre-activation a
    da = np.concatenate([i*(1-i)*di, f*(1-f)*df, o*(1-o)*do, (1-g**2)*dg], axis=1)
    db = np.sum(da, axis=0)
    dx, dWx, dprev_h, dWh = da.dot(Wx.T), x.T.dot(da), da.dot(Wh.T), prev_h.T.dot(da)
    return dx, dprev_h, dprev_c, dWx, dWh, db
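To sanity-check the step backward pass, a centered-difference comparison can be used. The num_grad helper below is a hypothetical stand-in for the assignment's eval_numerical_gradient_array; sigmoid, lstm_step_forward, and lstm_step_backward from above are assumed to be in scope.

import numpy as np

def num_grad(f, x, df, eps=1e-5):
    # hypothetical centered-difference helper: gradient of sum(f(x) * df) w.r.t. x
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + eps
        pos = f(x)
        x[ix] = old - eps
        neg = f(x)
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * eps)
        it.iternext()
    return grad

N, D, H = 2, 3, 4
x, prev_h, prev_c = np.random.randn(N, D), np.random.randn(N, H), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, 4 * H), np.random.randn(H, 4 * H), np.random.randn(4 * H)
dnext_h, dnext_c = np.random.randn(N, H), np.zeros((N, H))

_, _, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
dx = lstm_step_backward(dnext_h, dnext_c, cache)[0]
dx_num = num_grad(lambda v: lstm_step_forward(v, prev_h, prev_c, Wx, Wh, b)[0], x, dnext_h)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8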

Backward pass for an LSTM layer (full sequence)

def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
    # You should use the lstm_step_backward function that you just defined.     #
    #############################################################################
    N, T, H = dh.shape
    _, D = cache[0][4].shape   # cache[0][4] is x at the first timestep, of shape (N, D)
    dx = []
    dh0 = np.zeros((N, H), dtype='float32')
    dWx = np.zeros((D, 4*H), dtype='float32')
    dWh = np.zeros((H, 4*H), dtype='float32')
    db = np.zeros(4*H, dtype='float32')
    step_dprev_h, step_dprev_c = np.zeros((N, H)), np.zeros((N, H))
    for i in range(T-1, -1, -1):
        step_dx, step_dprev_h, step_dprev_c, step_dWx, step_dWh, step_db = \
            lstm_step_backward(dh[:, i, :] + step_dprev_h, step_dprev_c, cache[i])
        dx.append(step_dx)   # every input node gets its own gradient
        dWx += step_dWx      # parameters are shared across the layer, so accumulate
        dWh += step_dWh      # parameters are shared across the layer, so accumulate
        db += step_db        # parameters are shared across the layer, so accumulate
    # only the very first h0 (e.g. the projected image feature in image captioning) keeps its gradient
    dh0 = step_dprev_h
    # dx was collected from t = T-1 down to 0, so reverse it and transpose from (T, N, D) to (N, T, D)
    dx = np.array(dx[::-1]).transpose((1, 0, 2))
    return dx, dh0, dWx, dWh, db
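Putting forward and backward together over a whole sequence (random data; assumes the functions above and a sigmoid are in scope). Here dh plays the role of the gradient arriving from the layer above, as noted earlier, while the initial cell-state gradient is handled internally:

import numpy as np

N, T, D, H = 2, 5, 3, 4
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, 4 * H), np.random.randn(H, 4 * H), np.zeros(4 * H)

h, cache = lstm_forward(x, h0, Wx, Wh, b)
dh = np.random.randn(*h.shape)            # gradient handed down from the layer above
dx, dh0, dWx, dWh, db = lstm_backward(dh, cache)
print(dx.shape, dh0.shape, dWx.shape, db.shape)   # (2, 5, 3) (2, 4) (3, 16) (16,)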
