In the previous post I go through basic 1-layer Neural Network with sigmoid activation function, including

  • How to get sigmoid function from a binary classification problem?

  • NN is still an optimization problem, so what's the target to optimize? - cost function

  • How does model learn?- gradient descent

  • Work flow of NN? - Backward/Forward propagation

Now let's get deeper to 2-layers Neural Network, from where you can have as many hidden layers as you want. Also let's try to vectorize everything.

1. The architecture of 2-layers shallow NN

Below is the architecture of 2-layers NN, including input layer, one hidden layer and one output layer. The input layer is not counted.

(1) Forward propagation

In each neuron, there are 2 activities going on after take in the input from previous hidden layer:

  1. a linear transformation of the input
  2. a non-linear activation function applied after

Then the ouput will pass to the next hidden layer as input.

From input layer to output layer, do above computation layer by layer is forward propagation. It tries to map each input \(x \in R^n\) to $ y$.

For each training sample, the forward propagation is defined as following:

\(x \in R^{n*1}\) denotes the input data. In the picture n = 4.

\((w^{[1]} \in R^{k*n},b^{[1]}\in R^{k*1})\) is the parameter in the first hidden layer. Here k = 3.

\((w^{[2]} \in R^{1*k},b^{[2]}\in R^{1*1})\) is the parameter in the output layer. The output is a binary variable with 1 dimension.

\((z^{[1]} \in R^{k*1},z^{[2]}\in R^{1*1})\) is the intermediate output after linear transformation in the hidden and output layer.

\((a^{[1]} \in R^{k*1},a^{[2]}\in R^{1*1})\) is the output from each layer. To make it more generalize we can use \(a^{[0]} \in R^n\) to denote \(x\)

*Here we use \(g(x)\) as activation function for hidden layer, and sigmoid \(\sigma(x)\) for output layer. we will discuss what are the available activation functions \(g(x)\) out there in the following post. What happens in forward propagation is following:

\([1]\) \(z^{[1]} = {w^{[1]}} a^{[0]} + b^{[1]}\)
\([2]\) \(a^{[1]} = g((z^{[1]} ) )\)
\([3]\) \(z^{[2]} = {w^{[2]}} a^{[1]} + b^{[2]}\)
\([4]\) \(a^{[2]} = \sigma(z^{[2]} )\)

(2) Backward propagation

After forward propagation, for each training sample \(x\) is done ,we will have a prediction \(\hat{y}\). Comparing \(\hat{y}\) with \(y\), we then use the error between prediction and real value to update the parameter via gradient descent.

Backward propagation is passing the gradient descent from output layer back to input layer using chain rule like below. The deduction is in the previous post.

\[ \frac{\partial L(a,y)}{\partial w} =
\frac{\partial L(a,y)}{\partial a} \cdot
\frac{\partial a}{\partial z} \cdot
\frac{\partial z}{\partial w}\]

\([4]\) \(dz^{[2]} = a^{[2]} - y\)
\([3]\) \(dw^{[2]} = dz^{[2]} a^{[1]T}\)
\([3]\) \(db^{[2]} = dz^{[2]}\)
\([2]\) \(dz^{[1]} = da^{[1]} * g^{[1]'}(z[1]) = w^{[2]T} dz^{[2]}* g^{[1]'}(z[1])\)
\([1]\) \(dw^{[1]} = dz^{[1]} a^{[0]T}\)
\([1]\) \(db^{[1]} = dz^{[1]}\)

2. Vectorize and Generalize your NN

Let's derive the vectorize representation of the above forward and backward propagation. The usage of vector is to speed up the computation. We will talk about this again in batch gradient descent.

\(w^{[1]},b^{[1]}, w^{[2]}, b^{[2]}\) stays the same. Generally \(w^{[i]}\) has dimension \((h_{i},h_{i-1})\) and \(b^{[i]}\) has dimension \((h_{i},1)\)

\(Z^{[1]} \in R^{k*m}, Z^{[2]} \in R^{1*m}, A^{[0]} \in R^{n*m}, A^{[1]} \in R^{k*m}, A^{[2]}\in R^{1*m}\) where \(A^{[0]}\)is the input vector, each column is one training sample.

(1) Forward propogation

Follow above logic, vectorize representation is below:

\([1]\) \(Z^{[1]} = {w^{[1]}} A^{[0]} + b^{[1]}\)
\([2]\) \(A^{[1]} = g((Z^{[1]} ) )\)
\([3]\) \(Z^{[2]} = {w^{[2]}} A^{[1]} + b^{[2]}\)
\([4]\) \(A^{[2]} = \sigma(Z^{[2]} )\)

Have you noticed that the dimension above is not a exact matched?
\({w^{[1]}} A^{[0]}\) has dimension \((k,m)\), \(b^{[1]}\) has dimension \((k,1)\).
However Python will take care of this for you with Broadcasting. Basically it will replicate the lower dimension to the higher dimension. Here \(b^{[1]}\) will be replicated m times to become \((k,m)\)

(1) Backward propogation

Same as above, backward propogation will be:
\([4]\) \(dZ^{[2]} = A^{[2]} - Y\)
\([3]\) \(dw^{[2]} =\frac{1}{m} dZ^{[2]} A^{[1]T}\)
\([3]\) \(db^{[2]} = \frac{1}{m} \sum{dZ^{[2]}}\)
\([2]\) \(dZ^{[1]} = dA^{[1]} * g^{[1]'}(z[1]) = w^{[2]T} dZ^{[2]}* g^{[1]'}(z[1])\)
\([1]\) \(dw^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T}\)
\([1]\) \(db^{[1]} = \frac{1}{m} \sum{dZ^{[1]} }\)

In the next post, I will talk about some other details in NN, like hyper parameter, activation function.

To be continued.


Reference

  1. Ian Goodfellow, Yoshua Bengio, Aaron Conrville, "Deep Learning"
  2. Deeplearning.ai https://www.deeplearning.ai/

DeepLearning - Forard & Backward Propogation的更多相关文章

  1. Deeplearning - Overview of Convolution Neural Network

    Finally pass all the Deeplearning.ai courses in March! I highly recommend it! If you already know th ...

  2. DeepLearning - Regularization

    I have finished the first course in the DeepLearnin.ai series. The assignment is relatively easy, bu ...

  3. Coursera机器学习+deeplearning.ai+斯坦福CS231n

    日志 20170410 Coursera机器学习 2017.11.28 update deeplearning 台大的机器学习课程:台湾大学林轩田和李宏毅机器学习课程 Coursera机器学习 Wee ...

  4. DeepLearning - Overview of Sequence model

    I have had a hard time trying to understand recurrent model. Compared to Ng's deep learning course, ...

  5. back propogation 的线代描述

    参考资料: 算法部分: standfor, ufldl  : http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial 一文弄懂BP:https: ...

  6. DeepLearning Intro - sigmoid and shallow NN

    This is a series of Machine Learning summary note. I will combine the deep learning book with the de ...

  7. 用纯Python实现循环神经网络RNN向前传播过程(吴恩达DeepLearning.ai作业)

    Google TensorFlow程序员点赞的文章!   前言 目录: - 向量表示以及它的维度 - rnn cell - rnn 向前传播 重点关注: - 如何把数据向量化的,它们的维度是怎么来的 ...

  8. 吴恩达DeepLearning.ai的Sequence model作业Dinosaurus Island

    目录 1 问题设置 1.1 数据集和预处理 1.2 概览整个模型 2. 创建模型模块 2.1 在优化循环中梯度裁剪 2.2 采样 3. 构建语言模型 3.1 梯度下降 3.2 训练模型 4. 结论   ...

  9. Sql Server 聚集索引扫描 Scan Direction的两种方式------FORWARD 和 BACKWARD

    最近发现一个分页查询存储过程中的的一个SQL语句,当聚集索引列的排序方式不同的时候,效率差别达到数十倍,让我感到非常吃惊 由此引发出来分页查询的情况下对大表做Clustered Scan的时候, 不同 ...

随机推荐

  1. 课时8.HTML作用(掌握)

    什么是HTML? HTML其实是HyperText Markup Language的缩写,超文本标记语言 如何重命名文件? 点击右键重命名 点击F2 首先利用记事本保存了一个标题和两段描述,然后修改纯 ...

  2. 算法是什么(二)手写个链表(java)

    算法是什么(二)手写个链表(java)   liuyuhang原创,未经允许禁止转载 目录 算法是什么(〇) 很多语言的API中都提供了链表实现,或者扩展库中实现了链表. 但是更多的情况下,Map(或 ...

  3. ARC下IBOutlet用weak还是strong

    原文来自这里. 今天用Xcode5的时候,发现默认的IBoutlet的属性设置为weak——因为Xcode5建立的工程都是ARC的了.但是当时还有点不明白,因为项目的原因,一直没有正式使用过ARC.于 ...

  4. MySql is marked as crashed and should be repaired问题

    在一次电脑不知道为什么重启之后数据库某表出现了 is marked as crashed and should be repaired这个错误,百度了一下,很多都是去找什么工具然后输入命令之类的,因为 ...

  5. 类似"音速启动"的原创工具简码"万能助手"在线用户数终于突破100了!

    原本只是开发出来方便自己的一个小工具,看到群友也喜欢,就随手分享了, 经过1个多月的自然积累,在线用户数终于突破100了,这增长速度实在让人泪奔~ 博客园的朋友如果看到,喜欢的话就拿去用吧, 万能助手 ...

  6. C++笔记011:C++对C的扩展——变量检测增强

    原创笔记,转载请注明出处! 点击[关注],关注也是一种美德~ 在C语言中重复定义多个同名的变量是合法的,多个同名的全局变量最终会被链接到全局数据区的同一个地址空间上. 在C++中,不允许定义多个同名的 ...

  7. 利用MFC Picture Control控件 加载bmp,png

    1.在资源视图,选择PictureControl,并且在属性中把Type设置为Bitmap. 2.加载PNG CStatic* pWnd = (CStatic*)GetDlgItem(IDC_PIC) ...

  8. [SHELL]软件管理

  9. keepalived+nginx+tomcat+redis实现负载均衡和session共享(原创)

    keepalived+nginx+tomcat+redis实现负载均衡和session共享 直接上链接,码了一天,就不再重写了,希望能帮到大家,有问题欢迎留言交流.

  10. jquery输入框动态查询l<li></li>列表

    1.html代码如下: 2.js代码如下: $('#search').bind('input propertychange', function () { searchKpoint(); }); fu ...