week1

一张图片,设像素为64*64, 颜色通道为红蓝绿三通道,则对应3个64*64实数矩阵

为了用向量表示这些矩阵,将这些矩阵的像素值展开为一个向量x作为算法的输入

从红色到绿色再到蓝色,依次按行一个个将元素读到向量x中,则x是一个\(1\times64*64*3\)的矩阵,也就是一个64*64*3维的向量

用 \(n_x = 64*64*3\) 表示特征向量x的维度

而所有的训练样本表示成:\(X = \begin{bmatrix}\mid & \mid &\mid &&\mid \\ x^{(1)}& x^{(2)}& x^{(3)}& \cdots & x^{(m)}\\ \mid & \mid &\mid &&\mid \end{bmatrix}\) (\(n_x \times m\)矩阵)

注意不是\(X = \begin{bmatrix} (x^{(1)})^T\\ \vdots \\ (x^{(m)})^T \end{bmatrix}\) ,用上面的方法运算会简单点)

\(Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}\)

之前的机器学习课上的\(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\)的形式不再使用,而用\(\large b = \theta_0, \; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\)代替( it will be easier to just keep \(b\) and \(w\) as separate parameters )

则output : \(\large \hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b),{\rm where\;}\sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}\)

\(\text{Given \{}(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\text{\}, want } \hat{y}^{(i)} \approx y^{(i)}\)


week2

Loss Function/Error Function

Loss Function/Error Function(误差函数): used to measure how well our algorism is doing

\[{\cal L}(\hat{y},y) = -y\cdot log(\hat{y})-(1-y)\cdot log(1-\hat{y})
\]

Cost Function

\[J(w,b) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,\hat{y}^{(i)})+(1-y^{(i)})\, log\,(1-\hat{y}^{(i)})]
\]

Gradient Descent

​ 看ML的笔记,实质上是一样的

Vectorization:

#Non-vecotrized
#slow
z = 0
for i in range(n_x):
z += w[i] * x[i]
z += b #Vectorized
#import numpy as np
z = np.dot(w,x) + b

whenever possible, avoid explicit for-loops(因为是解释型语言), 用numpy带的行数可以简洁而高效地实现

Vectorizing Logistic Regression

\(X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix}, \mathbb{R}^{n_x \times m}\)

\(Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b &b & \cdots & b \end{bmatrix}\)

\(z^{(i)}\) 是 sigmoid function的输入值

\(A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)\)

(这里的不同上标的元素似乎实际是在同一个layer中的,跟ML课上不大一样。 \(a^{[j](i)}\)中方括号括起来的是层数,圆括号括起来的是第\(i\)个训练实例)

import numpy as np
z = np.dot(w,x) + b\
#Python automatically takes this real number b and expands it out to this 1*m row vector

Gradient Output

\({\rm d}z^{(i)} = a^{(i)} - y^{(i)}\)

\(\begin{align}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\&= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{align}\)

${\rm d}b = $1/m*np.sum(dZ)

\({\rm d}w = \frac{1}{m}X{\rm d}Z^T\)

单次迭代免for-loop法(vectorize):

\[\begin{align}
\downarrow&\begin{cases}
Z & = w^T+b\\
& = {\rm np.dot(}w{\rm .T, }X{\rm)}\\
A & = \sigma(Z)\\
{\rm d}Z &= A-Y \\
{\rm d}w &= \frac{1}{m}X{\rm d}Z^T\\
\end{cases}\\\\
w& := w - \alpha{\rm d}w\\
b &:= b - \alpha{\rm d}b
\end{align}
\]

若要多次迭代,最外层的显式for-loop是不可避免的

Broadcasting

reshape()确保矩阵的尺寸

举个例子说明numpy 的 broadcasting机制:

>>> import numpy as np
>>> a = np.arange(0,6).reshape(6,1)
>>> a
array([[0],
[1],
[2],
[3],
[4],
[5]])
>>> b = np.arange(0,5)
>>> b
array([0, 1, 2, 3, 4])
>>> a * b
array([[ 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4],
[ 0, 2, 4, 6, 8],
[ 0, 3, 6, 9, 12],
[ 0, 4, 8, 12, 16],
[ 0, 5, 10, 15, 20]])
>>> a + b
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8],
[5, 6, 7, 8, 9]])

也就是说matrix+-*/number/vector时,numpy会将number/vector通过自我复制拓展成合法的矩阵

注意这会导致 在期望抛出异常的地方 不抛出异常而是发生奇怪的BUG:

​ 比如 有时我想 行向量和列向量相加时抛出异常, 但是numpy却用broadcasting机制把它给算出来了...

numpy的坑

import numpy as np
a = np.random.randn(5)
>>> a
array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889])
>>> a.shape
(5,)
# which is called a rank 1 array in Python and is neither a row vector nor a column vector >>> a.T
array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889])
# which is same as 'a' i self >>> np.dot(a,a.T)
3.2288264718632416
# it is a number rather than a matrix in expectation(just like array([[55]]))

不要使用形如(5,)或者(n,)这样的“rank 1 array”, 而是显式地说明是\(m \times n\)的矩阵:

>>> a = np.random.randn(5,1)
>>> a
array([[ 0.7643396 ],
[-1.66945103],
[ 1.66235712],
[-0.06892102],
[-1.61347409]])
>>> a.T
array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]])

注意array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889])array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]])的区别(后者有两个方括号), 这说明前者是秩为1的数组而后者是一个真正的\(1 \times 5\)矩阵(就像C里一样矩阵是用二维数组表示的)(另外我觉得rank 1 array翻译为一维数组更为准确)

It can use assert() statement to make sure the dimension of one of vectors.

When you get a rank 1 array, you can use a.reshape to transform it into a (n,1) array or a (1,n) array.

Logistic Regression Cost Function

\[\left.
\begin{array}{l}
\text{If y=1:}\quad p(y|x)=\hat{y}\\
\text{If y=0:}\quad p(y|x)=1-\hat{y}
\end{array}
\right\}
p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\\
\,\\
\begin{align}
\therefore {\rm log}(p(y|x)) &= y\cdot log\,\hat{y} + (1-y)\cdot log\, (1-\hat{y}) \\
&= -\mathcal{L}(\hat{y},y)
\end{align}
\]

所以:

\[\begin{align}
{\rm log }[p(\text{labels in training set})] &= {\rm log } \prod_{i=1}^mp(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m {\rm log\,}p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m-\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\
&=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\end{align}\\
\text{Cost: }J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\]

maximum likelihood estimation (极大似然估计)


week3

\(Z^{[j]} = W^{[j]}A^{[j-1]} + b^{[j]} = w^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}\)

其中\((i) \in [(1),(m)],\quad [j] \in [[1],[n]],\quad X = A^{[0]}\)

Other Activation Function

①\(tanh(z)\) function:

\[a= tanh(z)=\frac{e^z -e^{-z}}{e^z +e^{-z}}\text{ , when } tanh(z) \in (-1,1), tanh(0)=0
\]

​ \(tanh(z)\) 可以把 数据中心化 为 0 (Sigmoid Function 将数据中心化为 0.5)

​ 之后只有 \(0 \le \hat{y} \le 1\) (即二元分类问题)才用 Sigmoid Function,因为\(tanh\)几乎严格优于Sigmoid...

②Rectified Linear Unit(线性整流函数, ReLU):\(Q = max(0,z)\)

​ When not sure what to use for your hidden layer, can use the ReLU function

​ Disadvantage of ReLU: when \(z\) is negative, the value is 0.

​ It can use what names Leaky ReLU to overcome the disadvantage below.

​ Leaky ReLU: \(a = max(0.01z, z)\)

​ ReLU可以使得斜率不变(Sigmoid 和 \(tanh(z)\) 在\(z\rightarrow \infin\)时斜率趋向于0,会使得学习速度下降)

​ 最常用的 Activation Function

③Tannish Function(双曲函数)

当且仅当要解决回归问题的时候,在生成到output layer才使用线性的Activation Function(\(g(z)=z\)) ,比如预测房价时,y不限于 0 和 1(\(y \in \mathbb{R}\)),所以可以用\(g(z)=z\) 输出,隐藏单元不应该使用Linear Activation Function, 而是应该使用tanh/ReLU/Leaky ReLU

Derivatives of Activation Functions

  • Sigmoid:

    • \(\frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))\)

      \(tanh(z)\):
    • \(g\prime(z) = 1-(tanh(z))^2\)
  • ReLU:
    • \(g\prime(z) = \begin{cases}1, \text{if }z\ge0 \\0, \text{if }z\lt0 \end{cases}\)

Gradient Descents For Neural Networks

Parameters : \(w^{[1]},b^{[1]},w^{[2]},b^{[2]}\)

Cost Function : \(J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y},y)\)

Gradient Function:

\[\begin{align}
&\text{Repeat \{}\\
&\quad \text{compute predicts} (\hat{y}^{(i)}, i = 1,\dots,m) \\
&\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}}, {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\dots\\
&\quad w^{[1]} = w^{[1]} - \alpha {\rm d}w^{[1]}\\
&\quad b^{[1]} = b^{[1]} - \alpha {\rm d}b^{[1]}\\
&\quad w^{[2]} = w^{[2]} - \alpha {\rm d}w^{[2]}\\
&\quad b^{[2]} = b^{[2]} - \alpha {\rm d}b^{[2]}\\
\text{\}}
\end{align}
\]

Forward Propagation :

\[\begin{align}
Z^{[1]} &= w^{[1]}X + b^{[1]}\\
A^{[1]} &= g^{[1]}(z^{[1]})\\
Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} &= g^{[2]}(z^{[2]}) = \sigma(Z^{[2]})
\end{align}
\]

Backward Propagation :

\[\begin{align}
{\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{[1]} & y^{[2]} & \dots & y^{[m]}\end{bmatrix}\\
{\rm d}w^{[2]} &= \frac{1}{m} {\rm d}z^{[2]} A^{[1]T}\\
{\rm d}d^{[2]} &= \frac{1}{m}\text{np.sum(d}z^{[2]}\text{,axis=1,keepdims=True)}\\
{\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\; .* \; g^{[1]\prime}(Z^{[1]})\\
{\rm d}w^{[1]} &= \frac{1}{m} {\rm d}Z^{[1]}X^T\\
{\rm d}d^{[1]} &= \frac{1}{m}\text{np.sum(d}z^{[1]}\text{,axis=1,keepdims=True)}\\
\end{align}
\]

注:axis = 1 means summing horizontally, and keepdims = True means prevent from outputting Rank 1 Array. You can call reshape function explicitly rather than keeping these parameters.

又注:\(由于A^{[1]} = g^{[1]}(Z^{[1]})且g^{[1]\prime}(z) = 1-a^2,\;所以 g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\), 即:\(Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\; .* \; (1-(A^{[1]})^2\)

Random Initialization

For a neural network, if initialize the weights to parameters to all zero and then apply gradient descent, it won't work.

Deep Learning--week1~week3的更多相关文章

  1. Coursera, Deep Learning 1, Neural Networks and Deep Learning - week1, Introduction to deep learning

    整个deep learing 系列课程主要包括哪些内容 Intro to Deep learning

  2. Neural Networks and Deep Learning(week3)Planar data classification with one hidden layer(基于单隐藏层神经网络的平面数据分类)

    Planar data classification with one hidden layer 你会学习到如何: 用单隐层实现一个二分类神经网络 使用一个非线性激励函数,如 tanh 计算交叉熵的损 ...

  3. 【DeepLearning学习笔记】Coursera课程《Neural Networks and Deep Learning》——Week1 Introduction to deep learning课堂笔记

    Coursera课程<Neural Networks and Deep Learning> deeplearning.ai Week1 Introduction to deep learn ...

  4. Coursera, Deep Learning 4, Convolutional Neural Networks - week1

    CNN 主要解决 computer vision 问题,同时解决input X 维度太大的问题. Edge detection 下面演示了convolution 的概念 下图的 vertical ed ...

  5. Coursera Deep Learning 2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization - week1, Assignment(Gradient Checking)

    声明:所有内容来自coursera,作为个人学习笔记记录在这里. Gradient Checking Welcome to the final assignment for this week! In ...

  6. Coursera Deep Learning 2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization - week1, Assignment(Regularization)

    声明:所有内容来自coursera,作为个人学习笔记记录在这里. Regularization Welcome to the second assignment of this week. Deep ...

  7. Deep learning:五十一(CNN的反向求导及练习)

    前言: CNN作为DL中最成功的模型之一,有必要对其更进一步研究它.虽然在前面的博文Stacked CNN简单介绍中有大概介绍过CNN的使用,不过那是有个前提的:CNN中的参数必须已提前学习好.而本文 ...

  8. 【深度学习Deep Learning】资料大全

    最近在学深度学习相关的东西,在网上搜集到了一些不错的资料,现在汇总一下: Free Online Books  by Yoshua Bengio, Ian Goodfellow and Aaron C ...

  9. 《Neural Network and Deep Learning》_chapter4

    <Neural Network and Deep Learning>_chapter4: A visual proof that neural nets can compute any f ...

  10. Deep Learning模型之:CNN卷积神经网络(一)深度解析CNN

    http://m.blog.csdn.net/blog/wu010555688/24487301 本文整理了网上几位大牛的博客,详细地讲解了CNN的基础结构与核心思想,欢迎交流. [1]Deep le ...

随机推荐

  1. 通过Ajax方式上传文件(input file),使用FormData进行Ajax请求

    <script type="text/jscript"> $(function () { $("#btn_uploadimg").click(fun ...

  2. Mac上配置GTK环境

    Mac上配置GTK环境 安装command line工具, 如果安装了Xcode, 就直接跳过该步骤 安装Homebrew 使用brew install pkg-config 使用brew insta ...

  3. 多个.txt文件合并到一个.txt文件中

    如果想要将多个.txt文件合并到一个.txt文件中,可以先将所有.txt文件放到一个文件夹中,然后使用.bat文件完成任务. 例如,在一个文件夹下有1.txt, 2.txt, 3.txt三个文件,想把 ...

  4. python NLTK安装

    stanford nltk在python中如何安装使用一直都很神秘,看了一些帖子感觉讳莫如深.研究了几天,参考<nlp汉语自然语言处理原理与实践>,发现方法如下: 1.安装JAVA 8+环 ...

  5. ReLU激活函数的缺点

    训练的时候很”脆弱”,很容易就”die”了,训练过程该函数不适应较大梯度输入,因为在参数更新以后,ReLU的神经元不会再有激活的功能,导致梯度永远都是零. 例如,一个非常大的梯度流过一个 ReLU 神 ...

  6. jq点击事件不生效,效果只闪现一次又立马消失的原因?

    出现的问题:jq点击事件不生效,点击的时候效果实现但又立马消失,页面重新刷新了一次 可能出现的原因: a标签href属性的原因,虽然点击事件生效,但页面又刷新了一次,所以没有效果,只闪了一次 解决方案 ...

  7. cocos creator 重写源码按钮Button点击音频封装

    (function(){ var Super = function(){}; Super.prototype = cc.Button.prototype; //实例化原型 Super.prototyp ...

  8. Android通过Chrome Inspect调试WebView出现404页面的解决方法

    无论是调试Web页面还是调试Hybrid混合应用,只要是调试Android的webview,都需要使用Chrome://inspect进行调试.但是国内开发者会出现404 Not Found错误: 解 ...

  9. javascript中计算两个时间日期间隔的天数

    <script>              /*                  计算两个日期的时间间隔天数              */              //时间字符串的格 ...

  10. 软件测试之Soot

    详情请见:https://github.com/fogmisty/SoftwareTest