Machine Learning Week_5 Cost Function and BackPropagation
For the backpropagation algorithm, the formulas given by the teacher are really useful.
But formulas alone don't tell you why you are doing each step, including what delta means. The best way to find out is to work through a small neural network by hand, using the chain rule for derivatives: compute the derivative for each θ once, then put the results together to see how the vectorized implementation works.
There is no hidden meaning. There are just the laws of arithmetic.
0 Neural Networks: Learning
In Week 5, you will be learning how to train Neural Networks. The Neural Network is one of the most powerful learning algorithms (when a linear classifier doesn't work, this is what I usually turn to), and this week's videos explain the 'backpropagation' algorithm for training these models. In this week's programming assignment, you'll also get to implement this algorithm and see it work for yourself.
The Neural Network programming exercise will be one of the more challenging ones of this class. So please start early and do leave extra time to get it done, and I hope you'll stick with it until you get it to work! As always, if you get stuck on the quiz and programming assignment, you should post on the Discussions to ask for help. (And if you finish early, I hope you'll go there to help your fellow classmates as well.) -- by Andrew Ng
1 Cost Function and BackPropagation
1.1 Cost Function
Let's first define a few variables that we will need to use:
L = total number of layers in the network
\(s_l\) = number of units (not counting bias unit) in layer l
K = number of output units/classes
Binary classification: y = 0 or y = 1, K = 1;
Multi-class classification: K ≥ 3;
Recall that in neural networks, we may have many output nodes. We denote \(h_\Theta(x)_k\) as being a hypothesis that results in the \(k^{th}\) output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:
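\[J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2\]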
For neural networks, it is going to be slightly more complicated:
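\[J(\Theta) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \left[ y_k^{(i)}\log((h_\Theta(x^{(i)}))_k) + (1 - y_k^{(i)})\log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2\]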
We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
The lecture notes explain the regularization part differently from what the teacher says in the video, so I make some corrections here.
Teacher
In the regularization part, concretely, we don't sum over the terms corresponding to i equal to 0. These are kind of like bias units, and by analogy to what we did for logistic regression, we won't sum over those terms in our regularization term because we don't want to regularize them and shrink their values toward zero. But this is just one possible convention: even if you were to sum over i from 0 up to \(s_l\), it would work about the same and wouldn't make a big difference. The convention of not regularizing the bias term is just slightly more common. This corresponds to the formula above.
Lecture
\]
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
Note:
the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i
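To make the double and triple sums concrete, here is a minimal Octave sketch that evaluates this cost for a tiny three-layer network. The data, the layer sizes, the weight values and lambda = 1 are all made up for illustration; only the structure of the formula comes from the notes above, and the bias column is left out of the regularization term, following the teacher's convention.

```octave
% Sketch: regularized cost for a 3-layer network (one hidden layer).
% All sizes and values are made-up illustration data.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];              % m = 4 examples, 2 features
Y = [1 0 0; 0 1 0; 0 0 1; 1 0 0];      % one-hot labels, K = 3 classes
m = size(X, 1);
lambda = 1;

Theta1 = [0.1 -0.2 0.3; -0.4 0.5 0.6];                  % 2 hidden units x (2 inputs + bias)
Theta2 = [0.2 0.1 -0.3; -0.1 0.4 0.2; 0.3 -0.2 0.1];    % 3 outputs x (2 hidden + bias)

% Forward propagation for all examples at once.
A1 = [ones(m, 1) X];
A2 = [ones(m, 1) sigmoid(A1 * Theta1')];
H  = sigmoid(A2 * Theta2');            % m x K, entry (i,k) is h_Theta(x^(i))_k

% Double sum: over training examples i and output units k.
J_data = -(1 / m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));

% Triple sum: over layers and all weights, skipping the bias column.
J_reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));

J = J_data + J_reg;
fprintf('J = %f (data %f + regularization %f)\n', J, J_data, J_reg);
```

Setting lambda = 0 leaves only the double sum, which is a handy way to check the unregularized part on its own.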
1.2 Backpropagation Algorithm
"Backpropagation" is neural-network terminology for the algorithm that computes the gradient we need in order to minimize our cost function, just as we did with gradient descent in logistic and linear regression. Our goal is to compute:
\[\min_\Theta J(\Theta)\]
That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we'll look at the equations we use to compute the partial derivative of J(Θ):
\[\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)\]
To do so, we use the following algorithm:
Back propagation Algorithm
Given training set \(\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace\)
- Set \(\Delta^{(l)}_{i,j}\) := 0 for all (l,i,j), (hence you end up having a matrix full of zeros)
For training example t =1 to m:
Set \(a^{(1)} := x^{(t)}\)
Perform forward propagation to compute \(a^{(l)}\) for l=2,3,…,L
Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\)
Where L is our total number of layers and \(a^{(L)}\) is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:
- Compute \(\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}\) using \(\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})\)
The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by \(z^{(l)}\).
The g-prime derivative terms can also be written out as:
\[g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})\]
- \(\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}\) or with vectorization, \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T\)
Hence we update our new \(\Delta\) matrix.
- \(D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)\) , if j≠0.
- \(D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}\) If j=0.
The capital-delta matrix \(\Delta\) is used as an "accumulator" to add up our values as we go along; dividing by m (and adding the regularization term) then gives D, which is our partial derivative. Thus we get \(\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)
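The steps of the algorithm can be sketched in Octave for a small network with L = 3. This is only an illustration under assumed values: the training data, the 2-2-1 layer sizes, the weights and lambda = 0.5 are invented here, and the variable names (Delta1, D1, d2, ...) are my own choices, not names from the course code.

```octave
% Sketch: backpropagation for a 3-layer network, looping over examples.
% Data, layer sizes and weights are made-up illustration values.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];          % m = 4 examples, 2 features
y = [0; 1; 1; 0];                  % binary labels, K = 1
m = size(X, 1);
lambda = 0.5;

Theta1 = [0.1 -0.2 0.3; -0.4 0.5 0.6];   % 2 hidden units x (2 inputs + bias)
Theta2 = [0.2 0.7 -0.3];                 % 1 output x (2 hidden + bias)

Delta1 = zeros(size(Theta1));      % accumulators start as matrices of zeros
Delta2 = zeros(size(Theta2));

for t = 1:m
  % Forward propagation
  a1 = [1; X(t, :)'];              % a(1) = x(t), with the bias unit added
  a2 = [1; sigmoid(Theta1 * a1)];
  a3 = sigmoid(Theta2 * a2);

  % Error terms, from the output layer backwards
  d3 = a3 - y(t);
  d2 = (Theta2' * d3) .* a2 .* (1 - a2);
  d2 = d2(2:end);                  % drop the entry that belongs to the bias unit

  % Accumulate: Delta(l) := Delta(l) + delta(l+1) * a(l)'
  Delta2 = Delta2 + d3 * a2';
  Delta1 = Delta1 + d2 * a1';
end

% Average, then regularize every column except the bias column (j = 0)
D1 = Delta1 / m;
D2 = Delta2 / m;
D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
```

D1 and D2 have the same shapes as Theta1 and Theta2, so each entry lines up with the weight whose partial derivative it holds.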
1.3 Backpropagation Intuition
Note: [4:39 - the last term in the calculation of \(z^3_1\) (three-color handwritten formula) should be \(a^2_2\) instead of \(a^2_1\). 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be \((1-y^{(i)})\log(1-h_\theta(x^{(i)}))\). 8:50 - \(\delta^{(4)} = y - a^{(4)}\) is incorrect and should be \(\delta^{(4)} = a^{(4)} - y\).]
Recall that the cost function for a neural network is:
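\[J(\Theta) = -\frac{1}{m}\sum_{t=1}^m\sum_{k=1}^K \left[ y_k^{(t)}\log((h_\Theta(x^{(t)}))_k) + (1 - y_k^{(t)})\log(1 - (h_\Theta(x^{(t)}))_k) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2\]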
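If we consider simple non-multiclass classification (K = 1) and disregard regularization, the cost for a single training example t is computed with:
\[cost(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1 - y^{(t)})\log(1 - h_\Theta(x^{(t)}))\]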
Intuitively, \(\delta_j^{(l)}\) is the "error" for \(a^{(l)}_j\) (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:
\[\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)\]
Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some \(\delta_j^{(l)}\):
In the image above, to calculate \(\delta_2^{(2)}\), we multiply the weights \(\Theta_{12}^{(2)}\) and \(\Theta_{22}^{(2)}\) by their respective \(\delta\) values found to the right of each edge. So we get \(\delta_2^{(2)}\) = \(\Theta_{12}^{(2)}\) * \(\delta_1^{(3)}\) + \(\Theta_{22}^{(2)}\) * \(\delta_2^{(3)}\). To calculate every single possible \(\delta_j^{(l)}\), we could start from the right of our diagram. We can think of our edges as our \(\Theta_{ij}\). Going from right to left, to calculate the value of \(\delta_j^{(l)}\), you can just take the overall sum of each weight times the \(\delta\) it is coming from. Hence, another example would be \(\delta_2^{(3)}\) = \(\Theta_{12}^{(3)}\) * \(\delta_1^{(4)}\).
2 Backpropagation in Practice
2.1 Implementation Note: Unrolling Parameters
With neural networks, we are working with sets of matrices:
\[\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots\]
\[D^{(1)}, D^{(2)}, D^{(3)}, \dots\]
In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]
If the dimensions of Theta1 are 10x11, Theta2 are 10x11 and Theta3 are 1x11, then we can get back our original matrices from the "unrolled" versions as follows:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
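A quick way to convince yourself that unrolling and reshaping are inverses is a round trip. The sketch below fills the three matrices with the numbers 1 to 231 (placeholder values chosen only so each entry is easy to recognize), unrolls them, reshapes them back, and compares. The names R1, R2, R3 and roundTripOk are made up for this check.

```octave
% Sketch: unroll three weight matrices into one vector and recover them.
% The entries 1..231 are placeholder values, not real weights.
Theta1 = reshape(1:110, 10, 11);
Theta2 = reshape(111:220, 10, 11);
Theta3 = reshape(221:231, 1, 11);

% Unroll: (:) stacks each matrix column by column.
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];

% Recover: reshape fills column by column, so the original layout comes back.
R1 = reshape(thetaVector(1:110), 10, 11);
R2 = reshape(thetaVector(111:220), 10, 11);
R3 = reshape(thetaVector(221:231), 1, 11);

roundTripOk = isequal(R1, Theta1) && isequal(R2, Theta2) && isequal(R3, Theta3);
fprintf('length = %d, round trip ok = %d\n', numel(thetaVector), roundTripOk);
```

The index ranges have to follow the element counts exactly (110, 110 and 11 here); an off-by-one in a range either fails in reshape or silently shifts every weight.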
To summarize:
2.2 Gradient Checking
Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:
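\[\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}\]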
With multiple theta matrices, we can approximate the derivative with respect to \(Θ_j\) as follows:
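\[\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}\]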
A small value for \({\epsilon}\) (epsilon) such as \({\epsilon = 10^{-4}}\) usually gives a good approximation. If the value for ϵ is too small, we can end up with numerical problems.
Hence, we add or subtract epsilon to only one parameter \(\Theta_j\) at a time. In Octave we can do it as follows:
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;
We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
Once you have verified once that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.
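Here is a self-contained sketch of the check. To keep it short it does not use a neural network: J is a stand-in cost function whose gradient is known in closed form, so the function handle grad plays the role that deltaVector plays for backpropagation. Both J and grad, the test point theta, and the relative-difference measure relDiff are assumptions made for this example.

```octave
% Sketch: gradient checking against a known analytic gradient.
% J and grad are stand-ins; with a network, grad would be the unrolled D matrices.
J    = @(theta) sum(theta .^ 3) + theta(1) * theta(2);
grad = @(theta) 3 * theta .^ 2 + [theta(2); theta(1); 0];

theta = [0.5; -1.2; 2.0];
epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % perturb one parameter only
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

analytic = grad(theta);
% Relative difference: small when the two gradients agree.
relDiff = norm(gradApprox - analytic) / norm(gradApprox + analytic);
fprintf('relative difference: %g\n', relDiff);
```

A relative difference that is many orders of magnitude below 1 suggests the analytic gradient is right; a value near 1 usually means a bug. The loop calls J twice per parameter, which is why this is too slow to leave on during training.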
2.3 Random Initialization
When you're running gradient descent, or one of the advanced optimization algorithms, you need to pick some initial value for the parameters theta. The advanced optimization algorithms assume you will pass them an initial value for theta:
optTheta = fminunc(@costFunction, initialTheta, options)
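Is it possible to set the initial value of theta to the vector of all zeros? This worked okay when we were using logistic regression, but initializing all of your parameters to zero does not work when you are training a neural network.
A neural network is many functions composed together to form a nonlinear function that makes predictions. Each node in a hidden layer corresponds to a different function, that is, to a different set of parameters. For example, a hidden layer with 100 nodes computes 100 different functions.
Once all the initial parameters are identical, every hidden node computes the same function. The whole network shrinks from 100 nodes to effectively one node: in both forward and backward propagation there is only one function, and every update is the same for all nodes.
Also note what the teacher said: gradient descent on a neural network often ends at a local optimum rather than the global optimum, which means the cost function is not a convex function. This is intriguing. The cost function of logistic regression and that of a neural network are basically the same, but the neural network's cost function nests several logistic regression functions inside it. The logistic regression cost is a convex function; the neural network's is not.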
Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:
Hence, we initialize each \(\Theta^{(l)}_{ij}\) to a random value between \([-\epsilon,\epsilon]\). Using the above formula guarantees that we get the desired bound. The same procedure applies to all the \(\Theta\)'s. Below is some working code you could use to experiment.
Assume the dimensions of Theta1 are 10x11, Theta2 are 10x11 and Theta3 are 1x11:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0 and 1.
(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
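The sketch below tries both ideas on a toy problem. It first checks that the rand-based formula keeps every weight inside [-INIT_EPSILON, INIT_EPSILON], then trains a small network from all-zero weights and measures how far apart the hidden units' weight rows end up. The value INIT_EPSILON = 0.12, the logical-OR data, the 3 hidden units, the learning rate and the names Z1, Z2, rowSpread are all assumptions for illustration.

```octave
% Sketch: random initialization range, and what zero initialization does.
% INIT_EPSILON = 0.12 is an assumed illustration value.
sigmoid = @(z) 1 ./ (1 + exp(-z));
INIT_EPSILON = 0.12;

% Random initialization: every entry lands in [-INIT_EPSILON, INIT_EPSILON].
Theta1 = rand(3, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 4) * (2 * INIT_EPSILON) - INIT_EPSILON;
inRange = all(abs([Theta1(:); Theta2(:)]) <= INIT_EPSILON);

% Zero initialization on a toy problem (logical OR) with 3 hidden units.
X = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 1];
m = size(X, 1);
Z1 = zeros(3, 3);                  % hidden layer weights, all zero
Z2 = zeros(1, 4);                  % output layer weights, all zero
alpha = 1;                         % learning rate (assumed)
A1 = [ones(m, 1) X];

for iter = 1:50
  A2 = [ones(m, 1) sigmoid(A1 * Z1')];
  H  = sigmoid(A2 * Z2');
  d3 = H - y;                              % output-layer errors, m x 1
  d2 = (d3 * Z2) .* A2 .* (1 - A2);        % hidden-layer errors, m x 4
  G2 = (d3' * A2) / m;                     % gradient for Z2
  G1 = (d2(:, 2:end)' * A1) / m;           % gradient for Z1 (bias entry dropped)
  Z1 = Z1 - alpha * G1;
  Z2 = Z2 - alpha * G2;
end

% Largest gap between any hidden unit's weights and the first hidden unit's.
rowSpread = max(max(abs(Z1 - repmat(Z1(1, :), 3, 1))));
fprintf('in range: %d, row spread after training from zeros: %g\n', inRange, rowSpread);
```

After training from zeros the rows of Z1 have moved away from zero but are still (numerically) the same as each other, so the three hidden units compute one and the same function.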
2.4 Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
Number of input units = dimension of features \(x^{(i)}\)
Number of output units = number of classes
Number of hidden units per layer = usually the more the better (must balance with cost of computation as it increases with more hidden units)
Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
Training a Neural Network
Randomly initialize the weights
Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
When we perform forward and back propagation, we loop on every training example:
for i = 1:m,
  % Perform forward propagation and backpropagation using example (x(i), y(i))
  % (get activations a(l) and delta terms d(l) for l = 2,...,L)
end;
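The training steps can be strung together on a toy problem. The sketch below is a minimal illustration and not the course's reference code: the logical-OR data, the 2 hidden units, the fixed starting weights (standing in for rand so the run is repeatable), the learning rate and the iteration count are all assumptions. It handles all examples at once instead of looping over them, leaves out regularization, and skips the gradient-checking step.

```octave
% Sketch: forward propagation, cost, backpropagation and gradient descent
% on a toy problem (logical OR). All values are illustration choices.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 1];
m = size(X, 1);

% Initialize weights: small, unequal fixed values stand in for rand().
Theta1 = [0.10 -0.05 0.08; -0.07 0.11 -0.02];   % 2 hidden units
Theta2 = [0.03 -0.09 0.06];                     % 1 output unit

alpha = 0.5;                       % learning rate (assumed)
numIters = 500;
J_hist = zeros(numIters, 1);
A1 = [ones(m, 1) X];

for iter = 1:numIters
  % Forward propagation and unregularized cost
  A2 = [ones(m, 1) sigmoid(A1 * Theta1')];
  H  = sigmoid(A2 * Theta2');
  J_hist(iter) = -(1 / m) * sum(y .* log(H) + (1 - y) .* log(1 - H));

  % Backpropagation for all examples at once
  d3 = H - y;
  d2 = (d3 * Theta2) .* A2 .* (1 - A2);
  D2 = (d3' * A2) / m;
  D1 = (d2(:, 2:end)' * A1) / m;   % drop the bias unit's column of d2

  % Gradient descent update
  Theta1 = Theta1 - alpha * D1;
  Theta2 = Theta2 - alpha * D2;
end

fprintf('cost at first iteration: %.4f, at last iteration: %.4f\n', J_hist(1), J_hist(end));
```

Plotting J_hist against the iteration number is a simple way to watch the cost go down as gradient descent proceeds.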
The following image gives us an intuition of what is happening as we are implementing our neural network:
By the way, for neural networks this cost function J(Θ) is non-convex, so it can theoretically be susceptible to local minima. In fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. It turns out that in practice this is not usually a huge problem: even though we can't guarantee that these algorithms will find a global optimum, algorithms like gradient descent will usually do a very good job of minimizing J(Θ) and reach a very good local minimum, even if they don't get to the global optimum. Finally, gradient descent for a neural network might still seem a little bit magical, so let me show one more figure to build intuition about what gradient descent for a neural network is doing.
This was actually similar to the figure that I was using earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network. Right here I've just written down two of the parameter values.
So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill.
And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum.
So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing. It's trying to find a value of the parameters where the output values of the neural network closely match the values of the y(i)'s observed in your training set. So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together.
Ideally, you want \(h_\Theta(x^{(i)}) \approx y^{(i)}\). This will minimize our cost function. However, keep in mind that \(J(\Theta)\) is not convex and thus we can end up in a local minimum instead.
3 Autonomous Driving
In this video, I'd like to show you a fun and historically important example of neural network learning: using a neural network for autonomous driving, that is, getting a car to learn to drive itself.
The video that I'll show in a minute is something I got from Dean Pomerleau, a colleague who works at Carnegie Mellon University, out on the east coast of the United States. In part of the video you see visualizations like this, and I want to explain what the visualization shows before starting the video.
Down here on the lower left is the view seen by the car of what's in front of it. And so here you kinda see a road that's maybe going a bit to the left, and then going a little bit to the right.
Up here on top, this first horizontal bar shows the direction selected by the human driver. The location of this bright white band shows the steering direction selected by the human driver, where far to the left corresponds to steering hard left and far to the right corresponds to steering hard right. So this location, which is a little bit left of center, means that the human driver at this point was steering slightly to the left. This second bar corresponds to the steering direction selected by the learning algorithm, and again the location of the white band means that the neural network was here selecting a steering direction that's slightly to the left.
In fact, before the neural network starts learning, you see that the network outputs a uniform grey band throughout this region. This uniform grey fuzz corresponds to the neural network having been randomly initialized and initially having no idea how to drive the car, or what direction to steer in. It is only after it has learned for a while that it starts to output a solid white band in just a small part of the region, corresponding to choosing a particular steering direction. That is when the neural network becomes more confident in selecting a band in one particular location: rather than outputting a light grey fuzz, it outputs a white band that more consistently selects one steering direction.
Neural Network-Based Autonomous Driving. 1992 11 23
ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control the NAVLAB 2, a modified Army Humvee equipped with sensors, computers, and actuators for autonomous navigation experiments.
The initial step in configuring ALVINN is training a network to steer. During training, a person drives the vehicle while ALVINN watches. Once every two seconds, ALVINN digitizes a video image of the road ahead, and records the person's steering direction.
This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layered network. Using the back propagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image.
Initially the network's steering response is random. After about two minutes of training the network learns to accurately imitate the steering reactions of the human driver. This same training procedure is repeated for other road types. After the networks have been trained, the operator pushes the run switch and ALVINN begins driving.
Twelve times per second, ALVINN digitizes the image and feeds it to its neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response.
The steering direction from the most confident network, in this case the network trained for the one-lane road, is used to control the vehicle.
Suddenly an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises.
When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.
So that was autonomous driving using a neural network. Of course, there are more recent and more modern attempts at autonomous driving. There are a few projects in the US, Europe and elsewhere that are producing more robust driving controllers than this, but I think it's still pretty remarkable and pretty amazing how a simple neural network trained with backpropagation can actually learn to drive a car somewhat well.