DTU DeepLearning: exercise 6
- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
From here: The "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD), and its best value is data dependent. The most basic method of hyper-parameter search is a grid search over the learning rate and batch size to find a pair that makes the network converge.
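As a rough illustration of that kind of search (a minimal, self-contained sketch on a toy linear least-squares model, not part of the exercise), a grid search could look like this:

```python
# Minimal sketch of a grid search over learning rate and batch size,
# using a toy linear least-squares model so the script runs on its own.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1200)
X_train, y_train = X[:1000], y[:1000]
X_val, y_val = X[1000:], y[1000:]

def train_and_evaluate(learning_rate, batch_size, epochs=5):
    """Train with plain mini-batch SGD and return the validation MSE."""
    theta = np.zeros(X_train.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(len(X_train))
        for start in range(0, len(X_train), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X_train[idx], y_train[idx]
            theta -= learning_rate * (Xb.T @ (Xb @ theta - yb)) / len(yb)
    return np.mean((X_val @ theta - y_val) ** 2)

best = None
for lr, bs in itertools.product([1e-1, 1e-2, 1e-3], [16, 64, 256]):
    val_loss = train_and_evaluate(lr, bs)
    if best is None or val_loss < best[0]:
        best = (val_loss, lr, bs)

print("best (val_loss, learning_rate, batch_size):", best)
```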
To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Let x be our training set, let m ⊆ x be a randomly sampled mini-batch, and let the batch size B be the cardinality of m: B = |m|. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types [2]:

θ_{t+1} = θ_t − ε(t) · (1/B) · Σ_{x_i ∈ m} ∇L(θ_t, x_i)
- Batch gradient descent: B = |x|
- Online stochastic gradient descent: B = 1
- Mini-batch stochastic gradient descent: 1 < B < |x|
Note that in the first case (B = |x|), the mini-batch is the whole training set, so the loss is no longer a random variable and the gradient is not a stochastic approximation.
SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at only a randomly selected mini-batch m, rather than the entire training set x.
Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using the average of the gradients for a mini-batch m. (Using the average rather than the sum keeps the step size independent of how many examples are in the batch; otherwise you would need to adjust the learning rate whenever you changed the batch size.) The expected value of this stochastic approximation of the gradient used in SGD equals the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
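A quick numpy check of the idea behind that identity (an illustration added here, not from the original answer): the full-batch gradient is exactly the average of the per-mini-batch gradients over a partition of the data, which is why a uniformly sampled mini-batch gives an unbiased estimate.

```python
# Verify that averaging mini-batch gradients over a partition of the dataset
# recovers the full-batch gradient, for a toy least-squares model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # 1000 toy examples, 5 features
y = rng.normal(size=1000)
theta = rng.normal(size=5)          # weights of a linear model

def grad(theta, Xb, yb):
    """Gradient of the mean squared error 0.5 * mean((X @ theta - y)^2)."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full_grad = grad(theta, X, y)

B = 100                              # batch size; 1000 / 100 = 10 mini-batches
batch_grads = [grad(theta, X[i:i + B], y[i:i + B]) for i in range(0, len(X), B)]
assert np.allclose(np.mean(batch_grads, axis=0), full_grad)
```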
The subset of examples we sample for a single weight update is called a mini-batch. Each full pass through the entire dataset is called an epoch.
Let's say that we have some data vector x ∈ R^D, an initial weight vector that parameterizes our neural network, θ_0 ∈ R^S, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C = ⌈T / B⌉
For simplicity we can assume that T is evenly divisible by B. When it is not (as is often the case), each mini-batch should be given a weight proportional to its size.
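As a concrete illustration of that splitting and weighting (a minimal sketch with arbitrary numbers, not from the exercise):

```python
import math

T = 1050               # number of training examples (illustrative)
B = 100                # batch size
C = math.ceil(T / B)   # number of mini-batches per epoch -> 11

# Slice the example indices into C mini-batches; the last one may be smaller.
batches = [range(i * B, min((i + 1) * B, T)) for i in range(C)]

# One way to weight an uneven batch: scale its averaged gradient by its
# relative size so every example still contributes equally over the epoch.
weights = [len(b) / B for b in batches]   # last weight is 0.5 here
```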
An iterative algorithm for SGD with M epochs is given below:
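A minimal numpy sketch of such a loop, using a toy linear least-squares model (all names and values here are illustrative, not the exercise's reference implementation):

```python
# Mini-batch SGD for M epochs on a toy least-squares problem.
import numpy as np

def grad(theta, Xb, yb):
    """Gradient of the mean squared error over a mini-batch."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

def sgd(X, y, theta, epsilon, B, M, seed=0):
    """epsilon is a learning-rate schedule: a function of the epoch number."""
    rng = np.random.default_rng(seed)
    for epoch in range(M):
        # Shuffle once per epoch, then read the data in order (see the note below).
        perm = rng.permutation(len(X))
        X, y = X[perm], y[perm]
        for start in range(0, len(X), B):
            Xb, yb = X[start:start + B], y[start:start + B]
            theta = theta - epsilon(epoch) * grad(theta, Xb, yb)
    return theta

# Example usage with a constant learning-rate schedule.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
theta = sgd(X, y, theta=np.zeros(5), epsilon=lambda epoch: 0.05, B=50, M=20)
```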
Note: in real life we're reading the training examples from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.
The major parameters for the vanilla (no momentum) SGD described above are:
- Learning Rate: ϵ
I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.
If you want the learning rate fixed, just define epsilon as a constant function (see the schedule sketch after this list).
- Batch Size
Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step (see the gradient-noise sketch after this list).
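A sketch of the schedule idea (illustrative code, not from the exercise): a schedule is just a function mapping the epoch number to a learning rate.

```python
# Two example learning-rate schedules: a constant one and a step decay.
def constant_schedule(epoch, lr=0.05):
    return lr

def step_decay_schedule(epoch, lr=0.1, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr * drop ** (epoch // every)
```

And a small numpy illustration of the batch-size trade-off (again illustrative, on a toy least-squares model): smaller batches give noisier gradient estimates, at lower cost per step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = rng.normal(size=10_000)
theta = rng.normal(size=5)

def grad(theta, Xb, yb):
    """Gradient of the mean squared error over a batch."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full = grad(theta, X, y)

def gradient_noise(batch_size, trials=200):
    """Average distance between a mini-batch gradient and the full gradient."""
    dists = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        dists.append(np.linalg.norm(grad(theta, X[idx], y[idx]) - full))
    return np.mean(dists)

for B in (1, 10, 100, 1000):
    print(B, gradient_noise(B))   # the noise shrinks roughly like 1/sqrt(B)
```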
Citations & Further Reading: