Does batch_size have any effect on result quality? How do you set the optimal batch size and number of iterations?
1. The batch size, a: the number of training examples fed into the neural network at once.
2. The number of iterations, b: how many times we feed a batch into the network.
a*b = the number of training examples in one epoch (a worked example follows the list below).
  • Batch Gradient Descent. Batch Size = Size of Training Set
  • Stochastic Gradient Descent. Batch Size = 1
  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
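For example (numbers invented for illustration): with 1,000 training examples and a mini-batch size of a = 100, one epoch takes b = 1,000 / 100 = 10 iterations, so a*b = 100 * 10 = 1,000 matches the number of training examples.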

From here: The "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD), and it is data dependent. The most basic method of hyper-parameter search is to do a grid search over the learning rate and batch size to find a pair which makes the network converge.
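As a rough illustration of that grid search, here is a minimal Python sketch. The train_and_evaluate helper and the candidate values are assumptions invented for the example, not anything prescribed above; in practice the helper would run mini-batch SGD with the given settings and return a validation loss.

    import random

    def train_and_evaluate(learning_rate, batch_size):
        """Hypothetical stand-in: train the network with these settings
        and return a validation loss (here just a placeholder value)."""
        return random.random()

    learning_rates = [1e-1, 1e-2, 1e-3]   # assumed candidate values
    batch_sizes = [16, 64, 256]           # assumed candidate values

    best = None
    for lr in learning_rates:
        for bs in batch_sizes:
            val_loss = train_and_evaluate(lr, bs)
            if best is None or val_loss < best[0]:
                best = (val_loss, lr, bs)

    print("best (val_loss, learning_rate, batch_size):", best)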

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types [2] (θ are the weights, ϵ(t) is the learning rate, m is the mini-batch, and B = |m|, all defined below):

θ(t+1) = θ(t) − ϵ(t) · (1/B) · Σ_{x_i ∈ m} ∂L(θ, x_i)/∂θ

  1. Batch gradient descent, B=|x|
  2. Online stochastic gradient descent: B=1
  3. Mini-batch stochastic gradient descent: B>1 but B<|x|.

Note that in case 1, batch gradient descent, the loss is no longer a random variable and its gradient is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m⊂x. The batch size B is just the cardinality of m: B=|m|.

Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
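A small NumPy sketch of that equality (the toy linear model, the squared-error loss, and all numbers are invented for illustration): averaging the gradient over many random mini-batches gives, up to sampling noise, the same vector as the gradient over the full dataset.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))            # full dataset x (invented)
    y = X @ np.array([1.0, -2.0, 0.5])        # targets for a toy linear model
    theta = np.zeros(3)                       # current weights

    def grad(theta, Xb, yb):
        """Averaged squared-error gradient over the examples (Xb, yb)."""
        return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

    full_grad = grad(theta, X, y)             # deterministic batch gradient

    B = 32
    estimates = []
    for _ in range(2000):                     # many random mini-batches
        idx = rng.choice(len(X), size=B, replace=False)
        estimates.append(grad(theta, X[idx], y[idx]))

    print("full gradient     :", full_grad)
    print("avg minibatch grad:", np.mean(estimates, axis=0))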

Each sample we take in order to update the weights is called a mini-batch. Each full run through the entire dataset is called an epoch.

Let's say that we have some data vector x ∈ R^D, an initial weight vector that parameterizes our neural network, θ_0 ∈ R^S, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C=⌈T/B⌉ 

For simplicity we can assume that T is evenly divisible by B. When this is not the case, as it often is not, each mini-batch should be given a weight that is a function of its size.
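A small sketch of that split (NumPy, a shuffled index array, and a weight proportional to each mini-batch's size are all choices made for the example; weighting proportionally to size is one reasonable convention, not the only one):

    import numpy as np

    T, B = 10, 4                             # T deliberately not divisible by B
    idx = np.random.permutation(T)           # shuffled example indices
    C = int(np.ceil(T / B))                  # C = ceil(T / B) mini-batches

    batches = [idx[i * B:(i + 1) * B] for i in range(C)]
    weights = [len(b) / B for b in batches]  # ragged last batch gets weight < 1

    for b, w in zip(batches, weights):
        print(len(b), w)                     # prints: 4 1.0, 4 1.0, 2 0.5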

An iterative algorithm for SGD with M epochs is given below:
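(The sketch below is a stand-in written in NumPy: the linear least-squares loss, the constant learning-rate schedule, and every numeric value are assumptions made so the example runs; the structure, shuffle, walk over the mini-batches, apply the averaged-gradient update, is what the surrounding text describes.)

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))              # training set x (invented)
    y = X @ np.array([1.0, -2.0, 0.5])          # targets for a toy linear model
    theta = np.zeros(3)                         # initial weights theta_0

    def grad(theta, Xb, yb):
        """Averaged squared-error gradient over one mini-batch."""
        return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

    def epsilon(t):
        """Learning rate schedule eps(t); constant here for simplicity."""
        return 0.05

    B, M = 32, 5                                # batch size and number of epochs
    for t in range(M):                          # one outer iteration = one epoch
        idx = rng.permutation(len(X))           # shuffle the dataset
        for start in range(0, len(X), B):       # walk over the mini-batches
            mb = idx[start:start + B]
            theta = theta - epsilon(t) * grad(theta, X[mb], y[mb])

    print("learned weights:", theta)            # approaches [1, -2, 0.5]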

Note: in real life we're reading these training example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: ϵ

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ(t):N→R 

If you want to have the learning rate fixed, just define epsilon as a constant function.
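For instance (the step-decay values below are assumptions for illustration, not a recommendation), both a constant schedule and a decaying schedule are just ordinary functions of the epoch count:

    def epsilon_constant(t, eps0=0.01):
        """Constant schedule: same learning rate at every epoch."""
        return eps0

    def epsilon_step_decay(t, eps0=0.1, drop=0.5, every=10):
        """Step decay: multiply by `drop` every `every` epochs (assumed values)."""
        return eps0 * (drop ** (t // every))

    print([epsilon_step_decay(t) for t in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]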

  2. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization
 
From here: when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(√m).
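The √m factor is just the standard error of the mean: if the per-example gradients have standard deviation σ, the averaged mini-batch gradient has standard deviation σ/√m, so quadrupling the batch size quadruples the computation per step but only halves the gradient noise.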
 
 
