Does batch_size have any effect on result quality? How do you set the optimal batch size and number of iterations?
1. the batch size, a: the number of training examples fed into the neural network at once.
2. the number of iterations, b: how many times we feed a batch into the NN per epoch.
a * b = the number of training examples seen in one epoch (a worked example follows the list below).
  • Batch Gradient Descent: Batch Size = Size of Training Set
  • Stochastic Gradient Descent: Batch Size = 1
  • Mini-Batch Gradient Descent: 1 < Batch Size < Size of Training Set
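
To make the a * b relation concrete (numbers chosen purely for illustration): with 10,000 training examples, a batch size of a = 100 gives b = 100 iterations per epoch; a = 1 is stochastic gradient descent (10,000 updates per epoch), and a = 10,000 is batch gradient descent (a single update per epoch).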

From here: The "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you tune when training a neural network with mini-batch stochastic gradient descent (SGD), and it is data dependent. The most basic method of hyper-parameter search is a grid search over the learning rate and batch size to find a pair that makes the network converge.

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]
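
In the notation defined below (mini-batch m of size B, learning rate ϵ), the standard form of that update is

θ ← θ − ϵ(t) · (1/B) · Σ_{xi ∈ m} ∇L(θ, xi)

and the three regimes differ only in the choice of B: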

  1. Batch gradient descent: B = |x|
  2. Online stochastic gradient descent: B = 1
  3. Mini-batch stochastic gradient descent: 1 < B < |x|

Note that in case 1 (batch gradient descent), the loss is no longer a random variable and the gradient is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m⊂x. The batch size B is just the cardinality of m: B=|m|.

Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using the average of the gradients over a mini-batch m. (Using the average rather than the sum keeps the step size independent of how many examples are in the batch; otherwise you would need to adjust the learning rate based on the batch or dataset size.) The expected value of this stochastic approximation of the gradient equals the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
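
A quick numerical check of that last identity on a toy squared-error loss (a standalone sketch; the data, the grad function, and all numbers are made up for illustration):

```python
import numpy as np

# Check E[∇L_SGD(θ, m)] = ∇L(θ, x) on a toy least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # 1000 examples, 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
theta = np.zeros(5)

def grad(theta, Xb, yb):
    """Average gradient of the squared error over a (mini-)batch."""
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

full = grad(theta, X, y)                            # deterministic batch gradient, B = |x|
minis = [grad(theta, X[idx], y[idx])                # many random mini-batch gradients, B = 32
         for idx in rng.choice(1000, size=(2000, 32))]
print(np.linalg.norm(np.mean(minis, axis=0) - full) / np.linalg.norm(full))
# small relative error: the mini-batch gradient matches the batch gradient in expectation
```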

Each sample we draw to update the weights is called a mini-batch; each full pass through the entire dataset is called an epoch.

Let's say that we have a data vector x ∈ R^D, an initial weight vector θ0 ∈ R^S that parameterizes our neural network, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C=⌈T/B⌉ 

For simplicity, we can assume that T is evenly divisible by B. When it is not (which is often the case), each mini-batch should be given a weight proportional to its size, as sketched below.
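
A tiny sketch of that bookkeeping, with hypothetical numbers (T, B, and the s/B weighting are illustrative choices, not from the source):

```python
import math

T, B = 1050, 100
C = math.ceil(T / B)                            # C = ⌈T/B⌉ = 11 mini-batches
sizes = [min(B, T - i * B) for i in range(C)]   # ten batches of 100 and one of 50
weights = [s / B for s in sizes]                # weight each batch by its relative size
print(C, sizes[-1], weights[-1])                # -> 11 50 0.5
```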

An iterative algorithm for SGD with M epochs is given below:
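
A minimal sketch of such a loop, assuming NumPy and placeholder names (sgd, grad_fn, lr_schedule are all hypothetical, not the author's code):

```python
import numpy as np

def sgd(theta, X, y, grad_fn, lr_schedule, B=32, M=10, rng=None):
    """Vanilla (no momentum) mini-batch SGD: M epochs over (X, y) with batch size B."""
    rng = rng or np.random.default_rng()
    T = len(X)
    for epoch in range(M):
        order = rng.permutation(T)          # shuffle once per epoch (see the note below)
        eps = lr_schedule(epoch)            # learning rate for this epoch
        for start in range(0, T, B):
            idx = order[start:start + B]    # indices of the current mini-batch m
            theta = theta - eps * grad_fn(theta, X[idx], y[idx])  # step along the averaged gradient
    return theta
```

Here grad_fn is expected to return the gradient averaged over the mini-batch, e.g. the toy grad function from the earlier sketch.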

Note: in real life we read the training examples from memory, and because of cache prefetching and other memory tricks done by your computer, the algorithm runs faster when memory accesses are coalesced, i.e. when you read memory in order rather than jumping around randomly. So most SGD implementations shuffle the dataset and then load the examples into memory in the order in which they will be read.

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: ϵ

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ(t):N→R 

If you want to have the learning rate fixed, just define epsilon as a constant function.
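
As a concrete illustration (made-up function names and numbers), both a constant schedule and a simple step decay fit the ϵ(t): N → R signature:

```python
def constant_lr(epoch, eps=0.01):
    """epsilon(t) as a constant function: a fixed learning rate."""
    return eps

def step_decay_lr(epoch, eps0=0.1, drop=0.5, every=10):
    """epsilon(t) that halves the learning rate every 10 epochs (illustrative numbers)."""
    return eps0 * drop ** (epoch // every)
```

Either one could be passed as the lr_schedule argument in the SGD sketch above.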

  2. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization
 
From here: When using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a mini-batch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)).
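
A standalone sketch of that scaling (synthetic numbers, not from the source): the standard deviation of a mini-batch mean shrinks like 1/sqrt(m), so quadrupling the batch size, and hence the compute and memory, only halves the gradient noise.

```python
import numpy as np

rng = np.random.default_rng(1)
per_example = rng.normal(size=100_000)              # stand-in for per-example gradient values
for m in (16, 64, 256):
    means = rng.choice(per_example, size=(2000, m)).mean(axis=1)
    print(m, round(means.std() * np.sqrt(m), 2))    # roughly constant: std of the mean ∝ 1/sqrt(m)
```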
 
 
