Multilayer Perceptron
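Note: This section assumes the reader has already worked through the previous exercise, Classifying MNIST digits using Logistic Regression. It also uses the following Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, L1 and L2 regularization, and floatX. If you intend to run the code on a GPU, also read GPU.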
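The next architecture we present using Theano is the single-hidden-layer multilayer perceptron (MLP). An MLP can be viewed as a logistic regression classifier whose input is first transformed by a learned nonlinear transformation. This transformation projects the input data into a space where the classes become linearly separable. The intermediate layer is called the hidden layer. A single hidden layer is sufficient to make MLPs a universal approximator. However, as we will see later, there are substantial benefits to using many such hidden layers, which is the very premise of deep learning (more than one hidden layer). See these course notes for an introduction to MLPs, the back-propagation algorithm, and how to train MLPs.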
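Formally, a one-hidden-layer MLP is a function $f: R^D \rightarrow R^L$, where $D$ is the size of the input vector $x$ and $L$ is the size of the output vector $f(x)$, such that, in matrix notation:
$$f(x) = G(b^{(2)} + W^{(2)}(s(b^{(1)} + W^{(1)} x)))$$
with bias vectors $b^{(1)}$, $b^{(2)}$; weight matrices $W^{(1)}$, $W^{(2)}$; and activation functions $G$ and $s$.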
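The vector $h(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)} \in R^{D \times D_h}$ is the weight matrix connecting the input vector to the hidden layer. Each column $W^{(1)}_{\cdot i}$ holds the weights from the input units to the i-th hidden unit. Typical choices for $s$ are tanh, with $tanh(a) = (e^a - e^{-a}) / (e^a + e^{-a})$, or the logistic sigmoid function, with $sigmoid(a) = 1 / (1 + e^{-a})$. We use tanh in this tutorial because it typically trains faster (and sometimes also reaches better local minima). Both tanh and sigmoid are scalar-to-scalar functions, but they extend naturally to vectors and tensors by being applied element-wise (e.g. computed independently on each element of a vector, producing a vector of the same size).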
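To make the element-wise behaviour concrete, here is a minimal NumPy sketch (NumPy rather than Theano is an assumption made so the example runs without Theano; the helper name `sigmoid` is illustrative). It also checks the standard identity tanh(a) = 2 * sigmoid(2a) - 1, which shows that the two functions differ only by scaling and shifting.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid, applied element-wise to a NumPy array."""
    return 1.0 / (1.0 + np.exp(-a))


a = np.array([-2.0, 0.0, 2.0])

tanh_a = np.tanh(a)       # same shape as a, values in (-1, 1)
sigmoid_a = sigmoid(a)    # same shape as a, values in (0, 1)

# tanh is a rescaled, recentred sigmoid: tanh(a) = 2 * sigmoid(2a) - 1
identity_holds = np.allclose(tanh_a, 2.0 * sigmoid(2.0 * a) - 1.0)
```

Each output element depends only on the matching input element, which is why the same functions work unchanged on vectors, matrices, and minibatches.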
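The output vector is then obtained as $o(x) = G(b^{(2)} + W^{(2)} h(x))$. The reader should recognize this form from the previous exercise (Theano 3.3: Logistic Regression exercise). As before, class-membership probabilities can be obtained by choosing $G$ to be the softmax function (in the case of multi-class classification).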
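The forward pass described above can be sketched in plain NumPy. This is an illustration under assumptions, not the tutorial's Theano code: the function names `softmax` and `mlp_forward`, the layer sizes, and the random weights are all hypothetical. Inputs are stored as rows (one example per row), so the products are written `x @ W`, matching the `tanh(dot(input, W) + b)` form used in the code's docstring.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: tanh hidden layer followed by a softmax output."""
    h = np.tanh(x @ W1 + b1)      # hidden layer, shape (n_examples, D_h)
    return softmax(h @ W2 + b2)   # class probabilities, shape (n_examples, L)


rng = np.random.RandomState(0)
D, D_h, L = 4, 3, 2               # illustrative sizes only
x = rng.randn(5, D)               # minibatch of 5 examples
W1 = rng.uniform(-0.5, 0.5, size=(D, D_h))
b1 = np.zeros(D_h)
W2 = rng.uniform(-0.5, 0.5, size=(D_h, L))
b2 = np.zeros(L)

probs = mlp_forward(x, W1, b1, W2, b2)
```

Each row of `probs` is a probability distribution over the classes for one input example.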
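To train an MLP, we learn all parameters of the model; here we use Stochastic Gradient Descent with minibatches. The set of parameters to learn is $\theta = \{W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}\}$. The gradients $\partial \ell / \partial \theta$ can be obtained through the back-propagation algorithm (a special case of the chain rule of derivation). Fortunately, since Theano performs automatic differentiation, we do not need to cover how to derive them in this tutorial.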
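For readers who want to see what automatic differentiation saves them from, the following NumPy sketch writes the back-propagation step by hand for a tanh hidden layer with a softmax output and a mean negative log-likelihood loss, then compares one gradient entry against a central finite difference. The function name `loss_and_grads`, the sizes, and the data are assumptions made for illustration; the tutorial itself relies on T.grad instead.

```python
import numpy as np


def loss_and_grads(x, y, W1, b1, W2, b2):
    """Mean negative log-likelihood of a tanh/softmax MLP and its gradients."""
    n = x.shape[0]
    # forward pass
    h = np.tanh(x @ W1 + b1)
    z = h @ W2 + b2
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -log_p[np.arange(n), y].mean()
    # backward pass (chain rule, layer by layer)
    dz = np.exp(log_p)                             # softmax probabilities
    dz[np.arange(n), y] -= 1.0                     # p - one_hot(y)
    dz /= n
    gW2 = h.T @ dz
    gb2 = dz.sum(axis=0)
    da = (dz @ W2.T) * (1.0 - h ** 2)              # tanh'(a) = 1 - tanh(a)^2
    gW1 = x.T @ da
    gb1 = da.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)


rng = np.random.RandomState(0)
x = rng.randn(6, 4)
y = rng.randint(0, 3, size=6)
W1, b1 = rng.randn(4, 5) * 0.1, np.zeros(5)
W2, b2 = rng.randn(5, 3) * 0.1, np.zeros(3)

loss, (gW1, gb1, gW2, gb2) = loss_and_grads(x, y, W1, b1, W2, b2)

# central finite difference on a single weight as a sanity check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
```

The analytic entry `gW1[0, 0]` and the finite-difference estimate `numeric` should agree to several decimal places; a mismatch would indicate a bug in the hand-written backward pass.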
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
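The initial values for the weights of hidden layer i should be uniformly sampled from a symmetric interval that depends on the activation function. For the tanh activation function, results obtained in [Xavier10] show that the interval should be $[-\sqrt{6 / (fan_{in} + fan_{out})}, \sqrt{6 / (fan_{in} + fan_{out})}]$, where $fan_{in}$ is the number of units in the (i-1)-th layer and $fan_{out}$ is the number of units in the i-th layer. For the sigmoid function the interval is $[-4\sqrt{6 / (fan_{in} + fan_{out})}, 4\sqrt{6 / (fan_{in} + fan_{out})}]$. This initialization ensures that, early in training, each neuron operates in the region of its activation function where the output changes most, so that information propagates easily both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs):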
        # `W` is initialized with `W_values`, which is uniformly sampled
        # from the interval [-sqrt(6./(n_in+n_out)), sqrt(6./(n_in+n_out))]
        # for the tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other functions, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
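The same initialization rule can be tried outside Theano. The sketch below is a NumPy-only illustration; the helper `init_hidden_weights` and its string `activation` argument are hypothetical names introduced here, not part of the tutorial code. It samples from the symmetric interval with bound sqrt(6 / (n_in + n_out)) and widens the bound by a factor of 4 for the sigmoid case.

```python
import numpy as np


def init_hidden_weights(rng, n_in, n_out, activation="tanh"):
    """Sample a (n_in, n_out) weight matrix uniformly from [-bound, bound]."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    if activation == "sigmoid":
        bound *= 4.0              # 4x wider interval suggested for sigmoid
    return rng.uniform(low=-bound, high=bound, size=(n_in, n_out))


rng = np.random.RandomState(1234)
W_tanh = init_hidden_weights(rng, 784, 500)
W_sigmoid = init_hidden_weights(rng, 784, 500, activation="sigmoid")

tanh_bound = np.sqrt(6.0 / (784 + 500))
```

With 784 inputs and 500 hidden units the tanh bound is roughly 0.068, so all initial weights are small and the units start out near the steep, approximately linear part of tanh.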
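In terms of the theory, this class implements the graph that computes the hidden layer value $h(x) = s(b^{(1)} + W^{(1)} x)$. If you give this graph as input to the LogisticRegression class implemented in the previous tutorial, you get the output of the MLP. The following short implementation of the MLP class shows this: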
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
    Intermediate layers usually have as activation function tanh or the
    sigmoid function (defined here by a ``HiddenLayer`` class) while the
    top layer is a softmax layer (defined here by a ``LogisticRegression``
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
        architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
        which the datapoints lie

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
        which the labels lie

        """

        # Since we are dealing with a one hidden layer MLP, this will translate
        # into a HiddenLayer with a tanh activation function connected to the
        # LogisticRegression layer; the activation function can be replaced by
        # sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
        # of the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
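In this tutorial we also use L1 and L2 regularization (see L1 and L2 regularization). For this, we need to compute the L1 norm and the squared L2 norm of the weights $W^{(1)}$, $W^{(2)}$: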
        # L1 norm ; one regularization option is to enforce L1 norm to
        # be small
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it
        # is made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
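As a small numeric illustration of the two regularization terms, the NumPy sketch below computes the L1 norm and the squared L2 norm for two tiny weight matrices and adds them to a cost. The matrices and the value `nll = 0.3` are made-up numbers for illustration; the coefficients `L1_reg = 0.00` and `L2_reg = 0.0001` are the defaults of `test_mlp` in this tutorial.

```python
import numpy as np

# toy weight matrices standing in for the hidden and output layer weights
W_hidden = np.array([[1.0, -2.0],
                     [0.5, 0.0]])
W_out = np.array([[-1.0],
                  [3.0]])

# L1 norm: sum of absolute values over both layers
L1 = np.abs(W_hidden).sum() + np.abs(W_out).sum()

# squared L2 norm: sum of squares over both layers
L2_sqr = (W_hidden ** 2).sum() + (W_out ** 2).sum()

# regularized cost with a hypothetical negative log-likelihood value
nll = 0.3
L1_reg, L2_reg = 0.00, 0.0001
cost = nll + L1_reg * L1 + L2_reg * L2_sqr
```

Note that only the weight matrices are penalized; the bias vectors are left out of both norms, as in the class above.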
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )

    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of the same size, where
    # each element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
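The update rule built with `zip` above can be seen in isolation with concrete numbers. This NumPy sketch uses hypothetical parameter and gradient values; it applies `param - learning_rate * gparam` to each pair, which is what one SGD step does once Theano has evaluated the gradients.

```python
import numpy as np

learning_rate = 0.01

# hypothetical parameters and their gradients, paired by position
params = [np.array([1.0, 2.0]), np.array([[0.5]])]
gparams = [np.array([10.0, -10.0]), np.array([[1.0]])]

# one SGD step: move each parameter against its gradient
new_params = [
    param - learning_rate * gparam
    for param, gparam in zip(params, gparams)
]
```

Here the first parameter moves from [1.0, 2.0] to [0.9, 2.1] and the second from 0.5 to 0.49. In the Theano version the pairs are (shared variable, update expression) and the assignment happens inside the compiled `train_model` function.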
"""
This tutorial introduces the multilayer perceptron using Theano.

A multilayer perceptron is a logistic regressor where
instead of feeding the input to the logistic regression you insert an
intermediate layer, called the hidden layer, that has a nonlinear
activation function (usually tanh or sigmoid). One can use many such
hidden layers making the architecture deep. The tutorial will also tackle
the problem of MNIST digit classification.

.. math::

    f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))),

References:

    - textbooks: "Pattern Recognition and Machine Learning" -
                 Christopher M. Bishop, section 5

"""
__docformat__ = 'restructedtext en'


import os
import sys
import time

import numpy

import theano
import theano.tensor as T


from logistic_sgd import LogisticRegression, load_data


# start-snippet-1
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
        # end-snippet-1

        # `W` is initialized with `W_values`, which is uniformly sampled
        # from the interval [-sqrt(6./(n_in+n_out)), sqrt(6./(n_in+n_out))]
        # for the tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other functions, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]


# start-snippet-2
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
    Intermediate layers usually have as activation function tanh or the
    sigmoid function (defined here by a ``HiddenLayer`` class) while the
    top layer is a softmax layer (defined here by a ``LogisticRegression``
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
        architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
        which the datapoints lie

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
        which the labels lie

        """

        # Since we are dealing with a one hidden layer MLP, this will translate
        # into a HiddenLayer with a tanh activation function connected to the
        # LogisticRegression layer; the activation function can be replaced by
        # sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
        # of the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
        # end-snippet-2 start-snippet-3
        # L1 norm ; one regularization option is to enforce L1 norm to
        # be small
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it
        # is made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
        # end-snippet-3


def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000,
             dataset='mnist.pkl.gz', batch_size=20, n_hidden=500):
    """
    Demonstrate stochastic gradient descent optimization for a multilayer
    perceptron

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
    gradient)

    :type L1_reg: float
    :param L1_reg: L1-norm's weight when added to the cost (see
    regularization)

    :type L2_reg: float
    :param L2_reg: L2-norm's weight when added to the cost (see
    regularization)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file

    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels

    rng = numpy.random.RandomState(1234)

    # construct the MLP class
    classifier = MLP(
        rng=rng,
        input=x,
        n_in=28 * 28,
        n_hidden=n_hidden,
        n_out=10
    )

    # start-snippet-4
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )
    # end-snippet-4

    # compiling a Theano function that computes the mistakes that are made
    # by the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size:(index + 1) * batch_size],
            y: test_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size:(index + 1) * batch_size],
            y: valid_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    # start-snippet-5
    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of the same size, where
    # each element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # end-snippet-5

    ###############
    # TRAIN MODEL #
    ###############
    print '... training'

    # early-stopping parameters
    patience = 10000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is
                           # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience / 2)
                                  # go through this many
                                  # minibatches before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = time.clock()

    epoch = 0
    done_looping = False

    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index

            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i
                                     in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)

                print(
                    'epoch %i, minibatch %i/%i, validation error %f %%' %
                    (
                        epoch,
                        minibatch_index + 1,
                        n_train_batches,
                        this_validation_loss * 100.
                    )
                )

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if (
                        this_validation_loss < best_validation_loss *
                        improvement_threshold
                    ):
                        patience = max(patience, iter * patience_increase)

                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
                    test_losses = [test_model(i) for i
                                   in xrange(n_test_batches)]
                    test_score = numpy.mean(test_losses)

                    print(('     epoch %i, minibatch %i/%i, test error of '
                           'best model %f %%') %
                          (epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))

            if patience <= iter:
                done_looping = True
                break

    end_time = time.clock()
    print(('Optimization complete. Best validation score of %f %% '
           'obtained at iteration %i, with test performance %f %%') %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))
    print >> sys.stderr, ('The code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((end_time - start_time) / 60.))


if __name__ == '__main__':
    test_mlp()
python code/
Optimization complete. Best validation score of 1.690000 % obtained at iteration 2070000, with test performance 1.650000 %
The code for file ran for 97.34m
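On an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, the code runs at approximately 10.3 epochs/minute, and it took 828 epochs to reach a test error of 1.65%. To put the MNIST results into perspective, we recommend that the reader look at this page comparing the results of different algorithms.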
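The number of epochs actually run is decided by the patience-based early stopping in the training loop. The pure-Python sketch below isolates that rule under a simplifying assumption: the model is validated at every iteration, so the input is just one validation loss per iteration. The function name `run_with_patience` and the loss values are hypothetical.

```python
def run_with_patience(validation_losses, patience=4, patience_increase=2,
                      improvement_threshold=0.995):
    """Replay a sequence of validation losses through the patience rule.

    Returns (best_loss, best_iter, stopped_at).
    """
    best_loss = float("inf")
    best_iter = 0
    stopped_at = len(validation_losses) - 1
    for it, loss in enumerate(validation_losses):
        if loss < best_loss:
            # a significant improvement extends how long we keep training
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss = loss
            best_iter = it
        if patience <= it:
            stopped_at = it
            break
    return best_loss, best_iter, stopped_at


losses = [0.5, 0.4, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35]
result = run_with_patience(losses)
```

With these losses the best value 0.3 is found at iteration 2, and training stops at iteration 4 because no later improvement extends the patience. In the tutorial code the same logic runs with `patience = 10000` and validation only every `validation_frequency` minibatches.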
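See Efficient BackProp, by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller.
The two most commonly used activation functions are tanh and sigmoid.
Here $fan_{in}$ is the number of inputs and $fan_{out}$ is the number of hidden units. For the mathematical considerations, see [Xavier10].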
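Decreasing the learning rate over time is often a good idea. A simple rule for doing so is $\mu_0 / (1 + d \times t)$, where $\mu_0$ is the initial rate (usually chosen with the grid search technique described above), $d$ is a so-called "decrease constant" that controls how quickly the learning rate decreases (typically a small positive number, $10^{-3}$ or smaller), and $t$ is the epoch/stage.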
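A decaying schedule of this kind can be sketched in a few lines of Python, assuming the rule has the form mu_0 / (1 + d * t). The function name `decayed_learning_rate` and the sample values are illustrative, not part of the tutorial code.

```python
def decayed_learning_rate(mu_0, d, t):
    """Learning rate at epoch/stage t for initial rate mu_0 and decrease constant d."""
    return mu_0 / (1.0 + d * t)


# initial rate 0.01 with decrease constant 1e-3, sampled at a few epochs
schedule = [decayed_learning_rate(0.01, 1e-3, t) for t in (0, 500, 1000)]
```

The rate starts at `mu_0` for t = 0 and has halved by t = 1 / d, so a smaller `d` gives a slower decay.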
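Section 4.7 details a procedure for choosing a learning rate for each parameter (weight) of the network, and for adapting these rates based on the error of the classifier.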
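This hyperparameter is highly dataset-dependent. Loosely speaking, a more complicated input distribution requires a network with more capacity to model it, and therefore more hidden units (note that the number of weights in a layer, perhaps a more direct measure of capacity, is $D \times D_h$, where $D$ is the number of inputs and $D_h$ is the number of hidden units).
The graph of the number of hidden units vs. generalization performance is U-shaped (i.e. the best trade-off lies at some point in the middle, and the error rises toward both ends).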
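To get a feel for how capacity grows with the hidden layer size, the short Python sketch below counts the hidden-layer weights for the MNIST input size used in this tutorial (28 * 28 inputs). The list of candidate hidden sizes is an assumption chosen for illustration; 500 is the default `n_hidden` of `test_mlp`.

```python
n_in = 28 * 28                      # D: number of input units for MNIST images

# hypothetical candidate sizes for the hidden layer (D_h)
candidate_hidden_sizes = [100, 500, 1000]

# number of weights in the hidden layer: D x D_h
weight_counts = {n_hidden: n_in * n_hidden
                 for n_hidden in candidate_hidden_sizes}
```

The count grows linearly with the number of hidden units, so doubling the hidden layer doubles the number of weights the first layer has to fit.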
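Typical values to try for the L1/L2 regularization parameter $\lambda$ are $10^{-2}, 10^{-3}, \ldots$. In the framework we have described so far, ...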