Theano 3.4 - Exercise: Multilayer Perceptron
Source: http://deeplearning.net/tutorial/mlp.html#mlp
Multilayer Perceptron
Note: this section assumes the reader has already worked through the previous exercise, Classifying MNIST digits using Logistic Regression (http://blog.csdn.net/shouhuxianjian/article/details/46375461). It also uses new Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, L1 and L2 regularization, floatX. If you want to run the code on a GPU, also read GPU.
Note: the code for this section is available for download here.
The next architecture we are going to present using Theano is the single-hidden-layer multilayer perceptron (MLP). An MLP can be viewed as a logistic regression classifier in which the input is first transformed by a learned non-linear transformation $\Phi$. This transformation projects the input data into a space where the classes become linearly separable. The intermediate layer is referred to as the hidden layer. A single hidden layer is already sufficient to make an MLP a universal approximator. Nevertheless, as we will see later, there are substantial benefits to using many such hidden layers, which is the very premise of deep learning (i.e., more than one hidden layer). See the lecture notes: Introduction to MLPs, the back-propagation algorithm, and how to train MLPs.
This tutorial will again tackle the problem of MNIST digit classification.
1. The Model
An MLP (or artificial neural network, ANN) with a single hidden layer can be represented graphically as follows:
Formally, a one-hidden-layer MLP is a function $f: R^D \rightarrow R^L$, where $D$ is the size of the input vector $x$ and $L$ is the size of the output vector $f(x)$, such that, in matrix notation:

$f(x) = G\bigl(b^{(2)} + W^{(2)}\, s(b^{(1)} + W^{(1)} x)\bigr)$

with bias vectors $b^{(1)}$, $b^{(2)}$, weight matrices $W^{(1)}$, $W^{(2)}$ and activation functions $G$ and $s$.

The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)} \in R^{D \times D_h}$ is the weight matrix connecting the input vector to the hidden layer. Each column $W^{(1)}_{\cdot i}$ represents the weights from the input units to the $i$-th hidden unit. Typical choices for $s$ are tanh, with $\tanh(a) = (e^a - e^{-a})/(e^a + e^{-a})$, or the logistic sigmoid, with $\mathrm{sigmoid}(a) = 1/(1 + e^{-a})$. We will use tanh in this tutorial because it typically trains faster (and sometimes reaches better local minima). Both tanh and sigmoid are scalar-to-scalar functions, but their natural extension to vectors and tensors applies them element-wise (i.e., independently to each element, producing a vector or tensor of the same size).

The output vector is then obtained as $o(x) = G(b^{(2)} + W^{(2)} h(x))$. The reader should recognize this form from the previous exercise (Theano 3.3 - Exercise: Logistic Regression); as before, class-membership probabilities can be obtained by choosing $G$ to be the softmax function (in the multi-class case).
To train an MLP, we learn all of its parameters, here using minibatch Stochastic Gradient Descent. The set of parameters to learn is $\theta = \{W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}\}$. The gradients $\partial \ell / \partial \theta$ can be obtained through the backpropagation algorithm (a special case of the chain rule of derivation). Fortunately, since Theano performs automatic differentiation, we do not need to cover the derivation in this tutorial.
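As a tiny, self-contained illustration of what T.grad does (my own sketch, not part of the tutorial code; the variables are made up), a symbolic scalar cost can be differentiated with respect to a shared parameter and turned into an SGD update rule:

import theano
import theano.tensor as T

w = theano.shared(0.5, name='w')      # a toy scalar parameter
x = T.scalar('x')
cost = (w * x - 1.) ** 2              # a toy quadratic cost
g_w = T.grad(cost, w)                 # symbolic derivative d(cost)/dw
sgd_step = theano.function([x], cost, updates=[(w, w - 0.1 * g_w)])

Each call to sgd_step(2.0) returns the current cost and moves w a small step along the negative gradient; the training function compiled later in this tutorial follows the same pattern, just with more parameters.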
2. Going from Logistic Regression to MLP
This tutorial focuses on a single-hidden-layer MLP, so we begin by implementing a class that represents one hidden layer. To construct the MLP we will then only need to stack a logistic regression layer on top of it:
class HiddenLayer(object):
def __init__(self, rng, input, n_in, n_out, W=None, b=None,
activation=T.tanh):
"""
Typical hidden layer of a MLP: units are fully-connected and have
sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
:param activation: Non linearity to be applied in the hidden
layer
"""
self.input = input
The initial values of the weights of hidden layer $i$ should be uniformly sampled from a symmetric interval that depends on the activation function. For the tanh activation function, the results in [Xavier10] show that this interval should be $\left[-\sqrt{6/(fan_{in}+fan_{out})},\ \sqrt{6/(fan_{in}+fan_{out})}\right]$, where $fan_{in}$ is the number of units in the $(i-1)$-th layer and $fan_{out}$ is the number of units in the $i$-th layer. For the sigmoid function the interval is $\left[-4\sqrt{6/(fan_{in}+fan_{out})},\ 4\sqrt{6/(fan_{in}+fan_{out})}\right]$. Early in training, this initialization ensures that each neuron operates in the region of its activation function where the response varies most, so information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs):
# `W` is initialized with `W_values` which is uniformely sampled
# from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
# for tanh activation function
# the output of uniform if converted using asarray to dtype
# theano.config.floatX so that the code is runable on GPU
# Note : optimal initialization of weights is dependent on the
# activation function used (among other things).
# For example, results presented in [Xavier10] suggest that you
# should use 4 times larger initial weights for sigmoid
# compared to tanh
# We have no info for other function, so we use the same as
# tanh.
if W is None:
W_values = numpy.asarray(
rng.uniform(
low=-numpy.sqrt(6. / (n_in + n_out)),
high=numpy.sqrt(6. / (n_in + n_out)),
size=(n_in, n_out)
),
dtype=theano.config.floatX
)
if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
self.b = b
Note that we use a given nonlinear function as the activation function of the hidden layer. By default this is tanh, but in many cases we may want to use something else:
lin_output = T.dot(input, self.W) + self.b
self.output = (
lin_output if activation is None
else activation(lin_output)
)
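As a quick, hypothetical usage sketch (not part of the tutorial code; it assumes numpy, theano and theano.tensor as T are imported as in the full listing of section 3), the finished HiddenLayer can be compiled and evaluated on its own:

rng = numpy.random.RandomState(1234)
x = T.matrix('x')                      # a minibatch of rasterized images
layer = HiddenLayer(rng=rng, input=x, n_in=28 * 28, n_out=500,
                    activation=T.tanh)
get_hidden = theano.function([x], layer.output)   # maps a minibatch to h(x)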
Looking under the hood, this class implements the graph that computes the hidden-layer value $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$. If you give this value as input to the LogisticRegression class, as in the previous tutorial, you obtain the output of the MLP. A short implementation of the MLP class follows:
class MLP(object):
"""Multi-Layer Perceptron Class A multilayer perceptron is a feedforward artificial neural network model
that has one layer or more of hidden units and nonlinear activations.
Intermediate layers usually have as activation function tanh or the
sigmoid function (defined here by a ``HiddenLayer`` class) while the
top layer is a softmax layer (defined here by a ``LogisticRegression``
class).
""" def __init__(self, rng, input, n_in, n_hidden, n_out):
"""Initialize the parameters for the multilayer perceptron :type rng: numpy.random.RandomState
:param rng: a random number generator used to initialize weights :type input: theano.tensor.TensorType
:param input: symbolic variable that describes the input of the
architecture (one minibatch) :type n_in: int
:param n_in: number of input units, the dimension of the space in
which the datapoints lie :type n_hidden: int
:param n_hidden: number of hidden units :type n_out: int
:param n_out: number of output units, the dimension of the space in
which the labels lie """ # Since we are dealing with a one hidden layer MLP, this will translate
# into a HiddenLayer with a tanh activation function connected to the
# LogisticRegression layer; the activation function can be replaced by
# sigmoid or any other nonlinear function
self.hiddenLayer = HiddenLayer(
rng=rng,
input=input,
n_in=n_in,
n_out=n_hidden,
activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
# of the hidden layer
self.logRegressionLayer = LogisticRegression(
input=self.hiddenLayer.output,
n_in=n_hidden,
n_out=n_out
)
In this tutorial we will also use L1 and L2 regularization (see L1 and L2 regularization). For this, we need to compute the L1 norm and the squared L2 norm of the weights $W^{(1)}$ and $W^{(2)}$.
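Written out explicitly (my own summary in the notation above, where $\lambda_1$ and $\lambda_2$ correspond to the L1_reg and L2_reg weights used later), the quantities involved are:

$L1 = \sum_{ij} |W^{(1)}_{ij}| + \sum_{ij} |W^{(2)}_{ij}|$

$L2_{sqr} = \sum_{ij} \bigl(W^{(1)}_{ij}\bigr)^2 + \sum_{ij} \bigl(W^{(2)}_{ij}\bigr)^2$

$cost = NLL(\theta, D) + \lambda_1 \cdot L1 + \lambda_2 \cdot L2_{sqr}$

In code, the two regularization terms are computed as follows: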
# L1 norm ; one regularization option is to enforce L1 norm to
# be small
self.L1 = (
abs(self.hiddenLayer.W).sum()
+ abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
# square of L2 norm to be small
self.L2_sqr = (
(self.hiddenLayer.W ** 2).sum()
+ (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
# log likelihood of the output of the model, computed in the
# logistic regression layer
self.negative_log_likelihood = (
self.logRegressionLayer.negative_log_likelihood
)
# same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layer it is
# made out of
self.params = self.hiddenLayer.params + self.logRegressionLayer.params
As before, we train the model using stochastic gradient descent with mini-batches (MSGD). The difference is that we modify the cost function so that it includes the regularization terms. L1_reg and L2_reg are the hyperparameters controlling the weight of these regularization terms in the total cost. The code that computes the new cost is:
# the cost we minimize during training is the negative log likelihood of
# the model plus the regularization terms (L1 and L2); cost is expressed
# here symbolically
cost = (
classifier.negative_log_likelihood(y)
+ L1_reg * classifier.L1
+ L2_reg * classifier.L2_sqr
)
We then update the parameters of the model with the gradients. This code is almost identical to the one for logistic regression; only the number of parameters differs. To get around this (and write code that works for any number of parameters), we use the list of parameters that the model exposes via params and iterate over it, computing a gradient for each parameter in turn:
# compute the gradient of cost with respect to theta (sotred in params)
# the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of
# same length, zip generates a list C of same size, where each element
# is a pair formed from the two lists :
# C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
updates = [
(param, param - learning_rate * gparam)
for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
# in the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
inputs=[index],
outputs=cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size],
y: train_set_y[index * batch_size: (index + 1) * batch_size]
}
)
3. Putting It All Together
Having covered the basic concepts, writing an MLP class becomes quite easy. The code below shows how this works, along lines similar to our previous logistic regression implementation:
"""
This tutorial introduces the multilayer perceptron using Theano.

A multilayer perceptron is a logistic regressor where
instead of feeding the input to the logistic regression you insert a
intermediate layer, called the hidden layer, that has a nonlinear
activation function (usually tanh or sigmoid) . One can use many such
hidden layers making the architecture deep. The tutorial will also tackle
the problem of MNIST digit classification.

.. math::

    f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))),

References:

    - textbooks: "Pattern Recognition and Machine Learning" -
                 Christopher M. Bishop, section 5

"""
__docformat__ = 'restructedtext en'


import os
import sys
import time

import numpy

import theano
import theano.tensor as T


from logistic_sgd import LogisticRegression, load_data


# start-snippet-1
class HiddenLayer(object):
def __init__(self, rng, input, n_in, n_out, W=None, b=None,
activation=T.tanh):
"""
Typical hidden layer of a MLP: units are fully-connected and have
sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
        # end-snippet-1

        # `W` is initialized with `W_values` which is uniformely sampled
# from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
# for tanh activation function
# the output of uniform if converted using asarray to dtype
# theano.config.floatX so that the code is runable on GPU
# Note : optimal initialization of weights is dependent on the
# activation function used (among other things).
# For example, results presented in [Xavier10] suggest that you
# should use 4 times larger initial weights for sigmoid
# compared to tanh
# We have no info for other function, so we use the same as
# tanh.
if W is None:
W_values = numpy.asarray(
rng.uniform(
low=-numpy.sqrt(6. / (n_in + n_out)),
high=numpy.sqrt(6. / (n_in + n_out)),
size=(n_in, n_out)
),
dtype=theano.config.floatX
)
if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
self.output = (
lin_output if activation is None
else activation(lin_output)
)
# parameters of the model
        self.params = [self.W, self.b]


# start-snippet-2
class MLP(object):
"""Multi-Layer Perceptron Class A multilayer perceptron is a feedforward artificial neural network model
that has one layer or more of hidden units and nonlinear activations.
Intermediate layers usually have as activation function tanh or the
sigmoid function (defined here by a ``HiddenLayer`` class) while the
top layer is a softmax layer (defined here by a ``LogisticRegression``
class).
""" def __init__(self, rng, input, n_in, n_hidden, n_out):
"""Initialize the parameters for the multilayer perceptron :type rng: numpy.random.RandomState
:param rng: a random number generator used to initialize weights :type input: theano.tensor.TensorType
:param input: symbolic variable that describes the input of the
architecture (one minibatch) :type n_in: int
:param n_in: number of input units, the dimension of the space in
which the datapoints lie :type n_hidden: int
:param n_hidden: number of hidden units :type n_out: int
:param n_out: number of output units, the dimension of the space in
which the labels lie """ # Since we are dealing with a one hidden layer MLP, this will translate
# into a HiddenLayer with a tanh activation function connected to the
# LogisticRegression layer; the activation function can be replaced by
# sigmoid or any other nonlinear function
self.hiddenLayer = HiddenLayer(
rng=rng,
input=input,
n_in=n_in,
n_out=n_hidden,
activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
# of the hidden layer
self.logRegressionLayer = LogisticRegression(
input=self.hiddenLayer.output,
n_in=n_hidden,
n_out=n_out
)
# end-snippet-2 start-snippet-3
# L1 norm ; one regularization option is to enforce L1 norm to
# be small
self.L1 = (
abs(self.hiddenLayer.W).sum()
+ abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
# square of L2 norm to be small
self.L2_sqr = (
(self.hiddenLayer.W ** 2).sum()
+ (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
# log likelihood of the output of the model, computed in the
# logistic regression layer
self.negative_log_likelihood = (
self.logRegressionLayer.negative_log_likelihood
)
# same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layer it is
# made out of
self.params = self.hiddenLayer.params + self.logRegressionLayer.params
        # end-snippet-3


def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000,
             dataset='mnist.pkl.gz', batch_size=20, n_hidden=500):
    """
    Demonstrate stochastic gradient descent optimization for a multilayer
    perceptron

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
    gradient

    :type L1_reg: float
    :param L1_reg: L1-norm's weight when added to the cost (see
    regularization)

    :type L2_reg: float
    :param L2_reg: L2-norm's weight when added to the cost (see
    regularization)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz


    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # allocate symbolic variables for the data
index = T.lscalar() # index to a [mini]batch
x = T.matrix('x') # the data is presented as rasterized images
y = T.ivector('y') # the labels are presented as 1D vector of
                        # [int] labels

    rng = numpy.random.RandomState(1234)

    # construct the MLP class
classifier = MLP(
rng=rng,
input=x,
n_in=28 * 28,
n_hidden=n_hidden,
n_out=10
    )

    # start-snippet-4
# the cost we minimize during training is the negative log likelihood of
# the model plus the regularization terms (L1 and L2); cost is expressed
# here symbolically
cost = (
classifier.negative_log_likelihood(y)
+ L1_reg * classifier.L1
+ L2_reg * classifier.L2_sqr
)
    # end-snippet-4

    # compiling a Theano function that computes the mistakes that are made
# by the model on a minibatch
test_model = theano.function(
inputs=[index],
outputs=classifier.errors(y),
givens={
x: test_set_x[index * batch_size:(index + 1) * batch_size],
y: test_set_y[index * batch_size:(index + 1) * batch_size]
}
    )

    validate_model = theano.function(
inputs=[index],
outputs=classifier.errors(y),
givens={
x: valid_set_x[index * batch_size:(index + 1) * batch_size],
y: valid_set_y[index * batch_size:(index + 1) * batch_size]
}
    )

    # start-snippet-5
    # compute the gradient of cost with respect to theta (sotred in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of
# same length, zip generates a list C of same size, where each element
# is a pair formed from the two lists :
# C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
updates = [
(param, param - learning_rate * gparam)
for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
# in the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
inputs=[index],
outputs=cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size],
y: train_set_y[index * batch_size: (index + 1) * batch_size]
}
)
    # end-snippet-5

    ###############
    # TRAIN MODEL #
    ###############
    print '... training'

    # early-stopping parameters
patience = 10000 # look as this many examples regardless
patience_increase = 2 # wait this much longer when a new best is
# found
improvement_threshold = 0.995 # a relative improvement of this much is
# considered significant
validation_frequency = min(n_train_batches, patience / 2)
# go through this many
# minibatche before checking the network
# on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
best_iter = 0
test_score = 0.
    start_time = time.clock()

    epoch = 0
    done_looping = False

    while (epoch < n_epochs) and (not done_looping):
epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            minibatch_avg_cost = train_model(minibatch_index)
# iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index

            if (iter + 1) % validation_frequency == 0:
# compute zero-one loss on validation set
validation_losses = [validate_model(i) for i
in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)

                print(
'epoch %i, minibatch %i/%i, validation error %f %%' %
(
epoch,
minibatch_index + 1,
n_train_batches,
this_validation_loss * 100.
)
                )

                # if we got the best validation score until now
if this_validation_loss < best_validation_loss:
#improve patience if loss improvement is good enough
if (
this_validation_loss < best_validation_loss *
improvement_threshold
):
                        patience = max(patience, iter * patience_increase)

                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
test_losses = [test_model(i) for i
in xrange(n_test_batches)]
                    test_score = numpy.mean(test_losses)

                    print(('     epoch %i, minibatch %i/%i, test error of '
'best model %f %%') %
(epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))

            if patience <= iter:
                done_looping = True
                break

    end_time = time.clock()
print(('Optimization complete. Best validation score of %f %% '
'obtained at iteration %i, with test performance %f %%') %
(best_validation_loss * 100., best_iter + 1, test_score * 100.))
print >> sys.stderr, ('The code for file ' +
os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((end_time - start_time) / 60.))


if __name__ == '__main__':
test_mlp()
The user can then run the code with:
python code/mlp.py
The output should be of the form:
Optimization complete. Best validation score of 1.690000 % obtained at iteration 2070000, with test performance 1.650000 %
The code for file mlp.py ran for 97.34m
On an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz the code runs at roughly 10.3 epochs/minute and reaches a test error rate of 1.65% after 828 epochs. To put this result in perspective, the reader is encouraged to look at this page, which compares the results of different algorithms on MNIST.
4. Tips and Tricks for Training MLPs
The code above contains a number of hyperparameters that are not (and, generally speaking, cannot be) optimized by gradient descent. Strictly speaking, finding an optimal set of hyperparameter values is not a tractable problem. First, we cannot simply optimize each of them independently. Second, we cannot readily apply the gradient techniques described earlier (partly because some hyperparameters are discrete while others are real-valued). Third, the optimization problem is not convex, and finding a (local) minimum involves a non-trivial amount of work. The good news is that over the last 25 years researchers have devised various rules of thumb for choosing hyperparameters in a neural network. A very good overview of these tricks is Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here we summarize the same issues, with an emphasis on the parameters and techniques actually used in our code.
Nonlinearity
The two most commonly used activation functions are tanh and the sigmoid. For the reasons explained in Section 4.4 of that paper, nonlinearities that are symmetric around the origin are preferred, because they tend to produce zero-mean inputs for the next layer (which is a desirable property). Empirically, we have observed that tanh has better convergence properties. (Translator's note: as of 2015 there are of course other activation functions such as ReLU and PReLU; interested readers may want to look into them. A small sketch follows below.)
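For readers who want to experiment, the HiddenLayer class above takes the activation as a parameter, so trying a ReLU only requires passing a different callable. A hedged sketch (not tutorial code; it assumes rng, x and HiddenLayer as defined in the listings above, and defines relu by hand because older Theano versions may not provide T.nnet.relu):

relu = lambda a: T.maximum(0., a)      # elementwise rectifier
hidden_relu = HiddenLayer(rng=rng, input=x, n_in=28 * 28, n_out=500,
                          activation=relu)

Keep in mind that the [Xavier10] initialization interval used in HiddenLayer was derived for tanh/sigmoid, so a different initialization scheme may work better with ReLU.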
Weight initialization
At initialization we want the weights to be small enough around the origin (i.e., around zero) that the activation function operates in its near-linear regime (this is easy to see on the sigmoid's graph: around zero it is nearly linear), where gradients are the largest. Another desirable property, especially for deep networks, is to conserve the variance of the activations as well as the variance of the back-propagated gradients from layer to layer. This lets information flow well both upward and downward in the network and reduces discrepancies between layers. Under some assumptions, a compromise between these two constraints leads to the following initializations:

for tanh: $W \sim \mathrm{uniform}\left[-\sqrt{6/(fan_{in}+fan_{out})},\ \sqrt{6/(fan_{in}+fan_{out})}\right]$

for sigmoid: $W \sim \mathrm{uniform}\left[-4\sqrt{6/(fan_{in}+fan_{out})},\ 4\sqrt{6/(fan_{in}+fan_{out})}\right]$

where $fan_{in}$ is the number of inputs and $fan_{out}$ the number of hidden units. For the mathematical considerations, please refer to [Xavier10].
Learning rate
There is a large body of literature on choosing a good learning rate. The simplest solution is to use a constant rate. A rule of thumb: try several values on a logarithmic scale ($10^{-1}, 10^{-2}, \ldots$) and narrow the (logarithmic) grid search down to the region where the lowest validation error is obtained.

Decreasing the learning rate over time is sometimes a good idea. One simple rule is $\mu_0 / (1 + d \cdot t)$, where $\mu_0$ is the initial rate (typically chosen with the grid-search technique described above), $d$ is the so-called "decrease constant", which controls how quickly the learning rate decreases (usually a small positive number, $10^{-3}$ or smaller), and $t$ is the epoch (stage). A short sketch of this schedule follows below.

Section 4.7 of the same paper details procedures for choosing a learning rate for each parameter (weight) of the network, and for choosing them adaptively based on the error of the classifier.
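A minimal sketch of the $\mu_0 / (1 + d \cdot t)$ schedule mentioned above (my own illustration, not tutorial code; the constants are only examples):

mu_0 = 0.1                    # initial rate, typically found by grid search
d = 1e-3                      # decrease constant
rates = [mu_0 / (1. + d * t) for t in range(1000)]   # one rate per epoch
print rates[0], rates[999]    # 0.1 at epoch 0, about 0.05 at epoch 999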
Number of hidden units
This hyperparameter is very much dataset-dependent. Vaguely speaking, the more complicated the input distribution, the more capacity the network needs to model it, and hence the more hidden units it needs. (Note that the number of weights in a layer, which is perhaps a more direct measure of capacity, is $D \times D_h$, where $D$ is the number of input units and $D_h$ the number of hidden units; a back-of-the-envelope count is given below.)

Unless we employ some regularization scheme (early stopping or L1/L2 penalties), a typical plot of the number of hidden units vs. generalization performance is U-shaped: there is a best trade-off somewhere in the middle, and the error increases on either side of it.
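To make the notion of capacity concrete, a back-of-the-envelope count for the configuration used in this tutorial ($D = 784$ inputs, $D_h = 500$ hidden units, $L = 10$ output classes):

$784 \times 500 + 500 + 500 \times 10 + 10 = 397510$

parameters in total, the vast majority of them in the input-to-hidden weight matrix $W^{(1)}$.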
Regularization parameter
Typical values to try for the L1/L2 regularization parameter $\lambda$ are $10^{-2}, 10^{-3}, \ldots$. In the framework we have described so far, optimizing this parameter will not lead to significantly better solutions, but it is nonetheless worth exploring.
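A hedged sketch of such an exploration (my own illustration, not tutorial code; it simply re-runs test_mlp with different L2 weights, using short runs so the comparison stays cheap, and the candidate values are only examples):

for l2 in [1e-2, 1e-3, 1e-4, 1e-5]:
    print 'training with L2_reg = %g' % l2
    test_mlp(L2_reg=l2, n_epochs=10)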
References:
[1] Official tutorial: http://deeplearning.net/tutorial/mlp.html#mlp
[2] Deep learning with Theano official tutorial, Chinese translation, part 3 - Multilayer Perceptron (MLP): http://www.cnblogs.com/charleshuang/p/3648804.html