Derivation and MATLAB Implementation of the Three-Layer Neural Network Autoencoder Algorithm (Repost)

Reposted from: http://www.cnblogs.com/tornadomeet/archive/2013/03/20/2970724.html

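Preface:

This post works through a hands-on sparse autoencoder exercise, following Ng's online tutorial, Exercise:Sparse Autoencoder. The task is roughly as follows: crop 10000 small 8*8 patches from a given set of natural images, then train a sparse autoencoder on them and look at the features its hidden layer learns. The network has 3 layers: 64 input nodes, 25 hidden nodes, and, since an autoencoder reconstructs its input, 64 output nodes.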

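Background:

The core of the implementation is computing the network's cost function and its partial derivatives. The formulas are in the earlier post Deep learning：八(Sparse Autoencoder). The steps are summarized below in plain language to make the flow of the algorithm clear.

1. Compute the input value of every node (the z values in the code) and its output value (the a values in the code; a is the sigmoid of z).

2. Use the z and a values to compute the error term of every node (the delta values in the code).

3. Express the cost function and its partial derivatives in terms of each node's a, z, and delta. This is a matter of mathematical derivation; the resulting formulas are those in the earlier post Deep learning：八(Sparse Autoencoder).

Step 1 runs forward, i.e. in the direction input layer -> hidden layer -> output layer. Step 2 runs backward (which is where the name BP, back-propagation, comes from): the error terms of the nodes are computed in the direction output layer -> hidden layer -> input layer.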
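For reference, steps 1 to 3 can be written out as below. This is a sketch in the notation of the UFLDL notes, not a quotation from the earlier post. The symbols are assumptions of this sketch: $f$ is the sigmoid, $a^{(1)} = x$ is the input, $m$ is the number of samples, $\rho$ is the target sparsity (sparsityParam in the code), $\hat\rho_j$ is the average activation of hidden unit $j$, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
z^{(2)} &= W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)}), \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)}, \qquad a^{(3)} = f(z^{(3)}), \\
J_{\mathrm{sparse}}(W,b) &= \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\bigl\| a^{(3)}(x^{(i)}) - x^{(i)} \bigr\|^2
  + \frac{\lambda}{2}\sum_{l=1}^{2} \bigl\| W^{(l)} \bigr\|_F^2
  + \beta \sum_{j} \mathrm{KL}(\rho \,\|\, \hat\rho_j), \\
\mathrm{KL}(\rho \,\|\, \hat\rho_j) &= \rho \log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}, \\
\delta^{(3)} &= -\bigl(x - a^{(3)}\bigr) \odot f'(z^{(3)}), \\
\delta^{(2)} &= \Bigl( (W^{(2)})^{T}\delta^{(3)} + \beta\Bigl(-\frac{\rho}{\hat\rho} + \frac{1-\rho}{1-\hat\rho}\Bigr) \Bigr) \odot f'(z^{(2)}), \\
\nabla_{W^{(l)}} J &= \frac{1}{m}\sum_{i=1}^{m} \delta^{(l+1)} (a^{(l)})^{T} + \lambda W^{(l)}, \qquad
\nabla_{b^{(l)}} J = \frac{1}{m}\sum_{i=1}^{m} \delta^{(l+1)}.
\end{aligned}
```

For the sigmoid, $f'(z) = f(z)(1 - f(z))$, which is what the helper sigmoidInv in sparseAutoencoderCost.m evaluates.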

Some MATLAB functions:

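bsxfun:

C = bsxfun(fun, A, B) applies an element-wise binary operation to the arrays A and B. fun is a function handle, either to a built-in function or to a function defined in an M-file. Common choices are addition (@plus), subtraction (@minus), and so on; note the leading '@'. A and B normally have the same size. If they do not, then in every dimension where their sizes differ, one of the two must have size 1 in that dimension. For example, in bsxfun(@minus, A, mean(A)), A and mean(A) have different sizes: mean(A) is first expanded to the size of A, and then each element of the expanded mean(A) is subtracted from the corresponding element of A.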
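A small sketch of the mean-subtraction case described above; the matrix values are made up for illustration, and the repmat line is only there to show what the implicit expansion is equivalent to.

```matlab
% Subtract the column means from a small matrix with bsxfun.
A = [1 2; 3 6];                          % 2x2 example data (hypothetical values)
colMean = mean(A);                       % 1x2 row vector of column means: [2 4]
centered = bsxfun(@minus, A, colMean);   % colMean is expanded along dimension 1

% The same result with an explicit copy of the means.
centeredExplicit = A - repmat(colMean, size(A, 1), 1);

disp(centered);
```

normalizeData in sampleIMAGES.m uses the same call to remove the mean of each patch.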

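rand:

Generates uniformly distributed pseudorandom numbers in the open interval (0,1). Main syntax:

rand(m,n) returns an m-by-n matrix of uniformly distributed pseudorandom numbers.

rand(m,n,'double') returns values of the specified precision; the last argument may also be 'single'.

rand(RandStream,m,n) draws the numbers from the specified RandStream (which I think of as the random seed).

randn:

Generates pseudorandom numbers from the standard normal distribution (mean 0, variance 1). Main syntax: same as for rand above.

randi:

Generates uniformly distributed pseudorandom integers. Main syntax:

randi(iMax) returns an integer from the closed interval [1, iMax].

randi(iMax,m,n) returns an m-by-n matrix of integers from the closed interval [1, iMax].

r = randi([iMin,iMax],m,n) returns an m-by-n matrix of integers from the closed interval [iMin, iMax].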
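The three generators can be compared side by side. This is a minimal sketch; the sizes and the half-width r are arbitrary values chosen for illustration.

```matlab
u   = rand(3, 2);            % 3x2, uniform on the open interval (0,1)
g   = randn(3, 2);           % 3x2, standard normal
k   = randi(10, 1, 5);       % 1x5 integers drawn from 1..10
idx = randi([5 8], 4, 1);    % 4x1 integers drawn from 5..8

% The scaling used in initializeParameters.m: map (0,1) onto (-r, r).
r = 0.5;                     % hypothetical half-width
w = rand(3, 2) * 2 * r - r;
```

In train.m, randi(size(patches,2),204,1) uses the second form to pick 204 column indices between 1 and 10000.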

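exist:

Tests whether a name exists. For example, exist('opt_normalize', 'var') checks whether the variable opt_normalize exists and returns 1 if it does; the 'var' argument restricts the check to variables.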
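The typical use is supplying a default for an optional argument. The sketch below shows the idiom at script level for brevity (inside a function, the variable would simply be an argument the caller left out); the names wasDefined and isDefinedNow are introduced only for this example.

```matlab
clear opt_normalize                            % make sure the variable is undefined
wasDefined = exist('opt_normalize', 'var');    % 0: no such variable

if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize = true;                      % fall back to the default
end

isDefinedNow = exist('opt_normalize', 'var');  % 1: the variable exists now
```

The short-circuit || matters here: isempty(opt_normalize) is only evaluated when the variable exists.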

colormap:

Sets the colormap used by the current figure.

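floor:

floor(A): rounds down, returning the largest integer not greater than A.

ceil:

ceil(A): rounds up, returning the smallest integer not less than A.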
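As an illustration, here is how floor and ceil combine in the grid-layout logic of display_network.m when the number of columns is not given. The value M = 204 matches the number of patches shown in step 1; everything else is copied from that function's logic.

```matlab
M = 204;                                  % number of images to lay out
isSquare = (floor(sqrt(M))^2 == M);       % false: 204 is not a perfect square

n = ceil(sqrt(M));                        % start from 15 columns
while mod(M, n) ~= 0 && n < 1.2*sqrt(M)   % look for a nearby divisor of M
    n = n + 1;
end
m = ceil(M / n);                          % rows needed for n columns

lo = floor(-2.5);                         % rounds toward minus infinity
hi = ceil(-2.5);                          % rounds toward plus infinity
```

With these values the loop stops at n = 17, since 204 = 12 * 17, giving a 12-by-17 grid.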

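imagesc:

imagesc is similar to image and can be used to display an image. For example, in imagesc(array,'EraseMode','none',[-1 1]), the limits [-1 1] map the data values of array linearly onto the current colormap, so that the range [-1,1] spans the whole colormap and values outside it are clipped to its ends. Setting the erase mode to none means the background is not erased on redraw. Note that the EraseMode property was removed in MATLAB R2014b, so on newer releases the 'EraseMode','none' pair has to be dropped from the call.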

repmat:

This function tiles a matrix, copying the original data into each tile. For example, B = repmat(A,m,n) creates a matrix B made of m*n copies of A, so the size of B is [size(A,1)*m size(A,2)*n].
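A short sketch of both uses: plain tiling, and replicating a bias vector across columns as sparseAutoencoderCost.m does in z2 = W1*data+repmat(b1,1,m). The matrices Z and b are hypothetical values, and the bsxfun line is an alternative shown for comparison, not what the exercise code uses.

```matlab
A = [1 2; 3 4];
B = repmat(A, 2, 3);                % 2 copies down, 3 copies across
sizeB = size(B);                    % [size(A,1)*2, size(A,2)*3]

% Add a bias column vector to every column of a matrix.
Z = [1 2 3; 4 5 6];                 % hypothetical 2x3 pre-activations
b = [10; 20];                       % hypothetical 2x1 bias
withRepmat = Z + repmat(b, 1, size(Z, 2));
withBsxfun = bsxfun(@plus, Z, b);   % same result without building the copy
```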

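Why use function handles:

Without a function handle, every one of repeated calls to a function triggers a full path search for that function, which directly slows the computation; a handle avoids this cost because it points at the function directly, much like a function pointer. A function handle acts like a name for the function, somewhat like a reference in C++.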
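A brief sketch of the three ways handles appear in this exercise's style of code: a handle to a named function, an anonymous function that captures extra arguments (the pattern train.m uses when passing sparseAutoencoderCost to minFunc), and a handle passed to another function. The variable scale is a made-up parameter for illustration.

```matlab
% A handle to a named function.
f  = @sin;
y1 = f(pi/2);

% An anonymous function that captures an extra argument.
scale = 3;                          % hypothetical extra parameter
g  = @(x) scale * x.^2;             % scale is captured when g is created
y2 = g(2);

% Handles can be passed to other functions, e.g. cellfun.
lens = cellfun(@numel, {'ab', 'cde'});
```

Because g captures scale at creation time, changing scale afterwards does not affect g.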

Procedure:

First run step 1 of the main program train.m, which randomly samples 10000 small patches and displays 204 of them:

[Figure: 204 of the sampled patches]

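Next run steps 2 and 3 of train.m, which compute the cost function and its gradient and then verify them. Gradient checking can take a very long time: for me it ran for over an hour and a half (it still had not finished after more than an hour, so I went to bed). The check reported a relative difference of only 6.5101e-11, far below 1e-9, which indicates that the cost and gradient code is correct. The parameters can then be fitted with an optimizer; the one supplied with this exercise is L-BFGS. After a few minutes of optimization the result is ready.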
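The check itself is easy to try on a tiny function first. The sketch below uses the same test function as checkNumericalGradient.m, h(x1,x2) = x1^2 + 3*x1*x2, and the same relative-difference measure as train.m; the helper names h, hGrad, and relDiff are introduced only for this example.

```matlab
% Test function and its analytic gradient.
h     = @(x) x(1)^2 + 3*x(1)*x(2);
hGrad = @(x) [2*x(1) + 3*x(2); 3*x(1)];

x = [4; 10];
epsilon = 1e-4;
numgrad = zeros(size(x));
for i = 1:numel(x)
    e = zeros(size(x));
    e(i) = epsilon;                           % perturb only coordinate i
    numgrad(i) = (h(x + e) - h(x - e)) / (2*epsilon);
end

grad = hGrad(x);                              % analytic gradient at x
relDiff = norm(numgrad - grad) / norm(numgrad + grad);
disp(relDiff);
```

The cost of the check explains the long run time reported above: the central difference needs two cost evaluations per parameter, and this network has 25*64 + 64*25 + 25 + 64 = 3289 parameters, so one full check means 6578 evaluations of the cost over all 10000 patches. Checking on a few samples and one or two hidden units, as the starter code suggests, is much faster.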

The final W1 weights look like this:

[Figure: visualization of the learned W1 weights]

Code:

train.m:

%% CS294A/CS294W Programming Assignment Starter Code

%  Instructions

%  ------------

%

%  This file contains code that helps you get started on the

%  programming assignment. You will need to complete the code in sampleIMAGES.m,

%  sparseAutoencoderCost.m and computeNumericalGradient.m.

%  For the purpose of completing the assignment, you do not need to

%  change the code in this file.

%

%%======================================================================

%% STEP 0: Here we provide the relevant parameters values that will

%  allow your sparse autoencoder to get good filters; you do not need to

%  change the parameters below.

visibleSize = 8*8;   % number of input units

hiddenSize = 25;     % number of hidden units

sparsityParam = 0.01;   % desired average activation of the hidden units.

                     % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",

             %  in the lecture notes).

lambda = 0.0001;     % weight decay parameter

beta = 3;            % weight of sparsity penalty term       

%%======================================================================

%% STEP 1: Implement sampleIMAGES

%

%  After implementing sampleIMAGES, the display_network command should

%  display a random sample of 204 patches from the dataset

patches = sampleIMAGES;

display_network(patches(:,randi(size(patches,2),204,1)),8);%randi(size(patches,2),204,1)

                                                           %returns a 204-element column vector whose entries are

                                                           %random integers in 1~10000, i.e. 204 randomly chosen patches are displayed

%  Obtain random parameters theta

theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================

%% STEP 2: Implement sparseAutoencoderCost

%

%  You can implement all of the components (squared error cost, weight decay term,

%  sparsity penalty) in the cost function at once, but it may be easier to do

%  it step-by-step and run gradient checking (see STEP 3) after each step.  We

%  suggest implementing the sparseAutoencoderCost function using the following steps:

%

%  (a) Implement forward propagation in your neural network, and implement the

%      squared error term of the cost function.  Implement backpropagation to

%      compute the derivatives.   Then (using lambda=beta=0), run Gradient Checking

%      to verify that the calculations corresponding to the squared error cost

%      term are correct.

%

%  (b) Add in the weight decay term (in both the cost function and the derivative

%      calculations), then re-run Gradient Checking to verify correctness.

%

%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to

%      verify correctness.

%

%  Feel free to change the training settings when debugging your

%  code.  (For example, reducing the training set size or

%  number of hidden units may make your code run faster; and setting beta

%  and/or lambda to zero may be helpful for debugging.)  However, in your

%  final submission of the visualized weights, please use parameters we

%  gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...

                                     sparsityParam, beta, patches);

%%======================================================================

%% STEP 3: Gradient Checking

%

% Hint: If you are debugging your code, performing gradient checking on smaller models

% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden

% units) may speed things up.

% First, lets make sure your numerical gradient computation is correct for a

% simple function.  After you have implemented computeNumericalGradient.m,

% run the following:

checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations

% for the sparse autoencoder.

numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...

                                                  hiddenSize, lambda, ...

                                                  sparsityParam, beta, ...

                                                  patches), theta);

% Use this to visually compare the gradients side by side

%disp([numgrad grad]); 

% Compare numerically computed gradients with the ones obtained from backpropagation

diff = norm(numgrad-grad)/norm(numgrad+grad);

disp(diff); % Should be small. In our implementation, these values are

            % usually less than 1e-9.

            % When you got this working, Congratulations!!! 

%%======================================================================

%% STEP 4: After verifying that your implementation of

%  sparseAutoencoderCost is correct, You can start training your sparse

%  autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters

theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function

addpath minFunc/

options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost

                          % function. Generally, for minFunc to work, you

                          % need a function pointer with two outputs: the

                          % function value and the gradient. In our problem,

                          % sparseAutoencoderCost.m satisfies this.

options.maxIter = 400;      % Maximum number of iterations of L-BFGS to run

options.display = 'on';

[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...

                                   visibleSize, hiddenSize, ...

                                   lambda, sparsityParam, ...

                                   beta, patches), ...

                              theta, options);

%%======================================================================

%% STEP 5: Visualization 

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);

figure;

display_network(W1', 12); 

print -djpeg weights.jpg   % save the visualization to a file

sampleIMAGES.m:

function patches = sampleIMAGES()

% sampleIMAGES

% Returns 10000 patches for training

load IMAGES;    % load images from disk 

patchsize = 8;  % we'll use 8x8 patches

numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one

% column per patch, 10000 columns.

patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------

%  Instructions: Fill in the variable called "patches" using data

%  from IMAGES.

%

%  IMAGES is a 3D array containing 10 images

%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,

%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize

%  it. (The contrast on these images looks a bit off because they have

%  been preprocessed using "whitening."  See the lecture notes for

%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image

%  patch corresponding to the pixels in the block (21,21) to (30,30) of

%  Image 1

for imageNum = 1:10% sample 1000 random patches from each image, 10000 patches in total

    [rowNum colNum] = size(IMAGES(:,:,imageNum));

    for patchNum = 1:1000% take 1000 patches from the current image

        xPos = randi([1,rowNum-patchsize+1]);

        yPos = randi([1, colNum-patchsize+1]);

        patches(:,(imageNum-1)*1000+patchNum) = reshape(IMAGES(xPos:xPos+patchsize-1,yPos:yPos+patchsize-1,...

                                                        imageNum),patchsize*patchsize,1);

    end

end

%% ---------------------------------------------------------------

% For the autoencoder to work well we need to normalize the data

% Specifically, since the output of the network is bounded between [0,1]

% (due to the sigmoid activation function), we have to make sure

% the range of pixel values is also bounded between [0,1]

patches = normalizeData(patches);

end

%% ---------------------------------------------------------------

function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation

% function in the output layer

% Remove DC (mean of images).

patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1

pstd = 3 * std(patches(:));

patches = max(min(patches, pstd), -pstd) / pstd;% by the 3-sigma rule, about 99.7% of normally distributed data falls within this range

                                                % after this step the data lies between -1 and 1

% Rescale from [-1,1] to [0.1,0.9]

patches = (patches + 1) * 0.4 + 0.1;

end

initializeParameters.m:

function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.

r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]

W1 = rand(hiddenSize, visibleSize) * 2 * r - r;

W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);

b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.

% This step will "unroll" (flatten and concatenate together) all

% your parameters into a vector, which can then be used with minFunc.

theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

end

sparseAutoencoderCost.m:

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...

                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)

% hiddenSize: the number of hidden units (probably 25)

% lambda: weight decay parameter

% sparsityParam: The desired average activation for the hidden units (denoted in the lecture

%                           notes by the greek alphabet rho, which looks like a lower-case "p").

% beta: weight of sparsity penalty term

% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example. 

% The input theta is a vector (because minFunc expects the parameters to be a vector).

% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this

% follows the notation convention of the lecture notes. 

% unpack the long parameter vector into each layer's weight matrix and bias vector

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);

W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);

b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).

% Here, we initialize them to zeros.

cost = 0;

W1grad = zeros(size(W1));

W2grad = zeros(size(W2));

b1grad = zeros(size(b1));

b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------

%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,

%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.

%

% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.

% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions

% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with

% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)

% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term

% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2

% of the lecture notes (and similarly for W2grad, b1grad, b2grad).

%

% Stated differently, if we were using batch gradient descent to optimize the parameters,

% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.

% 

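Jcost = 0;% reconstruction error

Jweight = 0;% weight decay penalty

Jsparse = 0;% sparsity penalty

[n m] = size(data);% m is the number of samples, n is the number of features per sample

% forward pass: compute each node's weighted input (z) and activation (a)

z2 = W1*data+repmat(b1,1,m);% note: b1 must be replicated into an m-column matrix here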

a2 = sigmoid(z2);

z3 = W2*a2+repmat(b2,1,m);

a3 = sigmoid(z3);

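% compute the reconstruction error

Jcost = (0.5/m)*sum(sum((a3-data).^2));

% compute the weight decay term

Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));

% compute the sparsity penalty term

rho = (1/m).*sum(a2,2);% average activation of each hidden unit over all samples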

Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...

        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));

% total cost

cost = Jcost+lambda*Jweight+beta*Jsparse;

% backward pass: compute the error term of each node

d3 = -(data-a3).*sigmoidInv(z3);

sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));% the sparsity penalty adds this

                                                             % extra term to the hidden-layer error

d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2); 

% compute W1grad

W1grad = W1grad+d2*data';

W1grad = (1/m)*W1grad+lambda*W1;

% compute W2grad

W2grad = W2grad+d3*a2';

W2grad = (1/m).*W2grad+lambda*W2;

% compute b1grad

b1grad = b1grad+sum(d2,2);

b1grad = (1/m)*b1grad;% the gradient of b is a vector, so each row of d2 is summed over the samples

% compute b2grad

b2grad = b2grad+sum(d3,2);

b2grad = (1/m)*b2grad;

% %% Method 2: process one sample at a time (slow)

% m=size(data,2);

% rho=zeros(size(b1));

% for i=1:m

%     %feedforward

%     a1=data(:,i);

%     z2=W1*a1+b1;

%     a2=sigmoid(z2);

%     z3=W2*a2+b2;

%     a3=sigmoid(z3);

%     %cost=cost+(a1-a3)'*(a1-a3)*0.5;

%     rho=rho+a2;

% end

% rho=rho/m;

% sterm=beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));

% %sterm=beta*2*rho;

% for i=1:m

%     %feedforward

%     a1=data(:,i);

%     z2=W1*a1+b1;

%     a2=sigmoid(z2);

%     z3=W2*a2+b2;

%     a3=sigmoid(z3);

%     cost=cost+(a1-a3)'*(a1-a3)*0.5;

%     %backpropagation

%     delta3=(a3-a1).*a3.*(1-a3);

%     delta2=(W2'*delta3+sterm).*a2.*(1-a2);

%     W2grad=W2grad+delta3*a2';

%     b2grad=b2grad+delta3;

%     W1grad=W1grad+delta2*a1';

%     b1grad=b1grad+delta2;

% end

%

% kl=sparsityParam*log(sparsityParam./rho)+(1-sparsityParam)*log((1-sparsityParam)./(1-rho));

% %kl=rho.^2;

% cost=cost/m;

% cost=cost+sum(sum(W1.^2))*lambda/2.0+sum(sum(W2.^2))*lambda/2.0+beta*sum(kl);

% W2grad=W2grad./m+lambda*W2;

% b2grad=b2grad./m;

% W1grad=W1grad./m+lambda*W1;

% b1grad=b1grad./m;

%-------------------------------------------------------------------

% After computing the cost and gradient, we will convert the gradients back

% to a vector format (suitable for minFunc).  Specifically, we will unroll

% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------

% Here's an implementation of the sigmoid function, which you may find useful

% in your computation of the costs and the gradients.  This inputs a (row or

% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));

end

% derivative of the sigmoid function (despite the name, this is not its inverse)

function sigmInv = sigmoidInv(x)

    sigmInv = sigmoid(x).*(1-sigmoid(x));

end

computeNumericalGradient.m:

function numgrad = computeNumericalGradient(J, theta)

% numgrad = computeNumericalGradient(J, theta)

% theta: a vector of parameters

% J: a function that outputs a real-number. Calling y = J(theta) will return the

% function value at theta. 

% Initialize numgrad with zeros

numgrad = zeros(size(theta));

%% ---------- YOUR CODE HERE --------------------------------------

% Instructions:

% Implement numerical gradient checking, and return the result in numgrad.

% (See Section 2.3 of the lecture notes.)

% You should write code so that numgrad(i) is (the numerical approximation to) the

% partial derivative of J with respect to the i-th input argument, evaluated at theta.

% I.e., numgrad(i) should be the (approximately) the partial derivative of J with

% respect to theta(i).

%

% Hint: You will probably want to compute the elements of numgrad one at a time. 

epsilon = 1e-4;

n = size(theta,1);

E = eye(n);

for i = 1:n

    delta = E(:,i)*epsilon;

    numgrad(i) = (J(theta+delta)-J(theta-delta))/(epsilon*2.0);

end

%% ---------------------------------------------------------------

end

checkNumericalGradient.m:

function [] = checkNumericalGradient()

% This code can be used to check your numerical gradient implementation

% in computeNumericalGradient.m

% It analytically evaluates the gradient of a very simple function called

% simpleQuadraticFunction (see below) and compares the result with your numerical

% solution. Your numerical gradient implementation is incorrect if

% your numerical solution deviates too much from the analytical solution.

% Evaluate the function and gradient at x = [4; 10]; (Here, x is a 2d vector.)

x = [4; 10];

[value, grad] = simpleQuadraticFunction(x);

% Use your code to numerically compute the gradient of simpleQuadraticFunction at x.

% (The notation "@simpleQuadraticFunction" denotes a pointer to a function.)

numgrad = computeNumericalGradient(@simpleQuadraticFunction, x);

% Visually examine the two gradient computations.  The two columns

% you get should be very similar.

disp([numgrad grad]);

fprintf('The above two columns you get should be very similar.\n(Left-Your Numerical Gradient, Right-Analytical Gradient)\n\n');

% Evaluate the norm of the difference between two solutions.

% If you have a correct implementation, and assuming you used EPSILON = 0.0001

% in computeNumericalGradient.m, then diff below should be 2.1452e-12

diff = norm(numgrad-grad)/norm(numgrad+grad);

disp(diff);

fprintf('Norm of the difference between numerical and analytical gradient (should be < 1e-9)\n\n');

end

function [value,grad] = simpleQuadraticFunction(x)

% this function accepts a 2D vector as input.

% Its outputs are:

%   value: h(x1, x2) = x1^2 + 3*x1*x2

%   grad: A 2x1 vector that gives the partial derivatives of h with respect to x1 and x2

% Note that when we pass @simpleQuadraticFunction(x) to computeNumericalGradients, we're assuming

% that computeNumericalGradients will use only the first returned value of this function.

value = x(1)^2 + 3*x(1)*x(2);

grad = zeros(2, 1);

grad(1)  = 2*x(1) + 3*x(2);

grad(2)  = 3*x(1);

end

display_network.m:

function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)

% This function visualizes filters in matrix A. Each column of A is a

% filter. We will reshape each column into a square image and visualizes

% on each cell of the visualization panel.

% All other parameters are optional, usually you do not need to worry

% about it.

% opt_normalize: whether we need to normalize the filter so that all of

% them can have similar contrast. Default value is true.

% opt_graycolor: whether we use gray as the heat map. Default is true.

% cols: how many columns are there in the display. Default value is the

% squareroot of the number of columns in A.

% opt_colmajor: you can switch convention to row major for A. In that

% case, each row of A is a filter. Default value is false.

warning off all

% exist('A','var') tests whether A exists; 'var' restricts the check to variables

if ~exist('opt_normalize', 'var') || isempty(opt_normalize)

    opt_normalize= true;

end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)

    opt_graycolor= true;

end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)

    opt_colmajor = false;

end

% rescale

A = A - mean(A(:));

% colormap(gray) selects the grayscale colormap

if opt_graycolor, colormap(gray); end

% compute rows, cols

[L M]=size(A);

sz=sqrt(L);

buf=1;

if ~exist('cols', 'var')% the number of columns was not given

    if floor(sqrt(M))^2 ~= M % M is not a perfect square

        n=ceil(sqrt(M));

        while mod(M, n)~=0 && n<1.2*sqrt(M), n=n+1; end

        m=ceil(M/n);% m is the number of rows in the display grid

    else

        n=sqrt(M);

        m=n;

    end

else

    n = cols;

    m = ceil(M/n);

end

array=-ones(buf+m*(sz+buf),buf+n*(sz+buf));

if ~opt_graycolor

    array = 0.1.* array;

end

if ~opt_colmajor

    k=1;

    for i=1:m

        for j=1:n

            if k>M,

                continue;

            end

            clim=max(abs(A(:,k)));

            if opt_normalize

                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;

            else

                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));

            end

            k=k+1;

        end

    end

else

    k=1;

    for j=1:n

        for i=1:m

            if k>M,

                continue;

            end

            clim=max(abs(A(:,k)));

            if opt_normalize

                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;

            else

                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz);

            end

            k=k+1;

        end

    end

end

if opt_graycolor

    h=imagesc(array,[-1 1]);% the original call also passed 'EraseMode','none' (do not erase pixels on redraw); that property was removed in MATLAB R2014b

else

    h=imagesc(array,[-1 1]);

end

axis image off

drawnow;

warning on all

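Summary:

What do the weight images in the results represent? According to Visualizing a Trained Autoencoder, if the input is constrained to have 2-norm at most 1, i.e.

$$\|x\|^2 = \sum_{j=1}^{64} x_j^2 \le 1,$$

then it can be shown that the activation of hidden unit $i$ is largest only when every component of the input $x$ satisfies $x_j = W^{(1)}_{ij} \big/ \sqrt{\sum_{j=1}^{64} (W^{(1)}_{ij})^2}$. In other words, this is the input that most easily drives that hidden node's output toward 1. It follows that the input values are positively correlated with the weights, which is why each row of W1 can be displayed as an 8*8 image.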
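A small numerical sketch of this statement, under stated assumptions: the 2-by-2 matrix W1 and the comparison input xOther are made-up values, the bias is taken as 0, and sigmoid is redefined locally so the block runs on its own.

```matlab
W1 = [3 4; 1 0];                         % hypothetical weights: 2 hidden units, 2 inputs
i  = 1;                                  % hidden unit to visualize
xStar = W1(i,:)' / norm(W1(i,:));        % row i of W1 scaled to unit 2-norm

sigmoid = @(z) 1 ./ (1 + exp(-z));
aStar  = sigmoid(W1(i,:) * xStar);       % activation at the maximizing input
xOther = [1; 0];                         % another input with unit norm
aOther = sigmoid(W1(i,:) * xOther);      % smaller activation
```

Here W1(i,:)*xStar equals norm(W1(i,:)) = 5, while the other unit-norm input only reaches 3, so aStar is larger than aOther.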

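Addendum (2013-05-06):

When I first wrote sparseAutoencoderCost.m in vectorized form for this post, it never worked. The problem is now solved: the fix was simply to replace the earlier Iweight with Jweight.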

References:

Exercise:Sparse Autoencoder

Deep learning：八(Sparse Autoencoder)

Autoencoders and Sparsity

Visualizing a Trained Autoencoder

UFLDL Exercise (Sparse Autoencoder)

http://code.google.com/p/nlsbook/source/browse/trunk/nlsbook/cs294ps1/starter/?r=28

Author: tornadomeet. Source: http://www.cnblogs.com/tornadomeet. Reposting and sharing are welcome, but please credit the source. (Sina Weibo: tornadomeet; feel free to get in touch!)
