Rupesh Kumar Srivastava (email: RUPESH@IDSIA.CH)
Klaus Greff (email: KLAUS@IDSIA.CH)
Jürgen Schmidhuber (email: JUERGEN@IDSIA.CH)
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA)
Università della Svizzera italiana (USI)
Scuola universitaria professionale della Svizzera italiana (SUPSI)
Galleria 2, 6928 Manno-Lugano, Switzerland

Abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

Note: A full paper extending this study is available at http://arxiv.org/abs/1507.06228, with additional references, experiments and analysis.

1. Introduction

Many recent empirical breakthroughs in supervised machine learning have been achieved through the application of deep neural networks. Network depth (referring to the number of successive computation layers) has played perhaps the most important role in these successes. For instance, the top-5 image classification accuracy on the 1000-class ImageNet dataset has increased from 84% (Krizhevsky et al., 2012) to 95% (Szegedy et al., 2014; Simonyan & Zisserman, 2014) in just a few years, through the use of ensembles of deeper architectures and smaller receptive fields (Ciresan et al., 2011a; b; 2012).

On the theoretical side, it is well known that deep networks can represent certain function classes exponentially more efficiently than shallow ones (e.g. the work of Håstad (1987); Håstad & Goldmann (1991) and recently of Montufar et al. (2014)). As argued by Bengio et al. (2013), the use of deep networks can offer both computational and statistical efficiency for complex tasks.

However, training deeper networks is not as straightforward as simply adding layers. Optimization of deep networks has proven to be considerably more difficult, leading to research on initialization schemes (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015), techniques for training networks in multiple stages (Simonyan & Zisserman, 2014; Romero et al., 2014), or training with temporary companion loss functions attached to some of the layers (Szegedy et al., 2014; Lee et al., 2015).

In this extended abstract, we present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995). Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.

In preliminary experiments, we found that highway networks as deep as 900 layers can be optimized using simple Stochastic Gradient Descent (SGD) with momentum. For up to 100 layers we compare their training behavior to that of traditional networks with normalized initialization (Glorot & Bengio, 2010; He et al., 2015). We show that optimization of highway networks is virtually independent of depth, while for traditional networks it suffers significantly as the number of layers increases. We also show that architectures comparable to those recently presented by Romero et al. (2014) can be directly trained to obtain similar test set accuracy on the CIFAR-10 dataset without the need for a pre-trained teacher network.

1.1. Notation

We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as σ(x) = 1/(1 + e^(-x)), x ∈ R.

2. Highway Networks

A plain feedforward neural network typically consists of L layers where the lth layer (l∈{1, 2, ...,L}) applies a nonlinear transform H (parameterized by WH,l) on its input xl to produce its output yl. Thus, x1 is the input to the network and yL is the network’s output. Omitting the layer index and biases for clarity,

y=H(x, WH)                                                                                  (1)

H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.

For a highway network, we additionally define two non-linear transforms T(x, WT) and C(x, WC) such that

y = H(x,WH)·T(x,WT) + x·C(x,WC)                                                             (2)

We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and by carrying it, respectively. For simplicity, in this paper we set C = 1 - T, giving

y = H(x,WH)·T(x,WT) + x·(1 - T(x,WT))                                                       (3)

The dimensionality of x, y, H(x, WH) and T(x, WT) must be the same for Equation (3) to be valid. Note that this re-parametrization of the layer transformation is much more flexible than Equation (1). In particular, observe that

y = x            if T(x,WT) = 0,
y = H(x,WH)      if T(x,WT) = 1.                                                            (4)

Similarly, for the Jacobian of the layer transform,

dy/dx = I            if T(x,WT) = 0,
dy/dx = H'(x,WH)     if T(x,WT) = 1.                                                        (5)
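
The two limiting cases of Equation (5) can be checked by differentiating Equation (3) component by component. The derivation below is a sketch added for illustration and is not part of the original text. It assumes H and T are differentiable, writes H' and T' for their Jacobians, and uses diag(v) for the diagonal matrix built from a vector v (notation introduced here).

```latex
\begin{aligned}
y_i &= H_i(\mathbf{x})\,T_i(\mathbf{x}) + x_i\,\bigl(1 - T_i(\mathbf{x})\bigr) \\
\frac{\partial y_i}{\partial x_j}
  &= T_i\,\frac{\partial H_i}{\partial x_j}
   + \bigl(H_i - x_i\bigr)\,\frac{\partial T_i}{\partial x_j}
   + \delta_{ij}\,\bigl(1 - T_i\bigr) \\
\frac{d\mathbf{y}}{d\mathbf{x}}
  &= \operatorname{diag}(T)\,H'
   + \operatorname{diag}(H - \mathbf{x})\,T'
   + \operatorname{diag}(\mathbf{1} - T)
\end{aligned}
```

If the gate is held fixed at T = 0 (so that T' = 0), only the last term survives and the Jacobian is I. If it is held fixed at T = 1, only the first term survives and the Jacobian is H'. For a sigmoid gate, T' also vanishes as the gate saturates, which is consistent with reading these as limiting cases.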

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the ith unit computes yi = Hi(x), a highway network consists of multiple blocks such that the ith block computes a block state Hi(x) and transform gate output Ti(x). Finally, it produces the block output yi = Hi(x)·Ti(x) + xi·(1 - Ti(x)), which is connected to the next layer.
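
The block computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`affine`, `highway_layer`, `random_matrix`), the choice of tanh for H, the layer width of 4, and the gate bias values of ±20 are all assumptions made for the example. The matrix `W` passed to `affine` plays the role of the transposed weight matrix in the text.

```python
import math
import random


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Return W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer with C = 1 - T (Equation 3), computed block by block."""
    h = [math.tanh(a) for a in affine(W_H, b_H, x)]  # block state H_i(x)
    t = [sigmoid(a) for a in affine(W_T, b_T, x)]    # transform gate T_i(x)
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]


def random_matrix(n, rng, scale=0.1):
    """Hypothetical helper: an n x n matrix of small zero-mean Gaussian weights."""
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


rng = random.Random(0)
n = 4
x = [0.5, -1.0, 2.0, 0.0]
W_H, W_T = random_matrix(n, rng), random_matrix(n, rng)
b_H = [0.0] * n

# Strongly negative gate bias: T is close to 0, so the layer carries x through.
y_carry = highway_layer(x, W_H, b_H, W_T, [-20.0] * n)

# Strongly positive gate bias: T is close to 1, so the layer acts like a plain layer.
y_plain = highway_layer(x, W_H, b_H, W_T, [20.0] * n)
h_plain = [math.tanh(a) for a in affine(W_H, b_H, x)]
```

With these settings `y_carry` agrees with `x` and `y_plain` agrees with `h_plain` up to a very small numerical difference, matching the two limiting cases of Equation (4). Intermediate gate values give a blend of the two.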

2.1. Constructing Highway Networks

As mentioned earlier, Equation (3) requires that the dimensionality of x, y, H(x,WH) and T(x,WT) be the same. In cases when it is desirable to change the size of the representation, one can replace x with x̂ obtained by suitably sub-sampling or zero-padding x. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.
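
The first option, building x̂ by sub-sampling or zero-padding, could look like the sketch below. The text does not specify a particular scheme, so the function names and the choices made here (zeros appended at the end, evenly spaced sub-sampling) are assumptions for illustration only. The study itself uses a plain layer to change dimensionality instead.

```python
def zero_pad(x, size):
    """Extend x to `size` entries by appending zeros."""
    if size < len(x):
        raise ValueError("size must be at least len(x)")
    return list(x) + [0.0] * (size - len(x))


def sub_sample(x, size):
    """Shrink x to `size` entries by keeping evenly spaced elements."""
    if not 0 < size <= len(x):
        raise ValueError("size must be in 1..len(x)")
    return [x[(i * len(x)) // size] for i in range(size)]


def resize_input(x, size):
    """Return x_hat with exactly `size` entries (hypothetical helper)."""
    return zero_pad(x, size) if size >= len(x) else sub_sample(x, size)
```

For example, `resize_input([1.0, 2.0, 3.0], 5)` appends two zeros, while `resize_input([1, 2, 3, 4, 5, 6], 3)` keeps the elements at indices 0, 2 and 4.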

Convolutional highway layers are constructed similarly to fully connected layers. Weight-sharing and local receptive fields are utilized for both H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input.

2.2. Training Deep Highway Networks

For plain deep networks, training with SGD stalls at the beginning unless a specific weight initialization scheme is used such that the variance of the signals during forward and backward propagation is preserved initially (Glorot & Bengio, 2010; He et al., 2015). This initialization depends on the exact functional form of H.

For highway layers, we use the transform gate defined as T(x) = σ(WT^T x + bT), where WT is the weight matrix and bT the bias vector for the transform gates. This suggests a simple initialization scheme which is independent of the nature of H: bT can be initialized with a negative value (e.g. -1, -3 etc.) such that the network is initially biased towards carry behavior. This scheme is strongly inspired by the proposal of Gers et al. (1999) to initially bias the gates in a Long Short-Term Memory recurrent network to help bridge long-term temporal dependencies early in learning. Note that σ(x) ∈ (0, 1) for all x ∈ R, so the conditions in Equation (4) can never be exactly true.
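
To see what a negative bias does at the start of training, the sketch below evaluates the transform gate for the example values -1 and -3. It assumes that the term WT^T x is approximately zero at initialization (for instance because the weights are small), which is an assumption of this example and not a statement from the text. The function name `initial_gate_activity` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def initial_gate_activity(b_T):
    """Transform gate output when W_T^T x is approximately zero (assumed)."""
    return sigmoid(b_T)


for b_T in (-1.0, -3.0):
    t = initial_gate_activity(b_T)
    print(f"b_T = {b_T:+.0f}: T ~ {t:.3f}, carry 1 - T ~ {1.0 - t:.3f}")
```

Under this assumption the gate starts at roughly 0.27 for a bias of -1 and roughly 0.05 for a bias of -3, so most of each block's output is initially a copy of its input. The gate never reaches exactly 0 or 1, in line with the remark about Equation (4).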

In our experiments, we found that a negative bias initialization was sufficient for learning to proceed in very deep networks for various zero-mean initial distributions of WH and different activation functions used by H. This is a significant property, since in general it may not be possible to find effective initialization schemes for many choices of H.

3. Experiments
3.1. Optimization
Very deep plain networks become difficult to optimize even when using the variance-preserving initialization scheme from He et al. (2015). To show that highway networks do not suffer from depth in the same way, we run a series of experiments on the MNIST digit classification dataset. We measure the cross entropy error on the training set to investigate optimization without conflating it with generalization issues.

We train both plain networks and highway networks with the same architecture and varying depth. The first layer is always a regular fully-connected layer, followed by 9, 19, 49, or 99 fully-connected plain or highway layers and a single softmax output layer. The number of units in each layer is kept constant: 50 for highway networks and 71 for plain networks. That way the number of parameters is roughly the same for both. To make the comparison fair, we run a random search of 40 runs for both plain and highway networks to find good settings for the hyperparameters. We optimized the initial learning rate, momentum, learning rate decay rate, activation function for H (either ReLU or tanh) and, for highway networks, the value for the transform gate bias (between -1 and -10). All other weights were initialized following the scheme introduced by He et al. (2015).
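
The choice of 50 units for highway layers and 71 for plain layers can be checked with a short count. The sketch below assumes that each hidden layer maps a vector to one of the same width and has one bias per unit, and that a highway layer holds one such parameter set for H and one for T. It ignores the first and the output layer. The function names are hypothetical.

```python
def plain_layer_params(units):
    """Weights and biases of one fully-connected layer with equal input/output width."""
    return units * units + units


def highway_layer_params(units):
    """A highway layer holds two such parameter sets: one for H and one for T."""
    return 2 * plain_layer_params(units)


print(highway_layer_params(50))  # 5100
print(plain_layer_params(71))    # 5112
```

Under these assumptions a 50-unit highway layer has 5100 parameters and a 71-unit plain layer has 5112, which agrees with the statement that the parameter counts are roughly the same.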

The convergence plots for the best performing networks for each depth can be seen in Figure 1. While 10-layer plain networks show very good performance, their performance degrades significantly as depth increases. Highway networks, on the other hand, do not seem to suffer from an increase in depth at all. The final result of the 100-layer highway network is about 1 order of magnitude better than the 10-layer one, and is on par with the 10-layer plain network. In fact, we started training a similar 900-layer highway network on CIFAR-100 which is only at 80 epochs as of now, but so far has shown no signs of optimization difficulties. It is also worth pointing out that the highway networks always converge significantly faster than the plain ones.

3.2. Comparison to Fitnets
Deep highway networks are easy to optimize, but are they also beneficial for supervised learning where we are interested in generalization performance on a test set? To address this question, we compared highway networks to the thin and deep architectures termed Fitnets proposed recently by Romero et al. (2014) on the CIFAR-10 dataset augmented with random translations. Results are summarized in Table 1.

Romero et al. (2014) reported that training using plain backpropagation was only possible for maxout networks with depth up to 5 layers when the number of parameters was limited to ∼250K and the number of multiplications to ∼30M. Training of deeper networks was only possible through the use of a two-stage training procedure and the addition of soft targets produced from a pre-trained shallow teacher network (hint-based training). Similarly, it was only possible to train 19-layer networks with a budget of 2.5M parameters using hint-based training.

We found that it was easy to train highway networks with a number of parameters and operations comparable to Fitnets directly using backpropagation. As shown in Table 1, Highway 1 and Highway 4, which are based on the architecture of Fitnet 1 and Fitnet 4 respectively, obtain similar or higher accuracy on the test set. We were also able to train thinner and deeper networks: a 19-layer highway network with ∼1.4M parameters and a 32-layer highway network with ∼1.25M parameters both perform similarly to the teacher network of Romero et al. (2014).

4. Analysis
In Figure 2 we show some inspections of the inner workings of the best 50 hidden layer fully-connected highway networks trained on MNIST (top row) and CIFAR-100 (bottom row). The first three columns show, for each transform gate, the bias, the mean activity over 10K random samples, and the activity for a single random sample respectively. The block outputs for the same single sample are displayed in the last column.

The transform gate biases of the two networks were initialized to -2 and -4 respectively. It is interesting to note that contrary to our expectations most biases actually decreased further during training. For the CIFAR-100 network the biases increase with depth forming a gradient. Curiously this gradient is inversely correlated with the average activity of the transform gates as seen in the second column. This indicates that the strong negative biases at low depths are not used to shut down the gates, but to make them more selective. This behavior is also suggested by the fact that the transform gate activity for a single example (column 3) is very sparse. This effect is more pronounced for the CIFAR-100 network, but can also be observed to a lesser extent in the MNIST network.

The last column of Figure 2 displays the block outputs and clearly visualizes the concept of “information highways”. Most of the outputs stay constant over many layers forming a pattern of stripes. Most of the change in outputs happens in the early layers (≈ 10 for MNIST and ≈ 30 for CIFAR-100). We hypothesize that this difference is due to the higher complexity of the CIFAR-100 dataset.

In summary, it is clear that highway networks actually utilize the gating mechanism to pass information almost unchanged through many layers. This mechanism serves not just as a means for easier training, but is also heavily used to route information in a trained network. We observe very selective activity of the transform gates, varying strongly in reaction to the current input patterns.

5. Conclusion
Learning to route information through neural networks has helped to scale up their application to challenging problems by improving credit assignment and making training easier (Srivastava et al., 2015). Even so, training very deep networks has remained difficult, especially without considerably increasing total network size.

Highway networks are novel neural network architectures which enable the training of extremely deep networks using simple SGD. While the traditional plain neural architectures become increasingly difficult to train with increasing network depth (even with variance-preserving initialization), our experiments show that optimization of highway networks is not hampered even as network depth increases to a hundred layers.

The ability to train extremely deep networks opens up the possibility of studying the impact of depth on complex problems without restrictions. Various activation functions which may be more suitable for particular problems but for which robust initialization schemes are unavailable can be used in deep highway networks. Future work will also attempt to improve the understanding of learning in highway networks.

Acknowledgments
This research was supported by the EU project “NASCENCE” (FP7-ICT-317662). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPUs used for this research.
