Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
1. Introduction
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
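To make the F(x) + x formulation concrete, the sketch below shows a minimal residual block in PyTorch (the paper itself mentions implementation with common libraries such as Caffe). The block name, channel count, and use of batch normalization are illustrative assumptions following common ResNet practice, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Minimal residual block: y = F(x) + x, where F is two stacked conv layers.

    A sketch of the idea only; kernel sizes and batch normalization follow
    common practice and are not necessarily the paper's exact configuration.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(x): the stacked nonlinear layers fit the residual mapping.
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Identity shortcut: added element-wise, no extra parameters or computation.
        return F.relu(residual + x)


# Usage: the block preserves the input shape, so it can be stacked arbitrarily deep.
x = torch.randn(1, 64, 56, 56)
y = BasicBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

If F were driven to zero, each block would reduce to an identity mapping, which is exactly the constructed solution discussed above for the degradation problem.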
We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
2. Related Work
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
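As a toy illustration of encoding residual vectors with respect to a dictionary (the idea behind VLAD-style representations), the NumPy sketch below aggregates the residuals of local descriptors to their nearest codeword; the array sizes and function name are assumptions for illustration, not the formulation in [18].

```python
import numpy as np


def encode_residuals(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Toy VLAD-style encoding: sum residuals of descriptors w.r.t. their
    nearest codeword, then concatenate the per-codeword sums.

    descriptors: (n, d) local descriptors
    codebook:    (k, d) dictionary of visual words
    returns:     (k * d,) residual representation
    """
    # Assign each descriptor to its nearest codeword.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)

    k, d = codebook.shape
    encoding = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[assignment == i]
        if len(assigned):
            # Residual vectors: descriptor minus its assigned codeword.
            encoding[i] = (assigned - codebook[i]).sum(axis=0)
    return encoding.reshape(-1)


rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 8))
words = rng.normal(size=(4, 8))
print(encode_residuals(desc, words).shape)  # (32,)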
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.
Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
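To contrast the two kinds of shortcut, the hedged sketch below compares a highway-style gated shortcut with the parameter-free identity shortcut; the gating follows the general T(x)·F(x) + (1 − T(x))·x form described for highway networks, with layer sizes chosen only for illustration.

```python
import torch
import torch.nn as nn

d = 16
x = torch.randn(1, d)
layers = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # the stacked layers F

# Highway-style shortcut: a learned, data-dependent gate T(x) in [0, 1].
# When T(x) approaches 0 the shortcut is "closed" and the output is non-residual.
gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())
t = gate(x)
y_highway = t * layers(x) + (1.0 - t) * x

# Identity shortcut: parameter-free and never closed; all information passes
# through, and only the residual F(x) remains to be learned.
y_residual = layers(x) + x

print(y_highway.shape, y_residual.shape)
```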