Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
1. Introduction
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
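To make the F(x) + x formulation concrete, the sketch below shows a minimal residual block in PyTorch (the paper itself mentions implementation with common libraries such as Caffe). The block name, channel count, and use of batch normalization are illustrative assumptions following common ResNet practice, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Minimal residual block: y = F(x) + x, where F is two stacked conv layers.

    A sketch of the idea only; kernel sizes and batch normalization follow
    common practice and are not necessarily the paper's exact configuration.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(x): the stacked nonlinear layers fit the residual mapping.
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Identity shortcut: added element-wise, no extra parameters or computation.
        return F.relu(residual + x)


# Usage: the block preserves the input shape, so it can be stacked arbitrarily deep.
x = torch.randn(1, 64, 56, 56)
y = BasicBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

If F were driven to zero, each block would reduce to an identity mapping, which is exactly the constructed solution discussed above for the degradation problem.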
We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
2. Related Work
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
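As a toy illustration of encoding residual vectors with respect to a dictionary (the idea behind VLAD-style representations), the NumPy sketch below aggregates the residuals of local descriptors to their nearest codeword; the array sizes and function name are assumptions for illustration, not the formulation in [18].

```python
import numpy as np


def encode_residuals(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Toy VLAD-style encoding: sum residuals of descriptors w.r.t. their
    nearest codeword, then concatenate the per-codeword sums.

    descriptors: (n, d) local descriptors
    codebook:    (k, d) dictionary of visual words
    returns:     (k * d,) residual representation
    """
    # Assign each descriptor to its nearest codeword.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)

    k, d = codebook.shape
    encoding = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[assignment == i]
        if len(assigned):
            # Residual vectors: descriptor minus its assigned codeword.
            encoding[i] = (assigned - codebook[i]).sum(axis=0)
    return encoding.reshape(-1)


rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 8))
words = rng.normal(size=(4, 8))
print(encode_residuals(desc, words).shape)  # (32,)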
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.
Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
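To contrast the two kinds of shortcut, the hedged sketch below compares a highway-style gated shortcut with the parameter-free identity shortcut; the gating follows the general T(x)·F(x) + (1 − T(x))·x form described for highway networks, with layer sizes chosen only for illustration.

```python
import torch
import torch.nn as nn

d = 16
x = torch.randn(1, d)
layers = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # the stacked layers F

# Highway-style shortcut: a learned, data-dependent gate T(x) in [0, 1].
# When T(x) approaches 0 the shortcut is "closed" and the output is non-residual.
gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())
t = gate(x)
y_highway = t * layers(x) + (1.0 - t) * x

# Identity shortcut: parameter-free and never closed; all information passes
# through, and only the residual F(x) remains to be learned.
y_residual = layers(x) + x

print(y_highway.shape, y_residual.shape)
```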