Paper link: Going deeper with convolutions


  • Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network.
Through a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.
  • Introduction
In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks []. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate.
On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers.
For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first, in the sense that we introduce a new level of organization in the form of the “Inception module”, and second, in the more direct sense of increased network depth.
In general, one can view the Inception model as a logical culmination of [] while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
  • Related Work
Starting with LeNet-5 [], convolutional neural networks (CNN) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge [, ].
For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [] and layer size [, ], while using dropout [] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [] has also been successfully employed for localization [, ], object detection [, , , ] and human pose estimation [].
Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here.
However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture.
However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for increasing not just the depth, but also the width of our networks without a significant performance penalty.
Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion, and using CNN classifiers to identify object categories at those locations.
Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the strong classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
  • Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes increasing both the depth (the number of network levels) and the width (the number of units at each level). This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge. Domain knowledge is required to distinguish between these classes.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck, as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset), as shown in Figure 1.
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation.
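The quadratic growth can be checked with a little arithmetic. The sketch below is illustrative only: the helper names, the 56×56 feature map, the 3×3 kernels and the filter counts are assumptions chosen for the example, not values taken from the paper.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of one stride-1 convolution whose output is height x width."""
    return height * width * in_channels * out_channels * kernel * kernel


def chained_cost(filters_a, filters_b, in_channels=3, size=56, kernel=3):
    """Cost of two chained convolutions; the second consumes the first's filters."""
    first = conv_multiply_adds(size, size, in_channels, filters_a, kernel)
    second = conv_multiply_adds(size, size, filters_a, filters_b, kernel)
    return first, second


base_first, base_second = chained_cost(64, 64)
wide_first, wide_second = chained_cost(128, 128)

# Doubling both filter counts doubles the first layer's cost (its input is fixed)
# but quadruples the second layer's cost (both its input and output doubled).
first_ratio = wide_first / base_first     # 2.0
second_ratio = wide_second / base_second  # 4.0
```

The second layer dominates as soon as the chain is longer than one layer, which is why a uniform widening of the whole network scales roughly with the square of the width factor.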
If the added capacity is used inefficiently (for example, if most weights end up close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2].
Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable even under less strict conditions, in practice.
Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.
The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [, ]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure.
Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer.
ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have a uniform structure. The large number of filters and greater batch size allow for the efficient use of dense computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. []) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12].
With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [] and []. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally.
One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.
  • Architectural Details
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks.
All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation.
These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
        (a) Inception module, naïve version
One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters of the pooling path equals the number of filters in the previous stage.
The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
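The stage-to-stage growth of the naïve module is easy to see by counting channels. This is a minimal sketch under assumed numbers: the function name, the starting width of 192 and the per-branch filter counts (64, 128, 32) are hypothetical and only serve the illustration.

```python
def naive_module_out_channels(in_channels, n1x1, n3x3, n5x5):
    """Output depth of a naive Inception module.

    The three convolution branches contribute their own filter counts, while
    the parallel pooling branch passes through all in_channels unchanged, so
    the concatenated output is always wider than the input.
    """
    return n1x1 + n3x3 + n5x5 + in_channels


channels = 192  # assumed input depth
history = [channels]
for _ in range(4):
    channels = naive_module_out_channels(channels, n1x1=64, n3x3=128, n5x5=32)
    history.append(channels)

# history == [192, 416, 640, 864, 1088]: every stage adds 224 channels on top
# of everything it received, and the 3x3 and 5x5 branches of the next stage
# must convolve over that ever wider input.
```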
This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and the signals should be compressed only when they have to be aggregated en masse.
That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, making them dual-purpose. The final result is depicted in Figure 2(b).
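The saving from placing a 1×1 reduction in front of a 5×5 convolution can be estimated by counting multiply-adds. The sketch below is a back-of-the-envelope calculation, not the authors' code; the 28×28 grid, the 192 input channels, the 16 reduction filters and the 32 output filters are assumed values picked for illustration.

```python
def conv_madds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of one stride-1 convolution whose output is height x width."""
    return height * width * in_channels * out_channels * kernel * kernel


H = W = 28        # assumed spatial size of the feature map
C_IN = 192        # assumed depth of the incoming feature map
C_REDUCE = 16     # assumed number of 1x1 reduction filters
C_OUT = 32        # assumed number of 5x5 filters

# 5x5 convolution applied directly to the wide input.
direct = conv_madds(H, W, C_IN, C_OUT, 5)

# 1x1 reduction to C_REDUCE channels, followed by the 5x5 convolution.
reduced = conv_madds(H, W, C_IN, C_REDUCE, 1) + conv_madds(H, W, C_REDUCE, C_OUT, 5)

print(direct, reduced, round(direct / reduced, 2))
# prints: 120422400 12443648 9.68
```

With these assumed sizes the reduced path needs roughly a tenth of the arithmetic, while the 5×5 branch still produces the same number of output filters.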
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary; it simply reflects some infrastructural inefficiencies in our current implementation.
    (b) Inception module with dimensionality reduction
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes.
Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it.
We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 2-3× faster than similarly performing networks with non-Inception architecture. However, this requires careful manual design at this point.

  • GoogLeNet
By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally.
We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image patch sampling methods) was used for 6 out of the 7 models in our ensemble.
                    Table 1: GoogLeNet incarnation of the Inception architecture
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices, including even those with limited computational resources, especially with a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure.
The use of average pooling before the classifier is based on [], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets; however, it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, the authors expected to encourage discrimination in the lower stages of the classifier.
This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.
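The discounted auxiliary losses amount to a weighted sum during training and to nothing at inference. A minimal sketch, assuming scalar loss values have already been computed; the function and argument names are hypothetical, only the 0.3 weight comes from the text.

```python
AUX_WEIGHT = 0.3  # discount weight for the auxiliary classifiers, as stated in the text


def total_training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Main classifier loss plus the discounted losses of the auxiliary heads."""
    return main_loss + aux_weight * sum(aux_losses)


def inference_loss(main_loss):
    """At inference time the auxiliary heads are discarded, so only the main loss remains."""
    return main_loss


# Example with made-up loss values for the main head and the (4a) and (4d) heads.
loss = total_training_loss(1.0, [2.0, 3.0])  # 1.0 + 0.3 * (2.0 + 3.0)
```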

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
1. An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage and 4×4×528 for the (4d) stage.
2. A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
3. A fully connected layer with 1024 units and rectified linear activation.
4. A dropout layer with a 70% ratio of dropped outputs.
5. A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
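The five steps above can be traced as a sequence of output shapes. This sketch only does shape bookkeeping and is not a trainable layer stack; it assumes a 14×14 spatial grid at the (4a) and (4d) outputs, which is not stated in this text, and uses unpadded pooling. All names are hypothetical.

```python
def pooled_size(size, kernel, stride):
    """Output length of an unpadded pooling window sliding over `size` positions."""
    return (size - kernel) // stride + 1


def aux_head_shapes(in_channels, spatial=14, num_classes=1000):
    """List (layer description, output shape) for the auxiliary classifier.

    `spatial=14` is an assumption about the grid size at Inception (4a)/(4d).
    """
    s = pooled_size(spatial, kernel=5, stride=3)
    return [
        ("average pool 5x5, stride 3", (s, s, in_channels)),
        ("1x1 conv, 128 filters, ReLU", (s, s, 128)),
        ("fully connected, 1024 units, ReLU", (1024,)),
        ("dropout, 70% dropped", (1024,)),
        ("linear + softmax", (num_classes,)),
    ]


for name, shape in aux_head_shapes(512):   # (4a) stage
    print(f"{name:36s} -> {shape}")
```

Under the 14×14 assumption the pooling step yields 4×4×512 for (4a) and 4×4×528 for (4d), matching the sizes quoted in step 1.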
A schematic view of the resulting network is depicted in Figure 3.
  • Training Methodology
GoogLeNet networks were trained using the DistBelief [] distributed machine learning system using a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage.
Our training used asynchronous stochastic gradient descent with 0.9 momentum [] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [] was used to create the final model used at inference time.
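A step schedule and parameter averaging of this kind can be sketched in a few lines. The code assumes the 4%-every-8-epochs decay reported in the original paper; the base learning rate of 0.01 is a placeholder (the text gives none), and the uniform average over saved snapshots is one simple form of Polyak averaging, not necessarily the exact variant the authors used.

```python
def learning_rate(epoch, base_lr=0.01, drop=0.04, every=8):
    """Step schedule: multiply the rate by (1 - drop) once per `every` epochs.

    base_lr is a hypothetical starting value chosen for illustration.
    """
    return base_lr * (1.0 - drop) ** (epoch // every)


def polyak_average(snapshots):
    """Uniform average of parameter snapshots, each given as a flat list of floats."""
    n = len(snapshots)
    return [sum(values) / n for values in zip(*snapshots)]


rates = [learning_rate(e) for e in (0, 7, 8, 16)]
averaged = polyak_average([[1.0, 2.0], [3.0, 4.0]])
```

The averaged parameters, rather than the last iterate, would then be the ones loaded for inference.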
Image sampling methods changed substantially over the months leading up to the competition, and already converged models were trained further with other options, sometimes in conjunction with changed hyperparameters such as dropout and the learning rate. Therefore, it is hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [].
Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of the training data.
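The patch-sampling prescription can be sketched as rejection sampling over area and aspect ratio. The area range (8% to 100%) and the aspect interval [3/4, 4/3] come from the text; drawing the aspect ratio uniformly, the retry limit and the whole-image fallback are assumptions of this sketch, and the function name is hypothetical.

```python
import math
import random


def sample_crop(img_w, img_h, rng=random, max_tries=10):
    """Return (x, y, w, h) of a random patch inside an img_w x img_h image."""
    area = img_w * img_h
    for _ in range(max_tries):
        target_area = rng.uniform(0.08, 1.0) * area   # 8% to 100% of the image
        aspect = rng.uniform(3 / 4, 4 / 3)            # width / height (assumed uniform)
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= img_w and 0 < h <= img_h:
            x = rng.randint(0, img_w - w)             # inclusive bounds
            y = rng.randint(0, img_h - h)
            return x, y, w, h
    # Fallback (an assumption of this sketch): use the whole image.
    return 0, 0, img_w, img_h


rng = random.Random(0)
x, y, w, h = sample_crop(640, 480, rng)
```

The sampled patch would then be resized to the network's input resolution before being fed to training.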
  • ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions.
Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank among them. The challenge uses the top-5 error rate for ranking purposes.
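Both metrics are instances of a top-k error, which can be computed as follows. This is a minimal sketch with hypothetical names; it takes per-class scores as plain lists and breaks ties by class index.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose ground-truth label is not among the k best scores.

    scores: one list of per-class scores per image
    labels: ground-truth class index per image
    """
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Toy example with 3 classes and 2 images.
scores = [[0.1, 0.7, 0.2],   # ranking: class 1, class 2, class 0
          [0.5, 0.3, 0.2]]   # ranking: class 0, class 1, class 2
labels = [2, 0]

top1 = top_k_error(scores, labels, k=1)  # 0.5: the first image is misclassified
top2 = top_k_error(scores, labels, k=2)  # 0.0: class 2 is within the top 2
```

The challenge metric corresponds to k=5 over the 1000 classes, and the top-1 accuracy is one minus the k=1 error.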
We participated in the challenge with no external data used for training. In addition to the training techniques mentioned above, we adopted a set of techniques during testing to obtain higher performance, which we describe next.
1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.
2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and took the left, center and right square of these resized images (in the case of portrait images, we took the top, center and bottom squares). For each square, we then took the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions.
This leads to 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).
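The crop count is the size of a Cartesian product of four independent choices: 4 scales, 3 squares per scale, 6 views per square (4 corners, the center crop and the resized square) and 2 mirror states. The sketch below only enumerates crop descriptors and does no image processing; the label strings are hypothetical.

```python
from itertools import product

SCALES = (256, 288, 320, 352)            # length of the shorter image side
SQUARES = ("left", "center", "right")    # top/center/bottom for portrait images
VIEWS = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "resized_square")     # 4 corners + center crop + resized square
MIRRORED = (False, True)

crops = list(product(SCALES, SQUARES, VIEWS, MIRRORED))
print(len(crops))
# prints: 144
```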
3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared with simple averaging.
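Simple averaging over crops and classifiers can be sketched as a mean over every (model, crop) probability vector. The nested-list layout and the function names are assumptions of this example; with the same number of crops for every model, the flat mean below equals averaging over crops first and over models second.

```python
def average_predictions(probs):
    """Average class probabilities over all models and crops.

    probs[m][c] is the softmax output (a list of class probabilities)
    of model m on crop c.
    """
    flat = [p for model in probs for p in model]
    n = len(flat)
    return [sum(column) / n for column in zip(*flat)]


def predict(probs):
    """Index of the class with the highest averaged probability."""
    avg = average_predictions(probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Toy example: 2 models, 2 crops each, 2 classes.
toy = [
    [[0.6, 0.4], [0.2, 0.8]],   # model 0
    [[0.3, 0.7], [0.5, 0.5]],   # model 1
]
avg = average_predictions(toy)   # approximately [0.4, 0.6]
label = predict(toy)             # class 1
```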
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about a 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. Table 2 shows the statistics of some of the top-performing approaches over the past years.

          Table 2: Classification performance.

      Table 3: GoogLeNet classification performance break down.
We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image, in Table 3. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.
  • ILSVRC 2014 Detection Challenge Setup and Results
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP).
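The matching criterion can be written down directly: the Jaccard index of two boxes is the area of their intersection divided by the area of their union. A minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; the function names are hypothetical.

```python
def jaccard(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection counts as correct if the class matches and the overlap is >= 50%."""
    return pred_class == gt_class and jaccard(pred_box, gt_box) >= threshold


good = is_correct((0, 0, 4, 4), "dog", (0, 0, 4, 3), "dog")   # overlap 12 / 16 = 0.75
bad = is_correct((0, 0, 2, 2), "dog", (1, 0, 3, 2), "dog")    # overlap 2 / 6, below 0.5
```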
The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to reduce the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm.
We added back 200 region proposals coming from multi-box [5], resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 GoogLeNets when classifying each region. This leads to an increase in accuracy from 40% to 43.9%. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use convolutional networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models.

      Table 4: Comparison of detection performances. Unreported values are noted with question marks
The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models, while GoogLeNet obtains significantly stronger results with the ensemble.
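#In Table 5 we compare results obtained with a single model only. The best single model is from Deep Insight; surprisingly, its ensemble of 3 models improves the score by only 0.3 points, whereas GoogLeNet gains much more from its ensemble.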

      Table 5: Single model performance for detection.
  • Conclusions
Our results yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and narrower architectures.
Our object detection work was competitive despite neither utilizing context nor performing bounding box regression, which provides further evidence of the strengths of the Inception architecture.
For both classification and detection, it is expected that similar quality of result can be achieved by much more expensive non-Inception-type networks of similar depth and width. Still, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests future work towards creating sparser and more refined structures in automated ways on the basis of [2], as well as on applying the insights of the Inception architecture to other domains.

