You Only Look Once:Unified, Real-Time Object Detection
Abstract
Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.
Introduction
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
1) YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. YOLO achieves more than twice the mean average precision of other real-time systems.
2) YOLO reasons globally about the image when making predictions. YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. YOLO makes less than half the number of background errors compared to Fast R-CNN.
3) YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
但是YOLO在accuracy上仍然落后于state-of-the-art的检测系统,但是它可以快速并努力更准确的定位图像中的物体,特别是较小的物体。
Unified Detection
Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously.
此系统将输入图像resize到S x S的grid。如果一个object的中心落在一个grid中,那么这个grid理所当然要detect这个object。
每一个grid预测B个bounding boxes和对应的confidence scores。这些confidence scores反应了模型对box包含object和它预测的box的精度的confidence。Confidence可以表示为:
如果Cell里面没有object,我们希望confidence score为0,否则我们希望confidence score等于预测的box和ground truth之间IOU的交集。
每个bounding box有5个预测:x, y, w, h, confidence。其中(x, y)代表与grid cell边界相关的box的中心坐标。(w, h)即预测的宽度与高度与整副图像相关。预测的confidence代表了预测框box与ground truth box之间的IOU。
注意(x, y)的预测,论文中这么说的:
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
当时没搞懂这句话的意思,直到看到YOLO9000文章中的这一句:
Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell
即(x, y)预测的其实是相对于每一个grid cell位置的偏移量。
每个grid也会预测C个有条件的class概率:
这些概率取决于包含object的grid cell。每个grid cell只预测一套class概率,而不管boxes B的数量。
综上,yolo的输出维度应该是: output_size = (cell_size * cell_size) * (num_class + boxes_per_cell * 5)。即每个grid cell有num_class + boxes_per_cell * 5个预测。
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
两种网络结构设计
1. YOLO的网络结构设计
此网络基于卷积网络,前面的卷积层用来extract features,后面的全连接层用来预测输出概率和坐标。并使用了1x1的reduction layers在3x3的卷积层前面。该网络共有24层卷积层,2层全连接层。完整网络见Figure 3.
2. Fast YOLO
具有更少的卷积层(9层),且在这些层中具有更少的filter。除了这些,其他包括train和testing paramerters都与YOLO相同。
网络的最终输出都是7x7x30的tensor的预测。
训练
关于网络
卷积层在有1000-class的ImageNet上预训练。预训练的时候使用Figure 3中所示网络的前20层卷积层,后面接着一个average-pooling层,然后是一个全连接层。We achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set。所有的training和inference都使用Darknet。
注意了,作者预训练的时候只用了20个卷积层和一个全连接层,为啥??
因为后面说了,Ren等人在文献[29]中证明了在预训练的网络上添加卷积层和全连接层可以提高性能。所以作者在预训练的网络后面加了4个卷积层和两个全连接层,并且随机初始化了weights!
还有一点就是预训练的时候的input image的resolution是224x224,作者在应用自己的Detection网络时将input resolution从224x224增加到了448x448. 为啥??因为Detection often requires fine-grained visual information。
关于参数
将bounding box的width和height归一化到(0,1]之间,x和y作为特定grid cell的位置偏移量也在(0,1]之间。
最后一层使用线性激活函数,其他层后面使用leaky relu作为激活函数:
loss function:
此模型的损失函数如上式(3)所示。
问题来了,作者为啥不用sum-squared error?为啥??
因为:
it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
怎么解决?
增加来自bounding box coordinate的loss,减少不包含object的boxes的confidence预测的loss。这两个操作是通过两个参数完成的:λcoord和λnoobj,并将它们设为:λcoord=5,λnoobj=5。
还有一点要注意,Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly。直接预测bounding box的宽度和高度的平方根,而不是宽度和高度。
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell)
超参数:
epochs: 135
batch size: 64
momentum: 0.9
decay: 0.0005
learning rate schedule is as follows:
For the first epochs we slowly raise the learning rate from 10-3 to 10-2. We continue training with 10-2 for 75 epochs, then 10-3 for 30 epochs, and finally 10-4 for 30 epochs.
学习率慢慢减小容易理解,但是为什么一开始的时候是先小后大?因为如果一开始学习率比较高,模型会因为梯度不稳定而偏移:If we start at a high learning rate our model often diverges due to unstable gradients.
防止过拟合的两个方法:
Dropout rate: 0.5. After the first connected layer prevents co-adaptation between layers [18].
data augmentation: 我们引入了随机缩放和最多20%原始图像大小的翻译。 我们还在HSV色彩空间中随机调整图像的曝光和饱和度,最高可达1:5.
Limitations of YOLO
很强的空间约束。每个grid cell只能预测两个bounding box,且只能有一个class。这种空间约束限制了模型预测靠的比较近的object的数量。当有一群小物体密集的时候模型很难正确预测bounding box。
因为结构中有不少downsampling layers,模型也会用较粗糙的features来预测bounding boxes。
loss function会将小bounding box和大bounding box的error同等对待。一个小的error对大的box影响不大,但是对小的box的IOU有较大的影响。
Experiments
我们可以看到,YOLO在保持较高的准确率的情况下还能有45帧的处理速度。
Conclusion
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.
- YOLO V1论文理解
摘要 作者提出了一种新的物体检测方法YOLO.YOLO之前的物体检测方法主要是通过region proposal产生大量的可能包含待检测物体的 potential bounding box,再用分类器 ...
- 目标检测:YOLO(v1 to v3)——学习笔记
前段时间看了YOLO的论文,打算用YOLO模型做一个迁移学习,看看能不能用于项目中去.但在实践过程中感觉到对于YOLO的一些细节和技巧还是没有很好的理解,现学习其他人的博客总结(所有参考连接都附于最后 ...
- 目标检测论文解读5——YOLO v1
背景 之前热门的目标检测方法都是two stage的,即分为region proposal和classification两个阶段,本文是对one stage方法的初次探索. 方法 首先看一下模型的网络 ...
- Object Detection(RCNN, SPPNet, Fast RCNN, Faster RCNN, YOLO v1)
RCNN -> SPPNet -> Fast-RCNN -> Faster-RCNN -> FPN YOLO v1-v3 Reference RCNN: Rich featur ...
- 论文笔记(2):A fast learning algorithm for deep belief nets.
论文笔记(2):A fast learning algorithm for deep belief nets. 这几天继续学习一篇论文,Hinton的A Fast Learning Algorithm ...
- 目标检测之YOLO V1
前面介绍的R-CNN系的目标检测采用的思路是:首先在图像上提取一系列的候选区域,然后将候选区域输入到网络中修正候选区域的边框以定位目标,对候选区域进行分类以识别.虽然,在Faster R-CNN中利用 ...
- YOLO V2论文理解
概述 YOLO(You Only Look Once: Unified, Real-Time Object Detection)从v1版本进化到了v2版本,作者在darknet主页先行一步放出源代码, ...
- 论文笔记:CNN经典结构1(AlexNet,ZFNet,OverFeat,VGG,GoogleNet,ResNet)
前言 本文主要介绍2012-2015年的一些经典CNN结构,从AlexNet,ZFNet,OverFeat到VGG,GoogleNetv1-v4,ResNetv1-v2. 在论文笔记:CNN经典结构2 ...
- YOLO V1、V2、V3算法 精要解说
前言 之前无论是传统目标检测,还是RCNN,亦或是SPP NET,Faste Rcnn,Faster Rcnn,都是二阶段目标检测方法,即分为“定位目标区域”与“检测目标”两步,而YOLO V1,V2 ...
随机推荐
- jsp自己主动编译机制
总的来说,Jasper的自己主动检測实现的机制比較简单,依靠某后台线程不断检測JSP文件与编译后的class文件的最后改动时间是否同样,若同样则觉得没有改动.但倘若不同则须要又一次编译.实际上因为在T ...
- PHPExcel 对xls表格的文件读取
//先去http://phpexcel.codeplex.com/releases/view/119187 下载文件包最后解压Classes到当前目录header("Content-type ...
- 批量Linux、Windows管理工具BatchShell 1.2(最新版)
简介: BatchShell是什么: BatchShell是一款基于SSH2的批量文件传输及命令执行工具,它可以同时传输文件到多台远程服务器以及同时对多台远程服务器执行命令.具备以下主要功能: ...
- Atitit。 沉思录 与it软件开发管理中的总结 读后感
Atitit. 沉思录 与it软件开发管理中的总结 读后感 1. <沉思录>,古罗马唯一一位哲学家皇帝马可·奥勒留所著 2 2. 沉思录与it软件开发管理中的总结 2 2.1. 要有自己的 ...
- 【Objective-C】02-Objective-C学习及iOS开发的准备
前言 由于大部分朋友对iOS开发还是有比较多的疑问,比如是不是一定要买iPhone和苹果笔记本才能做iOS开发?因此,这讲就为广大有意学习iOS开发的朋友们进行解惑. 一.什么是iOS开发 iOS是由 ...
- 进程隐藏与进程保护(SSDT Hook 实现)(三)
文章目录: 1. 引子: 2. 获取当前系统下所有进程: 3. 服务管理(安装,启动,停止,卸载): 4. 应用程序和内核程序通信: 5. 小结: 1. 引子: 关于这个 SSDT Hook 实现进程 ...
- Compiler Error C2872: ambiguous symbol
参考资料:http://blog.csdn.net/greytree/article/details/354530 刚才写的程序报错ERROR C2872(CL.exe)原因很简单 ZThread有定 ...
- 【持续更新】总结:C++开发时积累的一些零碎的东西
Makefile template update: 1.调整了顺序,把经常编辑的部分集中在了Makefile的下半部分 2.进行了一些重构实现更多的代码复用,见红色高亮部分,LIBPATH_ALL和L ...
- ExtPager ,分页
package cn.edu.hbcf.common.vo; public class ExtPager { private Integer start; private Integer limit; ...
- 转载:ffmpeg 音视频合成分割
http://blog.csdn.net/jixiuffff/article/details/5709976 当然先安装了 gentoo 下一条命令搞定 emerge ffmpeg 格式转换 (将f ...