Why are Eight Bits Enough for Deep Neural Networks?

Deep learning is a very weird technology. It evolved over decades on a very different track than the mainstream of AI, kept alive by the efforts of a handful of believers. When I started using it a few years ago, it reminded me of the first time I played with an iPhone – it felt like I’d been handed something that had been sent back to us from the future, or alien technology.

One of the consequences of that is that my engineering intuitions about it are often wrong. When I came across im2col, the memory redundancy seemed crazy, based on my experience with image processing, but it turns out it’s an efficient way to tackle the problem. While there are more complex approaches that can yield better results, they’re not the ones my graphics background would have predicted.

Another key area that seems to throw a lot of people off is how much precision you need for the calculations inside neural networks. For most of my career, precision loss has been a fairly easy thing to estimate. I almost never needed more than 32-bit floats, and if I did it was because I’d screwed up my numerical design and I had a fragile algorithm that would go wrong pretty soon even with 64 bits. 16-bit floats were good for a lot of graphics operations, as long as they weren’t chained together too deeply. I could use 8-bit values for a final output for display, or at the end of an algorithm, but they weren’t useful for much else.

It turns out that neural networks are different. You can run them with eight-bit parameters and intermediate buffers, and suffer no noticeable loss in the final results. This was astonishing to me, but it’s something that’s been re-discovered over and over again. My colleague Vincent Vanhoucke has the only paper I’ve found covering this result for deep networks, but I’ve seen with my own eyes how it holds true across every application I’ve tried it on. I’ve also had to convince almost every other engineer who I tell that I’m not crazy, and watch them prove it to themselves by running a lot of their own tests, so this post is an attempt to short-circuit some of that!

How does it work?

You can see an example of a low-precision approach in the Jetpac mobile framework, though to keep things simple I keep the intermediate calculations in float and just use eight bits to compress the weights. Nervana’s NEON library also supports fp16, though not eight-bit yet. As long as you accumulate to 32 bits when you’re doing the long dot products that are the heart of the fully-connected and convolution operations (and that take up the vast majority of the time) you don’t need float though, you can keep all your inputs and output as eight bit. I’ve even seen evidence that you can drop a bit or two below eight without too much loss! The pooling layers are fine at eight bits too, I’ve generally seen the bias addition and activation functions (other than the trivial relu) done at higher precision, but 16 bits seems fine even for those.

I’ve generally taken networks that have been trained in full float and down-converted them afterwards, since I’m focused on inference, but training can also be done at low precision. Knowing that you’re aiming at a lower-precision deployment can make life easier too, even if you train in float, since you can do things like place limits on the ranges of the activation layers.

Why does it work?

I can’t see any fundamental mathematical reason why the results should hold up so well with low precision, so I’ve come to believe that it emerges as a side-effect of a successful training process. When we are trying to teach a network, the aim is to have it understand the patterns that are useful evidence and discard the meaningless variations and irrelevant details. That means we expect the network to be able to produce good results despite a lot of noise. Dropout is a good example of synthetic grit being thrown into the machinery, so that the final network can function even with very adverse data.

The networks that emerge from this process have to be very robust numerically, with a lot of redundancy in their calculations so that small differences in input samples don’t affect the results. Compared to differences in pose, position, and orientation, the noise in images is actually a comparatively small problem to deal with. All of the layers are affected by those small input changes to some extent, so they all develop a tolerance to minor variations. That means that the differences introduced by low-precision calculations are well within the tolerances a network has learned to deal with. Intuitively, they feel like weebles that won’t fall down no matter how much you push them, thanks to an inherently stable structure.

At heart I’m an engineer, so I’ve been happy to see it works in practice without worrying too much about why, I don’t want to look a gift horse in the mouth! What I’ve laid out here is my best guess at the cause of this property, but I would love to see a more principled explanation if any researchers want to investigate more thoroughly? [Update – here’s a related paper from Matthieu Courbariaux, thanks Scott!]

What does this mean?

This is very good news for anyone trying to optimize deep neural networks. On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips. DRAM access takes a lot of electrical power though, and is slow too, so just reducing the bandwidth by 75% can be a very big help. Being able to squeeze more values into fast, low-power SRAM cache and registers is a win too.

GPUs were originally designed to take eight bit texture values, perform calculations on them at higher precisions, and then write them back out at eight bits again, so they’re a perfect fit for our needs. They generally have very wide pipes to DRAM, so the gains aren’t quite as straightforward to achieve, but can be exploited with a bit of work. I’ve learned to appreciate DSPs as great low-power solutions too, and their instruction sets are geared towards the sort of fixed-point operations we need. Custom vision chips like Movidius’ Myriad are good fits too.

Deep networks’ robustness means that they can be implemented efficiently across a very wide range of hardware. Combine this flexibility with their almost-magical effectiveness at a lot of AI tasks that have eluded us for decades, and you can see why I’m so excited about how they will alter our world over the next few years!

Why are Eight Bits Enough for Deep Neural Networks?的更多相关文章

  1. 为什么深度神经网络难以训练Why are deep neural networks hard to train?

    Imagine you're an engineer who has been asked to design a computer from scratch. One day you're work ...

  2. Training Deep Neural Networks

    http://handong1587.github.io/deep_learning/2015/10/09/training-dnn.html  //转载于 Training Deep Neural ...

  3. On Explainability of Deep Neural Networks

    On Explainability of Deep Neural Networks « Learning F# Functional Data Structures and Algorithms is ...

  4. Introduction to Deep Neural Networks

    Introduction to Deep Neural Networks Neural networks are a set of algorithms, modeled loosely after ...

  5. 深度神经网络入门教程Deep Neural Networks: A Getting Started Tutorial

    Deep Neural Networks are the more computationally powerful cousins to regular neural networks. Learn ...

  6. Classifying plankton with deep neural networks

    Classifying plankton with deep neural networks The National Data Science Bowl, a data science compet ...

  7. (Deep) Neural Networks (Deep Learning) , NLP and Text Mining

    (Deep) Neural Networks (Deep Learning) , NLP and Text Mining 最近翻了一下关于Deep Learning 或者 普通的Neural Netw ...

  8. Must Know Tips/Tricks in Deep Neural Networks

    Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)   Deep Neural Networks, especially C ...

  9. Coursera Deep Learning 2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization - week1, Assignment(Initialization)

    声明:所有内容来自coursera,作为个人学习笔记记录在这里. Initialization Welcome to the first assignment of "Improving D ...

随机推荐

  1. OpenCV学习笔记——图像平滑处理

    1.blur 归一化滤波器Blurs an image using the normalized box filter.C++: void blur(InputArray src, OutputArr ...

  2. 【第一周】第一周工作统计(psp)

    项目:词频统计 项目类型:个人项目 项目完成情况:已完成 项目改进:未变更 项目日期:2016.9.3-2016.9.4 3号 类别c 内容c 开始时间s 结束e 中断I 净时间T 项目实践 构思   ...

  3. PAT L1 - 046 整除光棍

    https://pintia.cn/problem-sets/994805046380707840/problems/994805084284633088 这里所谓的“光棍”,并不是指单身汪啦~ 说的 ...

  4. selenium webdriver 表格的定位方法练习

    selenium webdriver 表格的定位方法 html 数据准备 <html> <body> <div id="div1"> <i ...

  5. 使用fabric1.14.0和fabric2.4.0

    fabric1.14.0(支持Python2.5-2.7版本): from  fabric.api import * env.gateway = '192.168.181.2'            ...

  6. p2 休眠模式

    如有错误,忘请指出. 才入手p2.p2有全局休眠模式,和钢体体眠模式.钢体能控制 body.allowSleep world.NO_SLEEPING 不允许休眠world.BODY_SLEEPING ...

  7. 虚拟机中安装 centOS,本地安装 SSH 连接 - 01

    下面把自己安装 centOS 的过程记录下,选取的版本是 centOS6.8 ,下载地址在脚本之家 down 的 : 阿里云 x64 http://mirrors.aliyun.com/centos/ ...

  8. Java7 Fork-Join 框架:任务切分,并行处理

    概要 现代的计算机已经向多CPU方向发展,即使是普通的PC,甚至现在的智能手机.多核处理器已被广泛应用.在未来,处理器的核心数将会发展的越来越多.虽然硬件上的多核CPU已经十分成熟,但是很多应用程序并 ...

  9. 【bzoj2600】[Ioi2011]ricehub 双指针法

    题目描述 给出数轴上坐标从小到大的 $R$ 个点,坐标范围在 $1\sim L$ 之间.选出一段连续的点,满足:存在一个点,使得所有选出的点到其距离和不超过 $B$ .求最多能够选出多少点. $R\l ...

  10. P4169 [Violet]天使玩偶/SJY摆棋子

    题目背景 感谢@浮尘ii 提供的一组hack数据 题目描述 Ayu 在七年前曾经收到过一个天使玩偶,当时她把它当作时间囊埋在了地下.而七年后 的今天,Ayu 却忘了她把天使玩偶埋在了哪里,所以她决定仅 ...