Deep Learning in a Nutshell: Core Concepts
Deep Learning in a Nutshell: Core Concepts
This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction todeep learning. It covers the most important deep learning concepts and aims to provide an understanding of each concept rather than its mathematical and theoretical details. While the mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits comprising an intuitive overview of the field of deep learning.
I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.
Part 1 focuses on introducing the main concepts of deep learning. Future posts will provide historical background and delve into the training procedures, algorithms and practical tricks that are used in training for deep learning.
Core Concepts
MACHINE LEARNING
In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.
The learning process is often a game of back-and-forth in the parameter space: If you tweak a parameter of the model to get a prediction right, the model may have in such that it gets a previously correct prediction wrong. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
FEATURE ENGINEERING
Feature engineering is the art of extracting useful patterns from data that will make it easier forMachine Learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification.
Feature engineering is the most important skill when you want to achieve good results for most predictions tasks. However, it is difficult to learn and master since different data sets and different kinds of data require different feature engineering approaches. Only crude guidelines exist, which makes feature engineering more of an art than a science. Features that are usable for one data set often are not usable for other data sets (for example the next image data set only contains land animals). The difficulty of feature engineering and the effort involved is the main reason to seek algorithms that can learn features; that is, algorithms that automatically engineer features.
While many tasks can be automated by Feature Learning (like object and speech recognition), feature engineering remains the single most effective technique to do well in difficult tasks (like most tasks in Kaggle machine learning competitions).
FEATURE LEARNING
Feature learning algorithms find the common patterns that are important to distinguish between classes and extract them automatically to be used in a classification or regression process. Feature learning can be thought of as Feature Engineering done automatically by algorithms. In deep learning, convolutional layers are exceptionally good at finding good features in images to the next layer to form a hierarchy of nonlinear features that grow in complexity (e.g. blobs, edges -> noses, eyes, cheeks -> faces). The final layer(s) use all these generated features for classification or regression (the last layer in a convolutional net is, essentially, multinomial logistic regression).

Figure 1: Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter, which filters the input image for that feature (a nose). If the feature is found, the responsible unit or units generate large activations, which can be picked up by the later classifier stages as a good indicator that the class is present. Image by Honglak Lee and colleagues (2011) as published in “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”.
Figure 1 shows features generated by a deep learning algorithm that generates easily interpretable features. This is rather unusual. Features are normally difficult to interpret, especially in deep networks like recurrent neural networks and LSTMs or very deep convolutional networks.
DEEP LEARNING
In hierarchical Feature Learning, we extract multiple layers of non-linear features and pass them to a classifier that combines all the features to make predictions. We are interested in stacking such very deep hierarchies of non-linear features because we cannot learn complex features from a few layers. It can be shown mathematically that for images the best features for a single layer are edges and blobs because they contain the most information that we can extract from a single non-linear transformation. To generate features that contain more information we cannot operate on the inputs directly, but we need to transform our first features (edges and blobs) again to get more complex features that contain more information to distinguish between classes.
It has been shown that the human brain does exactly the same thing: The first hierarchy of neurons that receives information in the visual cortex are sensitive to specific edges and blobs while brain regions further down the visual pipeline are sensitive to more complex structures such as faces.
While hierarchical feature learning was used before the field deep learning existed, these architectures suffered from major problems such as the vanishing gradient problem where the gradients became too small to provide a learning signal for very deep layers, thus making these architectures perform poorly when compared to shallow learning algorithms (such as support vector machines).
The term deep learning originated from new methods and strategies designed to generate these deep hierarchies of non-linear features by overcoming the problems with vanishing gradients so that we can train architectures with dozens of layers of non-linear hierarchical features. In the early 2010s, it was shown that combining GPUs with activation functions that offered better gradient flow was sufficient to train deep architectures without major difficulties. From here the interest in deep learning grew steadily.
Deep learning is not associated just with learning deep non-linear hierarchical features, but also with learning to detect very long non-linear time dependencies in sequential data. While most other algorithms that work on sequential data only have a memory of the last 10 time steps, long short-term memory (LSTM) recurrent neural networks (invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997) allow the network to pick up on activity hundreds of time-steps in the past to make accurate predictions. While LSTM networks have been mostly ignored in the past 10 years, their usage has grown rapidly since 2013 and together with convolutional nets they form one of two major success stories of deep learning.
Fundamental Concepts
ARTIFICIAL NEURAL NETWORK
An artificial neural network (1) takes some input data, and (2) transforms this input data by calculating a weighted sum over the inputs and (3) applies a non-linear function to this transformation to calculate an intermediate state. The three steps above constitute what is known as a layer, and the transformative function is often referred to as a unit. The intermediate states—often termed features—are used as the input into another layer.
Through repetition of these steps, the artificial neural network learns multiple layers of non-linear features, which it then combines in a final layer to create a prediction.
The neural network learns by generating an error signal that measures the difference between the predictions of the network and the desired values and then using this error signal to change the weights (or parameters) so that predictions get more accurate.
UNIT
A unit often refers to the activation function in a layer by which the inputs are transformed via a nonlinear activation function (for example by the logistic sigmoid function). Usually, a unit has several incoming connections and several outgoing connections. However, units can also be more complex, like long short-term memory (LSTM) units, which have multiple activation functions with a distinct layout of connections to the nonlinear activation functions, or maxout units, which compute the final output over an array of nonlinearly transformed input values. Pooling, convolution, and other input transforming functions are usually not referred to as units.
ARTIFICIAL NEURON
The term artificial neuron—or most often just neuron—is an equivalent term to unit, but implies a close connection to neurobiology and the human brain while deep learning has very little to do with the brain (for example, it is now thought that biological neurons are more similar to entire multilayer perceptrons rather than a single unit in a neural network). The term neuron was encouraged after the last AI winter to differentiate the more successful neural network from the failing and abandoned perceptron. However, since the wild successes of deep learning after 2012, the media often picked up on the term “neuron” and sought to explain deep learning as mimicry of the human brain, which is very misleading and potentially dangerous for the perception of the field of deep learning. Now the term neuron is discouraged and the more descriptive term unit should be used instead.
LAYER
A layer is the highest-level building block in deep learning. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer. A layer is usually uniform, that is it only contains one type of activation function, pooling, convolution etc. so that it can be easily compared to other parts of the network. The first and last layers in a network are called input and output layers, respectively, and all layers in between are called hidden layers.
Convolutional Deep Learning
CONVOLUTION
Convolution is a mathematical operation which describes a rule of how to mix two functions or pieces of information: (1) The feature map (or input data) and (2) the convolution kernel mix together to form (3) a transformed feature map. Convolution is often interpreted as a filter, where the kernel filters the feature map for information of a certain kind (for example one kernel might filter for edges and discard other information).
Figure 2: Convolution of an image with an edge detector convolution kernel. Sources: 1 2.
Convolution is important in physics and mathematics as it defines a bridge between the spatial and time domains (pixel with intensity 147 at position (0,30)) and the frequency domain (amplitude of 0.3, at 30Hz, with 60-degree phase) through the convolution theorem. This bridge is defined by the use of Fourier transforms: When you use a Fourier transform on both the kernel and the feature map, then the convolution operation is simplified significantly (integration becomes mere multiplication). Some of the fastest GPU implementations of convolutions (for example some implementations in the NVIDIA cuDNN library) currently make use of Fourier transforms.
Figure 3: Calculating convolution by sliding image patches over the entire image. One image patch (yellow) of the original image (green) is multiplied by the kernel (red numbers in the yellow patch), and its sum is written to one feature map pixel (red cell in convolved feature). Image source: 1.
Convolution can describe the diffusion of information, for example, the diffusion that takes place if you put milk into your coffee and do not stir can be accurately modeled by a convolution operation (pixels diffuse towards contours in an image). In quantum mechanics, it describes the probability of a quantum particle being in a certain place when you measure the particle’s position (average probability for a pixel’s position is highest at contours). In probability theory, it describes cross-correlation, which is the degree of similarity for two sequences that overlap (similarity high if the pixels of a feature (e.g. nose) overlap in an image (e.g. face)). In statistics, it describes a weighted moving average over a normalized sequence of input (large weights for contours, small weights for everything else). Many other interpretations exist.
While it is unknown which interpretation of convolution is correct for deep learning, the cross-correlation interpretation is currently the most useful: convolutional filters can be interpreted as feature detectors, that is, the input (feature map) is filtered for a certain feature (the kernel) and the output is large if the feature is detected in the image. This is exactly how you interpret cross-correlation for an image.
Figure 4: Cross-correlation for an image. Convolution can be transformed to cross-correlation by reversing the kernel (upside-down image). The kernel can then be interpreted as a feature detector where a detected feature results in large outputs (white) and small outputs if no feature is present (black). Images are taken from Steven Smith’s excellent free online book about digital signal processing.
Additional material: Understanding Convolution in Deep Learning
POOLING / SUBSAMPLING
Pooling is a procedure that takes input over a certain area and reduces that to a single value (subsampling). In convolutional neural networks, this concentration of information has the useful property that outgoing connections usually receive similar information (the information is “funneled” into the right place for the input feature map of the next convolutional layer). This provides basic invariance to rotations and translations. For example, if the face on an image patch is not in the center of the image but slightly translated, it should still work fine because the information is funneled into the right place by the pooling operation so that the convolutional filters can detect the face.
The larger the size of the pooling area, the more information is condensed, which leads to slim networks that fit more easily into GPU memory. However, if the pooling area is too large, too much information is thrown away and predictive performance decreases.
Additional material: Neural networks [9.5]: Computer vision – pooling and subsampling
CONVOLUTIONAL NEURAL NETWORK (CNN)
A convolutional neural network, or preferably convolutional network or convolutional net (the term neural is misleading; see also artificial neuron), uses convolutional layers (see convolution) that filter inputs for useful information. These convolutional layers have parameters that are learned so that these filters are adjusted automatically to extract the most useful information for the task at hand (see Feature Learning). For example, in a general object recognition task it might be most useful to filter information about the shape of an object (objects usually have very different shapes) while for a bird recognition task it might be more suitable to extract information about the color of the bird (most birds have a similar shape, but different colors; here color is more useful to distinguish between birds). Convolutional networks adjust automatically to find the best feature for these tasks.
Usually, multiple convolutional layers are used that filter images for more and more abstract information after each layer (see hierarchical features).
Convolutional networks usually also use pooling layers (see pooling) for limited translation and rotation invariance (detect the object even if it appears at some unusual place). Pooling also reduces the memory consumption and thus allows for the usage of more convolutional layers.
More recent convolutional networks use inception modules (see inception) which use 1×1 convolutional kernels to reduce the memory consumption further while speeding up the computation (and thus training).
Figure 5: An image of a traffic sign is filtered by 4 5×5 convolutional kernels which create 4 feature maps, these feature maps are subsampled by max pooling. The next layer applies 10 5×5 convolutional kernels to these subsampled images and again we pool the feature maps. The final layer is a fully connected layer where all generated features are combined and used in the classifier (essentially logistic regression). Image by Maurice Peemen.
Additional material: Coursera: Neural Networks for Machine Learning: Object Recognition with Neural Nets.
INCEPTION
Inception modules in convolutional networks were designed to allow for deeper and largerconvolutional layers while at the same time allowing for more efficient computation. This is done by using 1×1 convolutions with small feature map size, for example, 192 28×28 sized feature maps can be reduced to 64 28×28 feature maps through 64 1×1 convolutions. Because of the reduced size, these 1×1 convolutions can be followed up with larger convolutions of size 3×3 and 5×5. In addition to 1×1 convolution, max pooling may also be used to reduce dimensionality.
In the output of an inception module, all the large convolutions are concatenated into a big feature map which is then fed into the next layer (or inception module).
Additional material: Going Deeper with Convolutions
Conclusion to Part 1
This concludes part one of this crash course on deep learning. Please check back soon for the next two parts of the series. In part 2, I’ll provide a brief historical overview followed by an introduction to training deep neural networks.
Meanwhile, you might be interested in learning about cuDNN, DIGITS, Computer Vision with Caffe,Natural Language Processing with Torch, Neural Machine Translation, the Mocha.jl deep learning framework for Julia, or other Parallel Forall posts on deep learning.
Deep Learning in a Nutshell: Core Concepts的更多相关文章
- (转) Deep Learning in a Nutshell: Core Concepts
Deep Learning in a Nutshell: Core Concepts Share: Posted on November 3, 2015by Tim Dettmers 7 Comm ...
- Deep Learning in a Nutshell: History and Training
Deep Learning in a Nutshell: History and Training This series of blog posts aims to provide an intui ...
- (转) Deep Learning in a Nutshell: Reinforcement Learning
Deep Learning in a Nutshell: Reinforcement Learning Share: Posted on September 8, 2016by Tim Dettm ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- (转) The major advancements in Deep Learning in 2016
The major advancements in Deep Learning in 2016 Pablo Tue, Dec 6, 2016 in MACHINE LEARNING DEEP LEAR ...
- 机器学习(Machine Learning)&深度学习(Deep Learning)资料【转】
转自:机器学习(Machine Learning)&深度学习(Deep Learning)资料 <Brief History of Machine Learning> 介绍:这是一 ...
- 机器学习(Machine Learning)&深度学习(Deep Learning)资料汇总 (上)
转载:http://dataunion.org/8463.html?utm_source=tuicool&utm_medium=referral <Brief History of Ma ...
- 机器学习(Machine Learning)与深度学习(Deep Learning)资料汇总
<Brief History of Machine Learning> 介绍:这是一篇介绍机器学习历史的文章,介绍很全面,从感知机.神经网络.决策树.SVM.Adaboost到随机森林.D ...
- The Promise of Deep Learning
The Promise of Deep Learning By Yoshua Bengio Humans have long dreamed of creating machines that thi ...
随机推荐
- 加密算法使用(一):用CRC32来压缩32uuid字符串
CRC32相比MD5重复率较高, 不过我们仍然可以使用CRC32然后转长整形的方式将32位的UUID字符串压缩成更短的整形唯一标识. /** * * @Title: getCRC32Value * @ ...
- IBatis.Net学习笔记十三:在IBatis.Net中调用存储过程
其实调用方式比较简单,主要也就是两种类型的存储过程:1.更新类型的存储过程2.查询类型的存储过程下面就来看看具体的调用方式:1.更新类型的存储过程sp_InsertAccount: CREATE PR ...
- 高性能网站性能优化与系统架构(ZT)
转载请保留出处:俊麟 Michael’s blog (http://space.itpub.net/7311285/viewspace-97) 我在CERNET做过拨号接入平台的搭建,而后在Yahoo ...
- pandas 数据索引与选取
我们对 DataFrame 进行选择,大抵从这三个层次考虑:行列.区域.单元格.其对应使用的方法如下:一. 行,列 --> df[]二. 区域 --> df.loc[], df.ilo ...
- 1从零开始学习Xamarin.iOS安装篇
安装和配置xamarin.ios 最近.net 开源新闻很火呀,于是想学习xamarin,早1年前就了解过这个东西,但是一直没有时间来学习,我这里装的是MAC上面的版本,废话不多说开始第一步安装. 概 ...
- 熟悉linux开发环境(实验)
北京电子科技学院(BESTI) 实验报告 课程: 深入理解计算机系统 班级: 1353班 姓名:张若嘉 杨舒雯 学号:20135330 20135324 成绩: 指导教师:娄嘉鹏 实验日期:2015. ...
- 20145215《Java程序设计》第5周学习总结
20145215<Java程序设计>第五周学习总结 教材学习内容总结 异常处理 语法与继承架构 异常就是程序在运行时出现不正常情况,异常的由来是因为Java把出现的问题封装成了对象,换句话 ...
- unity3d 扩展NGUI Tweener —— TweenFillAmount
好久没写博客了,上一篇是在今年上班之前写的 从年初到现在一篇没写过,每天都在加班,实在太忙了 上班半年多了,学到不少东西 今天分享一下刚写的小功能 TweenFillAmount 用过NGUI Twn ...
- 【MPI学习6】MPI并行程序设计模式:具有不连续数据发送的MPI程序设计
基于都志辉老师<MPI并行程序设计模式>第14章内容. 前面接触到的MPI发送的数据类型都是连续型的数据.非连续类型的数据,MPI也可以发送,但是需要预先处理,大概有两类方法: (1)用户 ...
- 第三章 Js变量的作用域和匿名函数
3.1 先看下面的事例: ①var temp=0; ②temp=0; 当js解析器检测到①这种情况的时候,解析器会为这个变量开辟一个内存空间,如果前面已经存在这个变量,就会把这个变量覆盖掉. 当解析器 ...