Applications of convolutional neural networks to natural language processing tasks. Reference: Understanding Convolutional Neural Networks for NLP (2015.11)

  Instead of image pixels, the input to most NLP tasks is sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.
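  As a concrete illustration, here is a minimal sketch of the sentence-to-matrix conversion; the vocabulary and embedding values are made up stand-ins, not taken from any pre-trained model:

```python
import numpy as np

# Toy vocabulary and randomly initialized 100-dimensional "embeddings".
# In practice these rows would come from word2vec or GloVe.
vocab = {"i": 0, "like": 1, "this": 2, "movie": 3, "very": 4, "much": 5}
embedding_dim = 100
embedding_table = np.random.rand(len(vocab), embedding_dim)

sentence = "i like this movie very much".split()

# Each row of the matrix is the embedding of one token: shape (6, 100).
sentence_matrix = np.stack([embedding_table[vocab[w]] for w in sentence])
print(sentence_matrix.shape)  # (6, 100)
```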

  In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical. Putting all of the above together, a Convolutional Neural Network for NLP looks roughly like this: filters of several region sizes slide over the sentence matrix, their outputs are pooled, and the pooled features are fed into a classifier.
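  To make the shapes concrete, here is a minimal NumPy sketch of one such full-width filter; the weights and sentence length are made up for illustration:

```python
import numpy as np

sentence_matrix = np.random.rand(10, 100)           # 10 words, 100-dim embeddings
region_size = 3                                     # filter covers 3 words at a time
filter_weights = np.random.rand(region_size, 100)   # width == embedding dimension

# Slide the filter over full rows: one number per window position.
feature_map = np.array([
    np.sum(sentence_matrix[i:i + region_size] * filter_weights)
    for i in range(10 - region_size + 1)
])
print(feature_map.shape)  # (8,) -- one value per 3-word window
```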

  What about the nice intuitions we had for Computer Vision? Location Invariance and local Compositionality made intuitive sense for images, but not so much for NLP. You probably do care a lot where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn’t always true for words. In many languages, parts of phrases could be separated by several other words. The compositional aspect isn’t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what higher-level representations actually “mean”, isn’t as obvious as in the Computer Vision case.

  Given all this, it seems like CNNs wouldn’t be a good fit for NLP tasks. Recurrent Neural Networks make more intuitive sense. They resemble how we process language (or at least how we think we process language): reading sequentially from left to right. Fortunately, this doesn’t mean that CNNs don’t work. All models are wrong, but some are useful. It turns out that CNNs applied to NLP problems perform quite well. The simple Bag of Words model is an obvious oversimplification with incorrect assumptions, but it has nonetheless been the standard approach for years and has led to pretty good results.

  A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and implemented on a hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive. Even Google doesn’t provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary. It’s completely reasonable to have filters of size larger than 5. I like to think that many of the learned filters in the first layer are capturing features quite similar (but not limited) to n-grams, but represent them in a more compact way.

  

CNN Hyperparameters

  Before explaining how CNNs are applied to NLP tasks, let’s look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.

  Narrow vs. Wide convolution

  When I explained convolutions above I neglected a little detail of how we apply the filter. Applying a 3×3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn’t have any neighboring elements to the top and left? You can use zero-padding. All elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix, and get a larger or equally sized output. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. An example in 1D looks like this:

  You can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the example above, the narrow convolution yields a smaller output than the wide one. More generally, for an input of length n_in, a filter of length n_filter, and zero-padding of n_padding elements on each side, the output size is n_out = (n_in + 2 * n_padding - n_filter) + 1.
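  In one dimension this is easy to verify with NumPy, whose convolution modes correspond to the two cases: 'valid' is a narrow convolution (no padding) and 'full' is a wide convolution (zero-padding of filter_size - 1 on each side):

```python
import numpy as np

x = np.arange(7, dtype=float)   # input of length 7
w = np.ones(5)                  # filter of length 5

narrow = np.convolve(x, w, mode="valid")  # no zero-padding
wide = np.convolve(x, w, mode="full")     # zero-padding of 4 on each side

print(len(narrow))  # (7 - 5) + 1 = 3
print(len(wide))    # (7 + 2*4 - 5) + 1 = 11
```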

  Stride Size

  Another hyperparameter for your convolutions is the stride size, defining by how much you want to shift your filter at each step.  In all the examples above the stride size was 1, and consecutive applications of the filter overlapped. A larger stride size leads to fewer applications of the filter and a smaller output size. The following from the Stanford cs231 website shows stride sizes of 1 and 2 applied to a one-dimensional input:

  

  In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network, i.e. looks like a tree.
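  As a quick sanity check (a sketch that just counts window positions, with no padding and no learned weights), you can see how the stride changes the number of filter applications:

```python
def conv_output_positions(input_length, filter_size, stride):
    """Return the window start positions for a 1-D convolution (no padding)."""
    return list(range(0, input_length - filter_size + 1, stride))

print(conv_output_positions(7, 3, stride=1))  # [0, 1, 2, 3, 4] -> 5 outputs
print(conv_output_positions(7, 3, stride=2))  # [0, 2, 4]       -> 3 outputs
```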

  Pooling Layers

  A key aspect of Convolutional Neural Networks is pooling layers, typically applied after the convolutional layers. Pooling layers subsample their input. The most common way to do pooling is to apply a max operation to the result of each filter. You don’t necessarily need to pool over the complete matrix, you could also pool over a window. For example, the following shows max pooling for a 2×2 window (in NLP we typically apply pooling over the complete output, yielding just a single number for each filter):

  Why pooling? There are a couple of reasons. One property of pooling is that it provides a fixed size output matrix, which typically is required for classification. For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters, or the size of your input. This allows you to use variable size sentences, and variable size filters, but always get the same output dimensions to feed into a classifier.
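  Here is a small NumPy sketch of this max-over-time pooling (the feature maps are random stand-ins): two sentences of different lengths end up with pooled vectors of the same size, one entry per filter.

```python
import numpy as np

# Feature maps from 4 filters applied to sentences of different lengths.
short_sentence_maps = np.random.rand(4, 6)   # 4 filters, 6 window positions
long_sentence_maps = np.random.rand(4, 48)   # 4 filters, 48 window positions

# Max over all window positions: one number per filter.
pooled_short = short_sentence_maps.max(axis=1)
pooled_long = long_sentence_maps.max(axis=1)

print(pooled_short.shape, pooled_long.shape)  # (4,) (4,) -- same size either way
```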

  Pooling also reduces the output dimensionality but (hopefully) keeps the most salient information. You can think of each filter as detecting a specific feature, such as detecting if the sentence contains a negation like “not amazing” for example. If this phrase occurs somewhere in the sentence, the result of applying the filter to that region will yield a large value, but a small value in other regions. By performing the max operation you  are keeping information about whether or not the feature appeared in the sentence, but you are losing information about where exactly it appeared. But isn’t this information about locality really useful? Yes, it  is and it’s a bit similar to what a bag of n-grams model is doing. You are losing global information about locality (where in a sentence something happens), but you are keeping local information captured by your filters, like “not amazing” being very different from “amazing not”.

  In image recognition, pooling also provides basic invariance to translation (shifting) and rotation. When you are pooling over a region, the output will stay approximately the same even if you shift/rotate the image by a few pixels, because the max operation will pick out the same value regardless.

  Channels

  The last concept we need to understand is channels. Channels are different “views” of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, either with different or equal weights. In NLP you could imagine having various channels as well: you could have separate channels for different word embeddings (word2vec and GloVe for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
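  For instance (a sketch with random stand-ins for the two embedding tables), stacking a word2vec view and a GloVe view of the same sentence gives a two-channel input, analogous to the color channels of an image:

```python
import numpy as np

num_words, dim = 10, 100
word2vec_view = np.random.rand(num_words, dim)  # stand-in for word2vec rows
glove_view = np.random.rand(num_words, dim)     # stand-in for GloVe rows

# Shape (channels, words, embedding_dim), analogous to (RGB, height, width).
multichannel_input = np.stack([word2vec_view, glove_view])
print(multichannel_input.shape)  # (2, 10, 100)
```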

Convolutional Neural Networks applied to NLP

  Let’s now look at some of the applications of CNNs to Natural Language Processing. I’ll try to summarize some of the research results. Invariably I’ll miss many interesting applications (do let me know in the comments), but I hope to cover at least some of the more popular results.

  The most natural fit for CNNs seems to be classification tasks, such as Sentiment Analysis, Spam Detection or Topic Categorization. Convolutions and pooling operations lose information about the local order of words, so sequence tagging as in PoS Tagging or Entity Extraction is a bit harder to fit into a pure CNN architecture (though not impossible, you can add positional features to the input).

  [1] Evaluates a CNN architecture on various classification datasets, mostly comprised of Sentiment Analysis and Topic Categorization tasks. The CNN architecture achieves very good performance across datasets, and new state-of-the-art on a few. Surprisingly, the network used in this paper is quite simple, and that’s what makes it powerful. The input layer is a sentence comprised of concatenated word2vec word embeddings. That’s followed by a convolutional layer with multiple filters, then a max-pooling layer, and finally a softmax classifier. The paper also experiments with two different channels in the form of static and dynamic word embeddings, where one channel is adjusted during training and the other isn’t. A similar, but somewhat more complex, architecture was previously proposed in [2]. [6] Adds an additional layer that performs “semantic clustering” to this network architecture.

  

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification
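  A minimal PyTorch sketch of this kind of architecture (the hyperparameter values are illustrative, not the paper’s exact settings): an embedding layer, parallel convolutions with several region sizes, max-over-time pooling, and a linear layer producing logits for a softmax classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100,
                 num_filters=100, region_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv2d per region size; each filter spans the full embedding width.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (h, embed_dim)) for h in region_sizes
        )
        self.fc = nn.Linear(num_filters * len(region_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).unsqueeze(1)      # (batch, 1, seq_len, embed_dim)
        feature_maps = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [fm.max(dim=2).values for fm in feature_maps]  # max over time
        features = torch.cat(pooled, dim=1)
        return self.fc(features)                        # logits for softmax

model = TextCNN()
logits = model(torch.randint(0, 10000, (8, 20)))        # batch of 8 sentences, 20 tokens each
print(logits.shape)                                     # torch.Size([8, 2])
```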

  [4] Trains a CNN from scratch, without the need for pre-trained word vectors like word2vec or GloVe. It applies convolutions directly to one-hot vectors. The author also proposes a space-efficient bag-of-words-like representation for the input data, reducing the number of parameters the network needs to learn. In [5] the author extends the model with an additional unsupervised “region embedding” that is learned using a CNN predicting the context of text regions. The approach in these papers seems to work well for long-form texts (like movie reviews), but their performance on short texts (like tweets) isn’t clear. Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

  Building a CNN architecture means that there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation on the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and variance over multiple runs. If you are looking to implement your own CNN for text classification, using the results of this paper as a starting point would be an excellent idea. A few results that stand out are that max-pooling always beats average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn’t seem to make a big difference in the NLP tasks that were considered. A caveat of this research is that all the datasets were quite similar in terms of their document length, so the same guidelines may not apply to data that looks considerably different.
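  If you run such a sweep yourself, it helps to write the search space down explicitly; the values below are plausible defaults of my own choosing, not the grid used in [7]:

```python
from itertools import product

search_space = {
    "embedding": ["word2vec", "glove", "one-hot"],
    "region_sizes": [(3,), (3, 4, 5), (5, 6, 7)],
    "num_filters": [50, 100, 200],
    "pooling": ["max", "average"],
    "activation": ["relu", "tanh"],
}

# Enumerate every configuration in the grid.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]
print(len(configs))  # 3 * 3 * 3 * 2 * 2 = 108 configurations
```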

  [8] explores CNNs for Relation Extraction and Relation Classification tasks. In addition to the word vectors, the authors use the relative positions of words to the entities of interest as an input to the convolutional layer. This model assumes that the positions of the entities are given, and that each example input contains one relation. [9] and [10] have explored similar models.
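  A sketch of the position-feature idea (the exact encoding varies from paper to paper; here the raw relative offsets are simply appended to each word vector, whereas the papers typically map them to learned position embeddings):

```python
import numpy as np

words = ["the", "acquisition", "of", "WhatsApp", "by", "Facebook", "closed"]
entity1_idx, entity2_idx = 3, 5                   # positions of the two entities
word_vectors = np.random.rand(len(words), 100)    # stand-in word embeddings

# Relative position of every word to each entity, as two extra input columns.
positions = np.array([[i - entity1_idx, i - entity2_idx]
                      for i in range(len(words))], dtype=float)
conv_input = np.concatenate([word_vectors, positions], axis=1)
print(conv_input.shape)  # (7, 102) -- fed to the convolutional layer
```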

  Another interesting use case of CNNs in NLP can be found in [11] and [12], coming out of Microsoft Research. These papers describe how to learn semantically meaningful representations of sentences that can be used for Information Retrieval. The example given in the papers includes recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained based on search engine log data.

  Most CNN architectures learn embeddings (low-dimensional representations) for words and sentences in one way or another as part of their training procedure. Not all papers though focus on this aspect of training or investigate how meaningful the learned embeddings are. [13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task – recommending potentially interesting documents to users, trained based on clickstream data.

  

  Character-Level CNNs

  So far, all of the models presented were based on words. But there has also been research in applying CNNs directly to characters. [14] learns character-level embeddings, joins them with pre-trained word embeddings, and uses a CNN for Part of Speech tagging. [15] and [16] explore the use of CNNs to learn directly from characters, without the need for any pre-trained embeddings. Notably, the authors use a relatively deep network with a total of 9 layers, and apply it to Sentiment Analysis and Text Categorization tasks. Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples). [17] explores the application of character-level convolutions to Language Modeling, using the output of the character-level CNN as the input to an LSTM at each time step. The same model is applied to various languages.
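  To make the input representation concrete, here is a small sketch of character-level one-hot encoding (the tiny alphabet and short fixed length are illustrative; the papers use a larger fixed alphabet and much longer fixed document lengths):

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "           # tiny illustrative alphabet
char_to_idx = {c: i for i, c in enumerate(alphabet)}

def one_hot_chars(text, max_len=16):
    """Encode text as a (max_len, |alphabet|) one-hot matrix, truncating or padding."""
    matrix = np.zeros((max_len, len(alphabet)))
    for pos, ch in enumerate(text[:max_len]):
        if ch in char_to_idx:
            matrix[pos, char_to_idx[ch]] = 1.0
    return matrix

x = one_hot_chars("not amazing")
print(x.shape)  # (16, 27) -- rows are characters instead of words
```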

  What’s amazing is that essentially all of the papers above were published in the past 1-2 years. Obviously there has been excellent work with CNNs on NLP before, as in Natural Language Processing (almost) from Scratch, but the pace of new results and state-of-the-art systems being published is clearly accelerating.

Paper References
