Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks (paper)

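Key point of the paper:

The text processing follows the usual pipeline with no major differences; the paper's main contribution is the similarity matrix it proposes.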

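Overview of the computation:

The query and the document are first passed through word embeddings to obtain their representation matrices.

A CNN processes each matrix to produce its feature maps; pooling then yields the query vector Xq and the document vector Xd.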

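A traditional Siamese network would, at this step, compute the similarity of Xq and Xd directly with Euclidean or cosine distance and predict from that. This network instead uses a similarity matrix to compute the similarity between Xq and Xd. It then concatenates Xd, Xq and sim(Xq, Xd), adds the word overlap and IDF-weighted word overlap features, and feeds the resulting feature vector into a neural network layer. (Word overlap is a way of measuring sentence similarity based on shared words.)

The output of that layer passes through a fully connected layer, and a softmax function produces the prediction.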
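The join and prediction steps above can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' code: the function name `score_pair`, the weight names, the ReLU in the hidden layer, and the random initial values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def score_pair(x_q, x_d, x_feat, M, W_h, b_h, W_s, b_s):
    """Join [x_q; sim; x_d; x_feat], apply a hidden layer, then softmax."""
    sim = x_q @ M @ x_d                                  # similarity score
    x_join = np.concatenate([x_q, [sim], x_d, x_feat])   # joint representation
    hidden = np.maximum(0.0, W_h @ x_join + b_h)         # hidden layer (ReLU assumed)
    return softmax(W_s @ hidden + b_s)                   # P(irrelevant), P(relevant)

n = 100                                # size of each sentence vector
x_q, x_d = rng.normal(size=n), rng.normal(size=n)
x_feat = rng.random(4)                 # the four word-overlap features
join_dim = 2 * n + 1 + 4

M = rng.normal(scale=0.01, size=(n, n))
W_h = rng.normal(scale=0.01, size=(join_dim, join_dim))
b_h = np.zeros(join_dim)
W_s = rng.normal(scale=0.01, size=(2, join_dim))
b_s = np.zeros(2)

probs = score_pair(x_q, x_d, x_feat, M, W_h, b_h, W_s, b_s)
```

Because the query and document vectors are part of the joint vector, the classifier sees both the similarity score and the intermediate representations.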

Questions and documents are limited to a single sentence.

The main building blocks of our architecture are two distributional sentence models based on convolutional neural networks. These underlying sentence models work in parallel, mapping queries and documents to their distributional vectors, which are then used to learn the semantic similarity between them.

our model encodes query-document pairs in a rich representation using not only their similarity score but also their intermediate representations; (iii) the architecture of our network makes it straightforward to include any additional similarity features to the model.

However, their model operates only on unigram or bigrams, while our architecture learns to extract and compose n-grams of higher degrees, thus allowing for capturing longer range dependencies. Additionally, our architecture uses not only the intermediate representations of questions and answers to compute their similarity but also includes them in the final representation, which constitutes a much richer representation of the question-answer pairs.

Pointwise:
It is enough to train a binary classifier.
Pairwise:
The model is explicitly trained to score correct pairs higher than incorrect pairs with a certain margin.
Comparison:
The pairwise method requires considering a larger number of training instances (potentially quadratic in the size of the candidate document set) than the pointwise method, which may lead to slower training times. Still, both pointwise and pairwise approaches ignore the fact that ranking is a prediction task on a list of objects.
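To make the difference concrete, here is a small sketch of the two objectives for a single query. The binary cross-entropy and hinge formulations are common choices and are assumed here; the notes do not specify the exact pairwise loss.

```python
import numpy as np

def pointwise_loss(scores, labels):
    """Binary cross-entropy: every (query, document) pair is scored on its own."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Hinge loss over all (correct, incorrect) document pairs of one query."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # One training instance per (positive, negative) combination.
    diffs = margin - (pos[:, None] - neg[None, :])
    return float(np.mean(np.maximum(0.0, diffs))), diffs.size

scores = np.array([2.0, 0.5, -1.0, 0.0])   # model scores for four candidates
labels = np.array([1, 0, 0, 1])            # 1 = correct answer

pw_loss = pointwise_loss(scores, labels)                # uses 4 instances
pair_loss, n_pairs = pairwise_hinge_loss(scores, labels)  # uses 2 * 2 = 4 pairs
```

With p correct and n incorrect candidates the pairwise objective sees p * n instances, which is where the potentially quadratic growth comes from.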

Most often, producing a better representation ψ() that encodes various aspects of similarity between the input query-document pairs plays a far more important role in training an accurate reranker than choosing between different ranking approaches. Hence, in this paper we adopt a simple pointwise method to reranking and focus on modelling a rich representation of query-document pairs using deep learning approaches which is described next.

Our network is composed of a single wide convolutional layer followed by a non-linearity and simple max pooling. (In other words, a wide convolutional network.)
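A rough sketch of such a sentence model (wide convolution, then ReLU, then max pooling) is shown below. The function name `sentence_vector`, the random inputs, and the use of cross-correlation without flipping the filter are assumptions for illustration only.

```python
import numpy as np

def sentence_vector(S, filters, biases):
    """Map a sentence matrix S (d x |s|) to a vector with one value per filter."""
    d, length = S.shape
    n, _, m = filters.shape
    # Zero-pad m - 1 columns on each side so that every word is seen m times.
    padded = np.pad(S, ((0, 0), (m - 1, m - 1)))
    out_len = length + m - 1
    feature_maps = np.empty((n, out_len))
    for i in range(out_len):
        window = padded[:, i:i + m]                      # d x m slice
        # Sum of element-wise products between each filter and the window.
        feature_maps[:, i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    activated = np.maximum(0.0, feature_maps + biases[:, None])  # ReLU
    return activated.max(axis=1)                         # max pooling over positions

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 3))              # a 3-word sentence, 50-dim embeddings
filters = rng.normal(size=(100, 50, 5))   # 100 feature maps, filter width 5
x = sentence_vector(S, filters, np.zeros(100))
```

The example sentence is shorter than the filter width, and the padded convolution still produces a valid 100-dimensional vector.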

The range of allowed values for i defines two types of convolution: narrow and wide. The narrow type restricts i to be in the range [1, |s| − m + 1], which in turn restricts the filter width to be ≤ |s|. To compute the wide type of convolution i ranges from 1 to |s| and sets no restrictions on the size of m and s. The benefits of one type of convolution over the other when dealing with text are discussed in detail in [18]. In short, the wide convolution is able to better handle words at boundaries giving equal attention to all words in the sentence, unlike in narrow convolution, where words close to boundaries are seen fewer times. More importantly, wide convolution also guarantees to always yield valid values even when s is shorter than the filter size m.
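The two types can be compared on a one-dimensional toy input. This sketch assumes the usual zero-padded definition of wide convolution, which produces |s| + m − 1 outputs; the helper names are made up for the example.

```python
import numpy as np

def narrow_conv(s, f):
    """Narrow convolution: the filter must fit entirely inside the sequence."""
    m = len(f)
    if m > len(s):
        raise ValueError("narrow convolution needs filter width <= sequence length")
    # f is flipped so that this matches the textbook definition of convolution.
    return np.array([s[i:i + m] @ f[::-1] for i in range(len(s) - m + 1)])

def wide_conv(s, f):
    """Wide convolution: zero-pad m - 1 positions on both sides, then slide."""
    m = len(f)
    padded = np.concatenate([np.zeros(m - 1), s, np.zeros(m - 1)])
    return narrow_conv(padded, f)

sentence = np.array([1.0, 2.0, 3.0])   # |s| = 3
filt = np.ones(5)                      # m = 5, wider than the sentence

wide_out = wide_conv(sentence, filt)   # length |s| + m - 1 = 7
```

Calling `narrow_conv(sentence, filt)` raises an error here because the filter is wider than the sentence, while the wide variant still returns a value for every position.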

It should be noted that an alternative way of computing a convolution was explored in [18], where a series of convolutions are computed between each row of a sentence matrix and a corresponding row of the filter matrix. Essentially, it is a vectorized form of 1d convolution applied between corresponding rows of S and F. As a result, the output feature map is a matrix C ∈ R.

Among the most common choices of activation functions are the following: sigmoid (or logistic), hyperbolic tangent tanh, and a rectified linear (ReLU) function defined as simply max(0, x) to ensure that feature maps are always positive.

Both average and max pooling methods exhibit certain disadvantages: in average pooling, all elements of the input are considered, which may weaken strong activation values. This is especially critical with tanh non-linearity, where strong positive and negative activations can cancel each other out. The max pooling is used more widely and does not suffer from the drawbacks of average pooling. However, as shown in [40], it can lead to strong overfitting on the training set and, hence, poor generalization on the test data.

Recently, max pooling has been generalized to k-max pooling [18], where instead of a single max value, k values are extracted in their original order. This allows for extracting several largest activation values from the input sentence.
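A short sketch of k-max pooling on a single feature map, assuming k is no larger than the length of the map (the function name is illustrative):

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep the k largest activations, preserving their original order."""
    c = np.asarray(feature_map, dtype=float)
    top_idx = np.argsort(c)[-k:]     # positions of the k largest values
    return c[np.sort(top_idx)]       # restore left-to-right order

c = [0.5, 0.9, -0.3, 0.7, 0.1]
pooled = k_max_pooling(c, 3)         # -> [0.5, 0.9, 0.7]
single = k_max_pooling(c, 1)         # ordinary max pooling: [0.9]
```

With k = 1 this reduces to the simple max pooling used in the paper's sentence model.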

Our architecture for matching text pairs:
Our sentence models based on ConvNets learn to map input sentences to vectors, which can then be used to compute their similarity. These are then used to compute a query-document similarity score, which together with the query and document vectors are joined in a single representation.

Measuring the similarity between query and document:
In this model, we seek a transformation of the candidate document x'd = M xd that is the closest to the input query xq. The similarity matrix M is a parameter of the network and is optimized during the training.
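Read this way, the similarity is the dot product between the query vector and the transformed document vector, i.e. sim(xq, xd) = xq^T M xd. A tiny sketch with hand-picked vectors (values chosen only for illustration):

```python
import numpy as np

def bilinear_similarity(x_q, x_d, M):
    """sim(x_q, x_d) = x_q^T (M x_d): transform the document, then compare."""
    x_d_transformed = M @ x_d
    return float(x_q @ x_d_transformed)

x_q = np.array([1.0, 0.0, 2.0])
x_d = np.array([0.5, 1.0, 1.0])

# With M = I the score is the plain dot product: 0.5 + 0.0 + 2.0 = 2.5.
identity_score = bilinear_similarity(x_q, x_d, np.eye(3))

# A learned M can re-weight or mix dimensions; here the third one is down-weighted.
M = np.diag([1.0, 1.0, 0.5])
weighted_score = bilinear_similarity(x_q, x_d, M)   # 0.5 + 0.0 + 1.0 = 1.5
```

In the network M starts from some initial value and is updated by backpropagation together with the other weights.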

Adagrad scales the learning rate of SGD on each dimension based on the l2 norm of the history of the error gradient. Adadelta uses both the error gradient history like Adagrad and the weight update history. It has the advantage of not having to set a learning rate at all.
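The two update rules can be sketched as below, following the standard published formulations. The hyperparameter defaults (`lr`, `rho`, `eps`) and the toy quadratic objective are assumptions chosen for the demo, not values from the paper.

```python
import numpy as np

def adagrad_step(w, grad, state, lr=0.1, eps=1e-8):
    """Scale each dimension by the L2 norm of its gradient history."""
    state["sq_grad"] += grad ** 2
    return w - lr * grad / (np.sqrt(state["sq_grad"]) + eps)

def adadelta_step(w, grad, state, rho=0.95, eps=1e-6):
    """Use decaying averages of squared gradients and squared updates; no learning rate."""
    state["sq_grad"] = rho * state["sq_grad"] + (1 - rho) * grad ** 2
    delta = -np.sqrt(state["sq_delta"] + eps) / np.sqrt(state["sq_grad"] + eps) * grad
    state["sq_delta"] = rho * state["sq_delta"] + (1 - rho) * delta ** 2
    return w + delta

def loss(w):
    return float(np.sum(w ** 2))      # toy objective with gradient 2w

w0 = np.array([3.0, -2.0])
w_adagrad, s1 = w0.copy(), {"sq_grad": np.zeros(2)}
w_adadelta, s2 = w0.copy(), {"sq_grad": np.zeros(2), "sq_delta": np.zeros(2)}

for _ in range(50):
    w_adagrad = adagrad_step(w_adagrad, 2 * w_adagrad, s1)
    w_adadelta = adadelta_step(w_adadelta, 2 * w_adadelta, s2)
```

Note that `adadelta_step` takes no learning rate argument, which is the advantage mentioned above.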

Hyperparameters:
The width m of the convolution filters is set to 5 and the number of convolutional feature maps is 100. We use ReLU activation function and a simple max-pooling.
--
To train the network we use stochastic gradient descent with shuffled mini-batches. We eliminate the need to tune the learning rate by using the Adadelta update rule [39]. The batch size is set to 50 examples. The network is trained for 25 epochs with early stopping, i.e., we stop the training if no update to the best accuracy on the dev set has been made for the last 5 epochs. The accuracy computed on the dev set is the MAP score. At test time we use the parameters of the network that were obtained with the best MAP score on the development (dev) set, i.e., we compute the MAP score after every 10 mini-batch updates and save the network parameters if a new best dev MAP score was obtained. In practice, the training converges after a few epochs. We set a value for L2 regularization term to 1e−5 for the parameters of convolutional layers and 1e−4 for all the others. The dropout rate is set to p = 0.5.
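The early-stopping rule can be sketched as follows. For simplicity this hypothetical helper checks the dev MAP once per epoch rather than after every 10 mini-batch updates, and the dev scores are made-up numbers.

```python
def early_stopping(dev_map_per_epoch, max_epochs=25, patience=5):
    """Return (best_epoch, best_map, last_epoch_run) for a sequence of dev MAP scores."""
    best_map, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch, dev_map in enumerate(dev_map_per_epoch[:max_epochs]):
        last_epoch = epoch
        if dev_map > best_map:
            # New best score: this is where the parameters would be saved.
            best_map, best_epoch = dev_map, epoch
        elif epoch - best_epoch >= patience:
            break                      # no improvement for `patience` epochs
    return best_epoch, best_map, last_epoch

dev_scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.69, 0.66, 0.64, 0.71]
best_epoch, best_map, last_epoch = early_stopping(dev_scores)
```

In this made-up run the best score appears at epoch 2 and training stops at epoch 7, so the late improvement at epoch 9 is never reached.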
--
We keep the word embeddings fixed and initialize the word matrix W from an unsupervised neural language model.
--
We choose the dimensionality of our word embeddings to be 50 to be in line with the deep learning model of [38].
--
Word embeddings. We initialize the word embeddings by running word2vec tool [20] on the English Wikipedia dump and the AQUAINT corpus containing roughly 375 million words. To train the embeddings we use the skipgram model with window size 5 and filtering words with frequency less than 5. The resulting model contains 50-dimensional vectors for about 3.5 million words. Embeddings for words not present in the word2vec model are randomly initialized with each component sampled from the uniform distribution U[−0.25, 0.25]. We minimally preprocess the data only performing tokenization and lowercasing all words. To reduce the size of the resulting vocabulary V, we also replace all digits with 0. The size of the word vocabulary V for experiments using TRAIN set is 17,023 with approximately 95% of words initialized using word2vec embeddings and the remaining 5% words are initialized at random as described in Sec.
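The preprocessing and embedding initialization described above might look roughly like this. Whitespace splitting stands in for a real tokenizer, and the `pretrained` dictionary stands in for a loaded word2vec model; both are simplifications for illustration.

```python
import re
import numpy as np

def preprocess(text):
    """Lowercase, replace every digit with 0, and split on whitespace."""
    return re.sub(r"\d", "0", text.lower()).split()

def build_embedding_matrix(vocab, pretrained, dim=50, seed=0):
    """Copy pretrained vectors where available; sample the rest from U[-0.25, 0.25]."""
    rng = np.random.default_rng(seed)
    W = np.empty((len(vocab), dim))
    for idx, word in enumerate(vocab):
        if word in pretrained:
            W[idx] = pretrained[word]
        else:
            W[idx] = rng.uniform(-0.25, 0.25, size=dim)   # out-of-vocabulary word
    return W

tokens = preprocess("In 1998 Apple released iMac")
# -> ['in', '0000', 'apple', 'released', 'imac']

vocab = sorted(set(tokens))
pretrained = {"apple": np.full(50, 0.1), "in": np.full(50, -0.1)}  # toy vectors
W = build_embedding_matrix(vocab, pretrained)
```

Mapping all digits to 0 collapses numbers of the same length into one token, which is what shrinks the vocabulary.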
--
Additional features. Given that a certain percentage of the words in our word embedding matrix are initialized at random (about 15% for the TRAIN-ALL) and a relatively small number of QA pairs prevents the network from directly learning them from the training data, similarity matching performed by the network will be suboptimal between many question-answer pairs.

In particular, we compute word overlap measures between each question-answer pair and include it as an additional feature vector xfeat in our model. This feature vector contains only four features: word overlap and IDF-weighted word overlap computed between all words and only non-stop words. Computing these features is straightforward and does not require additional pre-processing or external resources.
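One possible way to compute the four features is sketched below. The exact normalization is not given in these notes, so the Jaccard-style ratios, the toy IDF table, and the stop-word list are all assumptions.

```python
def overlap_features(q_tokens, d_tokens, idf, stopwords):
    """Return [overlap, idf_overlap] for all words, then for non-stop words only."""
    feats = []
    for drop_stop in (False, True):
        q = {w for w in q_tokens if not (drop_stop and w in stopwords)}
        d = {w for w in d_tokens if not (drop_stop and w in stopwords)}
        shared, union = q & d, q | d
        # Plain overlap: fraction of distinct words that both texts share.
        plain = len(shared) / len(union) if union else 0.0
        # IDF-weighted overlap: same ratio, but rare words count for more.
        total_idf = sum(idf.get(w, 0.0) for w in union)
        weighted = sum(idf.get(w, 0.0) for w in shared) / total_idf if total_idf else 0.0
        feats += [plain, weighted]
    return feats

question = ["who", "wrote", "the", "hobbit"]
answer = ["tolkien", "wrote", "the", "hobbit"]
idf = {"who": 0.5, "the": 0.1, "wrote": 2.0, "hobbit": 4.0, "tolkien": 4.5}
x_feat = overlap_features(question, answer, idf, stopwords={"the", "who"})
```

These features give the model an exact-match signal that does not depend on how well the embeddings of rare words were initialized.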

Evaluation:
MRR: MRR is only looking at the rank of the first correct answer, hence it is more suitable in cases where for each question there is only a single correct answer.
MAP: examines the ranks of all the correct answers. It is computed as the mean over the average precision scores of all queries.
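Both metrics are easy to compute from ranked relevance labels. In this sketch each ranking is a list of 0/1 labels in the order the system ranked the candidates; the two example rankings are made up.

```python
def reciprocal_rank(labels):
    """1 / rank of the first correct answer (0 if there is none)."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def average_precision(labels):
    """Mean of precision@rank taken at the rank of every correct answer."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_over_queries(metric, rankings):
    return sum(metric(r) for r in rankings) / len(rankings)

rankings = [
    [0, 1, 1, 0],   # correct answers at ranks 2 and 3
    [1, 0, 0],      # correct answer at rank 1
]
mrr = mean_over_queries(reciprocal_rank, rankings)            # (1/2 + 1) / 2 = 0.75
mean_ap = mean_over_queries(average_precision, rankings)      # (7/12 + 1) / 2
```

The second correct answer in the first ranking changes MAP but leaves MRR untouched, which is the difference described above.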
