Deep Learning for NLP Best Practices - Reading Notes
https://www.wxnmh.com/thread-1528249.htm
https://www.wxnmh.com/thread-1528251.htm
https://www.wxnmh.com/thread-1528254.htm
Word embeddings
using pre-trained embeddings (Kim, 2014) [12]
Use pre-trained embeddings.
The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [44] or part-of-speech (POS) tagging (Plank et al., 2016) [32], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [45]
The optimal embedding dimensionality is task-dependent: lower dimensions work better for syntactic tasks (NER, POS tagging), while higher dimensions help semantic tasks (sentiment analysis).
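As a concrete illustration, a minimal PyTorch sketch of initializing an embedding layer from pre-trained vectors; the random `pretrained` matrix, vocabulary size, and `freeze` choice are illustrative assumptions, not details from the post.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained matrix: 10k-word vocab, 300-dim vectors
# (in practice loaded from word2vec/GloVe files).
pretrained = torch.randn(10000, 300)

# Copy the pre-trained weights into an Embedding layer; freeze=False
# lets the vectors be fine-tuned on the downstream task.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[1, 42, 7]])  # a batch with one 3-token sentence
vectors = embedding(token_ids)          # shape: (1, 3, 300)
print(vectors.shape)
```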
Depth
use deep Bi-LSTMs, typically consisting of 3-4 layers, e.g. for POS tagging (Plank et al., 2016) and semantic role labelling (He et al., 2017) [33]. Models for some tasks can be even deeper, cf. Google's NMT model with 8 encoder and 8 decoder layers (Wu et al., 2016) [20]
Deep Bi-LSTMs of 3-4 layers are used for POS tagging and semantic role labelling (SRL). Google's NMT model uses 8 encoder and 8 decoder layers.
performance improvements of making the model deeper than 2 layers are minimal (Reimers & Gurevych, 2017) [46]
Gains from going deeper than 2 layers are minimal.
For classification, deep or very deep models perform well only with character-level input
For classification, deep models pay off only with character-level input.
shallow word-level models are still the state-of-the-art (Zhang et al., 2015; Conneau et al., 2016; Le et al., 2017) [28, 29, 30]
Shallow word-level models remain the state of the art.
Layer connections
Layer connections help address the vanishing gradient problem.
Highway layers (Srivastava et al., 2015) [1] are inspired by the gates of an LSTM.
Highway layers borrow the gating idea from LSTMs.
Highway layers have been used pre-dominantly to achieve state-of-the-art results for language modelling (Kim et al., 2016; Jozefowicz et al., 2016; Zilly et al., 2017) [2, 3, 4], but have also been used for other tasks such as speech recognition (Zhang et al., 2016) [5]
Highway layers are used mainly for language modelling, but also for tasks such as speech recognition.
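A minimal sketch of one highway layer, following the gating formulation y = t * H(x) + (1 - t) * x; the hidden size and ReLU transform are illustrative choices.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """One highway layer: y = t * H(x) + (1 - t) * x, with an LSTM-style gate t."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # produces the gate t

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))       # how much of H(x) to let through
        h = torch.relu(self.transform(x))
        return t * h + (1.0 - t) * x          # the rest of x passes unchanged

layer = Highway(128)
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```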
Residual connections (He et al., 2016) [6] were first proposed for computer vision and were the main factor for winning ImageNet 2016.
Residual connections were proposed for computer vision and were the main factor in winning ImageNet 2016.
This simple modification mitigates the vanishing gradient problem, as the model can default to using the identity function if the layer is not beneficial.
This simple modification mitigates vanishing gradients: if a layer is not useful, the model can default to the identity and pass the input through unchanged.
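The idea in code: a residual block computes y = x + F(x), so learning F(x) close to zero recovers the identity. A minimal sketch, where the two-layer F is an illustrative choice:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): if F is not useful, the block can learn F(x) ~ 0
    and default to the identity, which also eases gradient flow."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)  # the skip connection adds the input back

block = ResidualBlock(128)
print(block(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```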
dense connections (Huang et al., 2017) [7] (best paper award at CVPR 2017) add direct connections from each layer to all subsequent layers.
Dense connections: every layer is directly connected to all later layers (equivalently, each layer receives the outputs of all preceding layers).
Dense connections have been successfully used in computer vision. They have also been found to be useful for Multi-Task Learning of different NLP tasks (Ruder et al., 2017) [49], while a residual variant that uses summation has been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [27].
Dense connections have been used successfully in vision and are also useful for multi-task learning in NLP; for NMT, a residual variant using summation consistently beats plain residual connections.
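A minimal sketch of dense connectivity using fully connected layers (the real DenseNet uses convolutions; the layer widths here are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of the input and all
    previous layers' outputs, i.e. direct connections to every
    subsequent layer."""
    def __init__(self, dim, num_layers):
        super().__init__()
        # Layer i sees the original input plus i earlier outputs, all dim-wide.
        self.layers = nn.ModuleList(
            nn.Linear(dim * (i + 1), dim) for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=-1)))
            features.append(out)  # expose this output to all later layers
        return features[-1]

block = DenseBlock(64, num_layers=3)
print(block(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```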
Dropout
While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivastava et al., 2014) [8] is still the go-to regularizer for deep neural networks in NLP.
In vision, batch normalisation has made most other regularizers obsolete; in NLP, dropout remains the standard regularizer for deep networks.
A dropout rate of 0.5 has been shown to be effective in most scenarios (Kim, 2014).
A dropout rate of 0.5 is effective in most scenarios.
The main problem hindering dropout in NLP has been that it could not be applied to recurrent connections, as the aggregating dropout masks would effectively zero out embeddings over time.
The main obstacle was that dropout could not be applied to recurrent connections: aggregating dropout masks over time steps would effectively zero out the embeddings.
Recurrent dropout (Gal & Ghahramani, 2016) [11]...
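The fix proposed by Gal & Ghahramani is, roughly, to sample one dropout mask per sequence and reuse it at every time step, rather than resampling per step. A minimal sketch of this "variational" masking; the shapes and rate are illustrative:

```python
import torch

def variational_dropout(x, p=0.5, training=True):
    """x: (batch, time, dim). Sample one mask per sequence and reuse it
    across all time steps, so no embedding dimension is repeatedly zeroed
    with a different pattern at every step."""
    if not training or p == 0.0:
        return x
    # One mask per (batch, dim) pair, broadcast over the time axis.
    keep = torch.full((x.size(0), 1, x.size(2)), 1 - p, device=x.device)
    mask = torch.bernoulli(keep)
    return x * mask / (1 - p)  # inverted-dropout scaling

x = torch.randn(2, 5, 8)
print(variational_dropout(x, p=0.5).shape)  # torch.Size([2, 5, 8])
```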
Multi-task learning...
Attention...
Optimization
Adam (Kingma & Ba, 2015) [21] is one of the most popular and widely used optimization algorithms and often the go-to optimizer for NLP researchers. It is often thought that Adam clearly outperforms vanilla stochastic gradient descent (SGD).
Adam is widely used, especially in NLP, and is often thought to clearly outperform vanilla SGD.
While Adam converges much faster than SGD, it has been observed that SGD with learning rate annealing slightly outperforms Adam (Wu et al., 2016). Recent work furthermore shows that SGD with properly tuned momentum outperforms Adam (Zhang et al., 2017) [42].
Adam converges much faster, but SGD with learning rate annealing slightly outperforms it, and SGD with properly tuned momentum also beats Adam.
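A sketch contrasting the two setups in PyTorch; the exponential decay schedule and all hyperparameter values are illustrative stand-ins for "learning rate annealing", not the exact recipe from the cited papers:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a real NLP model

# Option 1: Adam, the common default -- converges fast.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Option 2: SGD with momentum plus learning-rate annealing, which the
# cited results suggest can end up slightly ahead of Adam.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(sgd, gamma=0.95)

for epoch in range(3):  # dummy training loop
    loss = model(torch.randn(4, 10)).pow(2).mean()
    sgd.zero_grad()
    loss.backward()
    sgd.step()
    scheduler.step()  # anneal the learning rate once per epoch
```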
Ensembling
Ensembling is an important way to ensure that results are still reliable if the diversity of the evaluated models increases (Denkowski & Neubig, 2017).
Ensembling multiple diverse models is very useful for keeping results reliable.
ensembling different checkpoints of a model has been shown to be effective (Jean et al., 2015; Sennrich et al., 2016) [51, 52]
Ensembling different checkpoints of one model has been shown to be effective in practice.
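A minimal sketch of checkpoint ensembling by averaging predicted distributions; the toy linear "checkpoints" stand in for snapshots of one model saved during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_predict(models, x):
    """Average the predicted distributions of several checkpoints of the
    same architecture; the consensus is usually more reliable than any
    single checkpoint."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)

# Illustrative: three 'checkpoints' of the same toy classifier.
checkpoints = [nn.Linear(10, 3) for _ in range(3)]
print(ensemble_predict(checkpoints, torch.randn(4, 10)).shape)  # (4, 3)
```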
Hyperparameter optimization
simply tuning the hyperparameters of our model can yield significant improvements over baselines.
Simply tuning hyperparameters can yield significant improvements over baselines.
Automatic tuning of hyperparameters of an LSTM has led to state-of-the-art results in language modeling, outperforming models that are far more complex (Melis et al., 2017).
Automatically tuning an LSTM's hyperparameters yields state-of-the-art language modelling results, beating far more complex models.
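Melis et al. used large-scale black-box tuning; as a much simpler stand-in, here is a random-search sketch over a hypothetical search space (the `evaluate` function is a placeholder for a full training run):

```python
import random

def evaluate(config):
    """Placeholder: train a model with `config` and return dev perplexity."""
    return random.uniform(50, 100)  # stand-in for a real training run

space = {
    "lr": [0.1, 0.03, 0.01, 0.003],
    "dropout": [0.3, 0.5, 0.7],
    "hidden_size": [256, 512, 1024],
}

best_config, best_score = None, float("inf")
for _ in range(20):  # 20 random trials
    config = {k: random.choice(v) for k, v in space.items()}
    score = evaluate(config)
    if score < best_score:
        best_config, best_score = config, score
print(best_config, best_score)
```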
LSTM tricks...
Task-specific best practices
Classification
CNNs have been popular for classification tasks in NLP.
CNNs are widely used for classification tasks in NLP.
Combining filter sizes near the optimal filter size, e.g. (3,4,5) performs best (Kim, 2014; Kim et al., 2016).
Combining several filter sizes near the optimal one, e.g. (3, 4, 5), performs best in practice.
The optimal number of feature maps is in the range of 50-600 (Zhang & Wallace, 2015) [59].
The optimal number of feature maps is roughly in the range 50-600.
1-max-pooling outperforms average-pooling and k-max pooling (Zhang & Wallace, 2015).
1-max pooling outperforms average pooling and k-max pooling. See the sketch below for how these pieces fit together.
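Putting the three classification findings together, a sketch of a Kim (2014)-style classifier: parallel convolutions with filter sizes (3, 4, 5), 100 feature maps each (within the 50-600 range), 1-max pooling, and dropout 0.5. Vocabulary size and class count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """CNN text classifier in the style of Kim (2014)."""
    def __init__(self, vocab_size, emb_dim=300, num_maps=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Parallel convolutions around the optimal filter size.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_maps, kernel_size=k) for k in (3, 4, 5)
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_maps * 3, num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb, seq)
        # 1-max pooling: keep the strongest activation of each feature map.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = KimCNN(vocab_size=10000)
print(model(torch.randint(0, 10000, (4, 50))).shape)  # (4, 2)
```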
Sequence labelling
Tagging scheme: BIO, which marks the first token in a segment with a B- tag, all remaining tokens in the span with an I- tag, and tokens outside of segments with an O tag
BIO: inside a segment, the first token is tagged B and the rest I; tokens outside are tagged O.
IOBES, which in addition distinguishes between single-token entities (S-) and the last token in a segment (E-).
IOBES: inside a segment, the first token is B, middle tokens I, and the last token E; outside tokens are O; single-token segments are tagged S.
Using IOBES and BIO yields similar performance (Lample et al., 2017)
IOBES and BIO perform about the same; a small made-up example of the two schemes follows.
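```python
tokens = ["Barack", "Obama", "visited", "Paris"]

# BIO: first token of a span gets B-, the rest I-, everything else O.
bio   = ["B-PER", "I-PER", "O", "B-LOC"]

# IOBES additionally marks span-final tokens (E-) and single-token spans (S-).
iobes = ["B-PER", "E-PER", "O", "S-LOC"]

for tok, b, i in zip(tokens, bio, iobes):
    print(f"{tok:8s} {b:7s} {i}")
```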
CRF output layer: If there are any dependencies between outputs, such as in named entity recognition, the final softmax layer can be replaced with a linear-chain conditional random field (CRF). This has been shown to yield consistent improvements for tasks that require the modelling of constraints (Huang et al., 2015; Ma & Hovy, 2016; Lample et al., 2016) [60, 61, 62].
When output labels have dependencies, as in NER, the final softmax layer can be replaced by a linear-chain CRF; this consistently improves tasks that require modelling constraints.
Constrained decoding...
Natural language generation
many of the tips presented so far stem from advances in language modelling
Many of the tips so far stem from advances in language modelling.
Modelling coverage: A checklist can be used if it is known in advance which entities should be mentioned in the output, e.g. ingredients in recipes (Kiddon et al., 2016) [63]
If the entities that should appear in the output are known in advance (e.g. recipe ingredients), a checklist can be used to model coverage.
...
Neural machine translation
While neural machine translation (NMT) is an instance of NLG, NMT receives so much attention that many best practices or hyperparameter choices apply exclusively to it.
NMT is an instance of NLG, but receives so much attention that many best practices apply exclusively to it.
- Embedding dimensionality
2048-dimensional embeddings yield the best performance, but only do so by a small margin.
2048-dimensional embeddings perform best, but only by a small margin.
Even 128-dimensional embeddings perform surprisingly well and converge almost twice as quickly (Britz et al., 2017).
Even 128-dimensional embeddings perform surprisingly well and converge almost twice as fast.
- Encoder and decoder depth...
The encoder does not need to be deeper than 2-4 layers.
The encoder does not need to be deeper than 2-4 layers.
Deeper models outperform shallower ones, but more than 4 is not necessary for the decoder
Deeper models beat shallower ones, but more than 4 layers is unnecessary for the decoder.
- Directionality
Bidirectional encoders outperform unidirectional ones by a small margin...
Bidirectional encoders beat unidirectional ones, but only by a small margin.
- Beam search strategy
Medium beam sizes around 10 with length normalization penalty of 1.0 (Wu et al., 2016) yield the best performance (Britz et al., 2017).
A medium beam size around 10, combined with a length normalization penalty of 1.0, works best.
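A sketch of the length normalization from Wu et al. (2016), lp(Y) = (5 + |Y|)^a / 6^a, used to rescore a finished beam; the hypotheses and their log-probabilities are made up:

```python
def length_penalty(length, alpha=1.0):
    """Length normalization from Wu et al. (2016): (5 + len)^alpha / 6^alpha."""
    return ((5.0 + length) ** alpha) / (6.0 ** alpha)

def rescore(hypotheses, alpha=1.0):
    """hypotheses: list of (tokens, log_prob) pairs from beam search.
    Dividing by the length penalty keeps longer candidates competitive."""
    scored = [(toks, logp / length_penalty(len(toks), alpha))
              for toks, logp in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative beam of three finished hypotheses with made-up scores.
beam = [(["a", "cat"], -1.2), (["a", "small", "cat"], -1.5), (["cat"], -1.1)]
print(rescore(beam, alpha=1.0))
```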
Sub-word translation...