【NLP】Recurrent Neural Network and Language Models
0. Overview
What is a language model?
A time series prediction problem.
It assigns a probability to a sequence of words, and the probabilities of all possible sequences sum to one.
Many Natural Language Processing tasks can be structured as (conditional) language modelling.
For example, translation:
P(Chinese text | English text)
Note that this probability factorises by the chain rule (repeated application of Bayes' formula).
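To make this concrete, the chain rule writes the probability of a whole sequence as a product of per-word conditional probabilities, each conditioning on everything that came before:

```latex
P(w_1, w_2, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})
```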
How to evaluate a Language Model?
Measured with cross-entropy (equivalently, perplexity).
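Concretely, the per-word cross-entropy on a test corpus of N words, and the closely related perplexity, are:

```latex
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \dots, w_{i-1}), \qquad \mathrm{perplexity} = 2^{H}
```

Lower cross-entropy (and perplexity) means the model assigns higher probability to the held-out text.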
Three data sets:
1 Penn Treebank: www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
2 Billion Word Corpus: code.google.com/p/1-billion-word-language-modeling-benchmark/
3 WikiText datasets: Pointer Sentinel Mixture Models. Merity et al., arXiv 2016
Overview: three approaches to building language models:
Count-based n-gram models: approximate the history of observed words with just the previous n words.
Neural n-gram models: embed the same fixed n-gram history in a continuous space, and thus better capture correlations between histories.
Recurrent Neural Networks: drop the fixed n-gram history and compress the entire history into a fixed-length vector, enabling long-range correlations to be captured.
1. N-Gram models:
Assumptions:
Only the previous history matters.
Only the previous k − 1 words are included in the history.
This gives a kth-order Markov model.
2-gram language model:
The conditioning context, w_{i-1}, is called the history.
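Under the 2-gram assumption the full chain-rule product collapses to:

```latex
P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})
```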
Estimate probabilities:
For example, for a 3-gram the maximum-likelihood estimate is
P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2),
i.e. count how often w1, w2, w3 appear together in the corpus, relative to how often the history w1, w2 appears.
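A minimal Python sketch of this counting estimate (the toy corpus and function name are only for illustration; a real model would be estimated on a corpus such as the Penn Treebank above):

```python
from collections import Counter

# Toy corpus, whitespace-tokenised; purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate P(w3 | w1, w2) = count(w1,w2,w3) / count(w1,w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("the", "cat", "sat"))  # 0.5: "the cat" is followed by "sat" once and "ate" once
```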
Interpolated Back-Off:
That is, some phrases never appear in the corpus, so their estimated probability is zero. To avoid this, we use interpolated back-off: interpolate the lower-order k-gram models (k = n−1, n−2, …, 1) into the n-gram model.
A simple approach is linear interpolation:
P_interp(w3 | w1, w2) = λ3·P(w3 | w1, w2) + λ2·P(w3 | w2) + λ1·P(w3), with λ1 + λ2 + λ3 = 1.
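A small sketch of this linear interpolation, reusing the same kind of counts (the lambda weights here are arbitrary placeholders; in practice they are tuned on held-out data):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def interpolated_prob(w1, w2, w3, lambdas=(0.5, 0.3, 0.2)):
    """Interpolate trigram, bigram and unigram ML estimates; the lambdas sum to one."""
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    l3, l2, l1 = lambdas
    return l3 * p3 + l2 * p2 + l1 * p1

# "the dog sat" never occurs, but the unigram term keeps the probability non-zero.
print(interpolated_prob("the", "dog", "sat"))
```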
Summary for n-gram:
Good: easy to train. Fast.
Bad: Large n-grams are sparse. Hard to capture long-range dependencies. Cannot capture correlations between similar word distributions. Cannot capture morphological regularities between words (e.g. running vs. jumping).
2. Neural N-Gram Language Models
Use a feed-forward network like:
Take a trigram (3-gram) neural network language model as an example:
The w_i are one-hot vectors and the p_i are output distributions; both have dimension |V| (the number of words in the vocabulary).
(A sample: the detailed computation graph.)
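A minimal PyTorch sketch of such a trigram neural LM (the layer sizes are illustrative assumptions, not values from the original notes): the two previous words are embedded, concatenated, passed through a non-linear hidden layer, and projected to |V| logits.

```python
import torch
import torch.nn as nn

class TrigramNLM(nn.Module):
    """Feed-forward trigram LM: models p(w_i | w_{i-2}, w_{i-1})."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # replaces the explicit one-hot matmul
        self.hidden = nn.Linear(2 * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, w_prev2, w_prev1):
        # w_prev2, w_prev1: integer word indices of shape (batch,)
        x = torch.cat([self.embed(w_prev2), self.embed(w_prev1)], dim=-1)
        h = torch.tanh(self.hidden(x))
        return self.out(h)  # unnormalised logits over the vocabulary
```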
Define the loss: cross-entropy:
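Written out for the trigram case, the loss over the training corpus is the negative log-probability of each word given its two predecessors:

```latex
\mathcal{L}(\theta) = -\sum_{i} \log p_\theta(w_i \mid w_{i-2}, w_{i-1})
```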
Training: use Gradient Descent
And a sample of training:
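A sketch of one gradient-descent step with the cross-entropy loss, assuming the TrigramNLM class above and integer-encoded training triples (batch size, learning rate and vocabulary size are illustrative):

```python
import torch
import torch.nn as nn

vocab_size = 10_000
model = TrigramNLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy between predicted logits and the true next word

# One toy batch: predict w_i from (w_{i-2}, w_{i-1}).
w_prev2 = torch.randint(0, vocab_size, (32,))
w_prev1 = torch.randint(0, vocab_size, (32,))
target = torch.randint(0, vocab_size, (32,))

logits = model(w_prev2, w_prev1)
loss = loss_fn(logits, target)
optimizer.zero_grad()
loss.backward()   # gradients of the cross-entropy w.r.t. all parameters
optimizer.step()  # one gradient-descent update
```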
Comparison with count-based n-gram LMs:
Good: Better performance on unseen n-grams, but poorer on seen n-grams (solution: add direct, i.e. linear, n-gram features). Uses less memory than count-based n-gram models.
Bad: The number of parameters in the model scales with the n-gram size, and there is a limit on the longest dependency that can be captured.
3. Recurrent Neural Network LM
That is, we use a recurrent neural network to build the LM.
Model and training:
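The recurrence assumed here is the standard (Elman-style) RNN LM: the hidden state compresses the entire history, and the next-word distribution is read off it at every step:

```latex
h_t = g\!\left(W_{hh} h_{t-1} + W_{xh} x_t + b_h\right), \qquad \hat{p}_t = \mathrm{softmax}\!\left(W_{hy} h_t + b_y\right)
```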
Algorithm: Back-Propagation Through Time (BPTT)
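A minimal PyTorch sketch of full BPTT (the model and hyperparameters are illustrative assumptions): the RNN is run over the whole sequence, the cross-entropy is summed over every time step, and a single backward pass propagates gradients through the entire unrolled graph.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: embedding -> RNN -> projection to |V| logits."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, h0=None):
        h_seq, h_last = self.rnn(self.embed(tokens), h0)
        return self.out(h_seq), h_last

vocab_size = 10_000
model = RNNLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Full BPTT over one toy sequence: the loss at every position is backpropagated
# through the whole unrolled computation graph.
seq = torch.randint(0, vocab_size, (1, 50))
logits, _ = model(seq[:, :-1])                 # predict the next word at each position
loss = loss_fn(logits.reshape(-1, vocab_size), seq[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```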
Note:
Note that each gradient-descent step with full BPTT depends on the entire sequence, which becomes expensive for long sequences. The improved algorithm is:
Algorithm: Truncated Back-Propagation Through Time (TBPTT)
So the computation graph looks like this:
So the training process and gradient descent look like this:
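A sketch of truncated BPTT under the same assumptions (reusing the RNNLM, optimizer and loss above): the sequence is processed in windows of k steps, and the hidden state is carried forward between windows but detached, so gradients never flow back further than k steps.

```python
k = 10                                             # truncation length, illustrative
long_seq = torch.randint(0, vocab_size, (1, 200))
hidden = None

for start in range(0, long_seq.size(1) - 1, k):
    inputs = long_seq[:, start:start + k]
    targets = long_seq[:, start + 1:start + k + 1]
    inputs = inputs[:, :targets.size(1)]           # keep inputs and targets aligned at the tail
    logits, hidden = model(inputs, hidden)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                # gradients only within this k-step window
    optimizer.step()
    hidden = hidden.detach()                       # cut the graph before the next window
```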
Summary of the Recurrent NN LMs:
Good:
RNNs can represent unbounded dependencies, unlike models with a fixed n-gram order.
RNNs compress histories of words into a fixed size hidden vector.
The number of parameters does not grow with the length of the dependencies captured, but it does grow with the amount of information stored in the hidden layer.
Bad:
RNNs are hard to train and often will not discover long-range dependencies present in the data (this motivates the LSTM unit).
Increasing the size of the hidden layer, and thus memory, increases the computation and memory quadratically.
Mostly trained with Maximum Likelihood based objectives which do not encode the expected frequencies of words a priori.
Some recommended blogs:
Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks. karpathy.github.io/2015/05/21/rnn-effectiveness/
Yoav Goldberg: The unreasonable effectiveness of Character-level Language Models. nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
Stephen Merity: Explaining and illustrating orthogonal initialization for recurrent neural networks. smerity.com/articles/2016/orthogonal_init.html