RNNs and Language modeling in TensorFlow

From feed-forward to Recurrent Neural Networks (RNNs)

In the last few weeks, we've seen how feed-forward and convolutional neural networks have achieved incredible results. They perform on par with, and sometimes even outperform, humans on many different tasks.

Despite their seemingly magical properties, these models are still very limited. Humans aren't built to just do linear or logistic regression, or recognize individual objects. We can understand, communicate, and create. The inputs we deal with aren't just singular data points, but sequences that are rich in information and complex in time dependencies. The languages we use are sequential. The music we listen to is sequential. The TV shows we watch are sequential. The question is: how can we make our models capable of processing sequences of inputs, with all their intricacies, the way humans do?

RNNs were created with the aim of capturing this sequential information. The Simple Recurrent Network (SRN) was first introduced by Jeff Elman in a paper entitled "Finding structure in time" (Elman, 1990). As Professor James McClelland wrote in his book:

"The paper was groundbreaking for many cognitive scientists and psycholinguists, since it was the first to completely break away from a prior commitment to specific linguistic units (e.g. phonemes or words), and to explore the vision that these units might be emergent consequences of a learning process operating over the latent structure in the speech stream. Elman had actually implemented an earlier model in which the input and output of the network was a very low-level spectrogram-like representation, trained using a spectral information extracted from a recording of his own voice saying 'This is the voice of the neural network'."

RNNs are built from the same computational unit, the neuron, as feed-forward neural networks. However, they differ in how these neurons are connected to one another. Feed-forward neural networks are organized in layers: signals are passed in one direction only (from inputs to outputs) and loops aren't allowed. RNNs, on the contrary, allow neurons to connect to themselves. This makes it possible to take the notion of time into account, since the state of a neuron at the previous step can affect its state at the current step.

Graph from "A survey on the application of recurrent neural networks to statistical language modeling." by De Mulder et al. Computer Speech & Language 30.1 (2015): 61-98.

Elman's SRN did exactly this. In his early model, the hidden layer at the current step is a function of both the input at that step and the hidden layer from the previous step. A few years before Elman, Jordan had developed a similar network, except that instead of taking in the hidden layer from the previous step, the hidden layer at the current step takes in the output from the previous step. Wikipedia has a side-by-side comparison of these two early simple recurrent networks.
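
For concreteness, here is a rough sketch of the two recurrences in my own notation (not taken from the original papers), where x_t is the input, h_t the hidden layer, and y_t the output at step t:

$$h_t^{\text{Elman}} = \sigma(W_x x_t + W_h h_{t-1} + b_h), \qquad h_t^{\text{Jordan}} = \sigma(W_x x_t + W_o y_{t-1} + b_h), \qquad y_t = \sigma(W_y h_t + b_y)$$

The only difference is what gets fed back: Elman recycles the previous hidden state, Jordan recycles the previous output.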

People often illustrate RNNs as neurons connecting to themselves, but you might find it easier to think of the network as unrolled in time: each copy of the neuron corresponds to one time step. For example, in the context of Natural Language Processing (NLP), if your input is a sentence of 10 tokens, each time step corresponds to one token. All the time steps share the same weights (because they are essentially the same neuron), which greatly reduces the total number of parameters we have to learn.

Graph by Nature

Most people think of RNNs in the context of NLP because language is highly sequential. Indeed, the first RNNs were built for NLP tasks, and many NLP tasks nowadays are solved using RNNs. However, they can also be used for tasks dealing with audio, images, or videos. For example, you can train an RNN to do object recognition on MNIST, treating each image as a sequence of pixels.

Back-propagation through Time (BPTT)

In a feed-forward or convolutional neural network, errors are back-propagated from the loss to all the layers. These errors are used to update the parameters (weights, biases) according to the update rule we specify (gradient descent, Adam, ...) to decrease the loss.

In a recurrent neural network, errors are back-propagated from the loss to all the timesteps. The two main differences are:

  1. Each layer in a feed-forward network has its own parameters, while all the timesteps in an RNN share the same parameters. We use the sum of the gradients over all the timesteps to update the parameters for each training sample/batch.
  2. A feed-forward network has a fixed number of layers, while an RNN can have an arbitrary number of timesteps depending on the length of the sequence.

Point 2 means that if your sequence is long (say, 1,000 steps corresponding to 1,000 words in a text document), back-propagating through all those steps is computationally expensive. Another problem is that the gradients can grow or shrink exponentially as they are propagated back through many timesteps, which leads to exploding or vanishing gradients. Denny Britz has a great blog post on BPTT and exploding/vanishing gradients.
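
To see where the exponential behavior comes from, here is a rough sketch in my own notation (not taken from that post): the gradient of the loss at step t with respect to an earlier hidden state is a product of one Jacobian per intervening timestep,

$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

If the norms of these per-step Jacobians are consistently smaller than 1, the product shrinks exponentially with the distance t - k (vanishing gradients); if they are consistently larger than 1, it grows exponentially (exploding gradients).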

Graph by Denny Britz

To avoid having to do the full parameter update over all the timesteps, we often limit the number of timesteps we back-propagate through, resulting in what is known as truncated BPTT. This speeds up computation at each update. The downside is that we can only propagate errors back through a limited number of timesteps, so the network won't be able to learn dependencies from the beginning of time (e.g., from the beginning of a text).

In TensorFlow, RNNs are created using the unrolled version of the network. In the non-eager mode of TensorFlow, this means that a fixed number of timesteps is specified before executing the computation, and the graph can only handle inputs with exactly that number of timesteps. This can be problematic since we don't usually have inputs of the exact same length. For example, one paragraph might have 20 words while another might have 200. A common practice is to divide the data into buckets, putting samples of similar lengths into the same bucket. All samples in one bucket are then either padded with zero tokens or truncated to the same length.
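
As a rough sketch of how this looks with the TensorFlow 1.x graph-mode API (the sizes, placeholder names, and bucket length here are my own, not the course code):

```python
import tensorflow as tf

# One bucket whose sequences are padded/truncated to 50 steps.
batch_size, max_len, vocab_size, hidden_size = 64, 50, 128, 256

tokens = tf.placeholder(tf.int32, [batch_size, max_len])   # padded token ids
seq_len = tf.placeholder(tf.int32, [batch_size])           # true lengths before padding

embed_matrix = tf.get_variable('embed_matrix', [vocab_size, hidden_size])
inputs = tf.nn.embedding_lookup(embed_matrix, tokens)

cell = tf.nn.rnn_cell.GRUCell(hidden_size)
# dynamic_rnn unrolls the cell over the time dimension; sequence_length tells it
# to stop updating the state once each sequence runs past its true length.
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=seq_len, dtype=tf.float32)
```

The padding tokens still occupy memory, but thanks to sequence_length they don't corrupt the final state of the shorter sequences.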

Gated Recurrent Unit (LSTM and GRU)

In practice, RNNs have proven to be really bad at capturing long-term dependencies. To address this drawback, people have been using Long Short-Term Memory (LSTM) units. The rise of LSTM in the last 3 years makes it seem like a new idea, but it's actually a pretty old concept. It was proposed in the mid-90s by two German researchers, Sepp Hochreiter and Jürgen Schmidhuber, as a solution to the vanishing gradient problem. Like many ideas in AI, LSTM has only become popular in the last few years, thanks to the increasing computational power that allows it to work.

Google Trends, Feb 19, 2018

LSTM units use what's called a gating mechanism. They include four gates: the input, output, and forget gates, generally denoted i, o, and f, plus a candidate/new memory gate.

It seems like everyone in academia has a different diagram to visualize LSTM units, and all of them are inevitably confusing. One of the diagrams that I find less confusing is the one created by Mohammadi et al. for CS224D's lecture notes.

Intuitively, we can think of the gates as controlling what information enters and leaves the cell at each timestep; the equations sketched right after the list below make these roles concrete. All the gates have the same dimensions.

  • input gate: decides how much of the current input to let through.
  • forget gate: defines how much of the previous state to take into account.
  • output gate: defines how much of the hidden state to expose to the next timestep.
  • candidate gate: similar to the original RNN update, this gate computes a candidate hidden state based on the previous hidden state and the current input.
  • final memory cell: the internal memory of the unit combines the candidate hidden state with the input and forget gate information. The final memory cell is then combined with the output gate to decide how much of it is exposed as the hidden state for the current step.
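
Putting these together, a standard formulation of the LSTM update looks like this (written in my own notation; the diagrams above may use different symbols):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive update of c_t is what lets gradients flow across many timesteps without vanishing as quickly as in a plain RNN.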

LSTM is not the only gating mechanism aimed at improving RNNs' ability to capture long-term dependencies. GRU (gated recurrent unit) uses a similar mechanism with significant simplification. It combines LSTM's forget and input gates into a single "update gate," and it merges the candidate/new cell state and the hidden state. The resulting GRU is much simpler than the standard LSTM, while its performance has been shown to be on par with that of LSTM on several benchmark tasks. The simplicity of GRU also means that it requires less computation, which, in theory, should reduce computation time. However, as far as I know, no significant runtime improvement of GRU over LSTM has been observed.

Graph from CS224D's lecture notes

In TensorFlow, I prefer using GRU cells as they're a lot less cumbersome. GRU cells in TensorFlow return a single hidden-state tensor as their state, while LSTM cells return both a cell state and a hidden state.
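
A minimal sketch of that difference with the TensorFlow 1.x cell API (the shapes and variable names here are mine):

```python
import tensorflow as tf

hidden_size = 256
inputs = tf.placeholder(tf.float32, [64, 10, 128])  # [batch, time, features]

# GRU: the state is a single tensor of shape [batch, hidden_size].
gru_cell = tf.nn.rnn_cell.GRUCell(hidden_size)
_, gru_state = tf.nn.dynamic_rnn(gru_cell, inputs, dtype=tf.float32, scope='gru')

# LSTM: the state is an LSTMStateTuple (c, h), i.e. a cell state plus a hidden state.
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
_, lstm_state = tf.nn.dynamic_rnn(lstm_cell, inputs, dtype=tf.float32, scope='lstm')
cell_state, hidden_state = lstm_state.c, lstm_state.h
```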

Application: Language modeling

Given a sequence of words, we want to predict the distribution of the next word given all the previous words. This ability to predict the next word gives us a generative model: we can generate new text by sampling from the output probabilities. Depending on what our training data is, we can generate all kinds of stuff. You can read Andrej Karpathy's blog post about some of the funky results he got using a char-RNN, which is an RNN applied at the character level instead of the word level.

When building a language model, our input is typically a sequence of words (or characters, as with char-RNN, or something in between like subwords), and our output is the distribution over the next word.
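
As a small sketch of what "sampling from the output probabilities" can look like at generation time (a hypothetical NumPy helper, with a temperature knob to control how conservative the samples are):

```python
import numpy as np

def sample_next(logits, temperature=1.0):
    """Sample one token id from the model's output logits for the next position."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```

Lower temperatures make the model pick its most likely tokens almost greedily; higher temperatures produce more varied (and more garbled) text.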

In this exercise, we will build one char-RNN model on two datasets: Donald Trump's tweets and arXiv abstracts. The arXiv abstracts dataset consists of 20,466 unique abstracts, most in the range of 500-2,000 characters. The Donald Trump tweets dataset consists of all his tweets up until Feb 15, 2018, with most of the retweets filtered out. There are 19,469 tweets in total, each of fewer than 140 characters. We did some minor data preprocessing: replacing all URLs with __HTTP__ (in hindsight, I should have used a shorter token, such as _URL_ or just _U_) and adding the end token _E_.
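
A minimal sketch of that preprocessing step (the regex and function name are my own, not the course code):

```python
import re

def preprocess_tweet(text):
    # Replace URLs with the __HTTP__ token and mark the end of each tweet with _E_.
    text = re.sub(r'https?://\S+', '__HTTP__', text)
    return text + ' _E_'
```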

Below are some of the outputs from our presidential tweet bot.

I will be interviewed on @foxandfriends tonight at 10:00 P.M. and the #1 to construct the @WhiteHouse tonight at 10:00 P.M. Enjoy __HTTP__

I will be interviewed on @foxandfriends at 7:00 A.M. and the only one that we will MAKE AMERICA GREAT AGAIN #Trump2016 __HTTP__ __HTTP__

No matter the truth and the world that the Fake News Media will be a great new book #Trump2016 __HTTP__ __HTTP__

Great poll thank you for your support of Monday at 7:30 A.M. on NBC at 7pm #Trump2016 #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__

The Senate report to our country is a total disaster. The American people who want to start like a total disaster. The American should be the security 5 star with a record contract to the American peop

.@BarackObama is a great president of the @ApprenticeNBC

No matter how the U.S. is a complete the ObamaCare website is a disaster.

Here's an abstract generated by our arXiv abstract bot.

"Deep learning neural network architectures can be used to best developing a new architectures contros of the training and max model parametrinal Networks (RNNs) outperform deep learning algorithm is easy to out unclears and can be used to train samples on the state-of-the-art RNN more effective Lorred can be used to best developing a new architectures contros of the training and max model and state-of-the-art deep learning algorithms to a similar pooling relevants. The space of a parameter to optimized hierarchy the state-of-the-art deep learning algorithms to a simple analytical pooling relevants. The space of algorithm is easy to outions of the network are allowed at training and many dectional representations are allow develop a groppose a network by a simple model interact that training algorithms to be the activities to maximul setting, …"

For the code, please see the class's GitHub. You can also refer to the lecture slides for more information on the code.
