From: Predicting Movie Review Sentiment with TensorFlow and TensorBoard

Ref: http://www.cnblogs.com/libinggen/p/6939577.html

Ref: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

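One reason to use an LSTM: in a deep RNN, gradient errors accumulate over many time steps until the gradient either vanishes to zero or blows up toward infinity, at which point optimization can no longer make progress. The LSTM was designed to address this problem.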
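
The accumulation problem can be illustrated with a toy calculation. The sketch below is only an illustration, not code from the referenced posts: it assumes that each time step multiplies the back-propagated gradient by one fixed scalar (`recurrent_factor`, a hypothetical name), which is a strong simplification of a real RNN.

```python
def backprop_gradient_scale(recurrent_factor, num_steps):
    """Scale of a gradient after flowing back through num_steps time steps,
    assuming every step multiplies it by the same scalar factor."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= recurrent_factor
    return scale


# A factor below 1 shrinks the gradient toward zero, a factor above 1
# makes it explode, and a factor of exactly 1 leaves it unchanged.
for factor in (0.5, 1.0, 1.5):
    print(factor, backprop_gradient_scale(factor, 100))
```

An LSTM updates its cell state additively, so when the forget gate stays close to 1 the per-step factor along that path also stays close to 1, which is the stable middle case above.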

Thanks to Jürgen Schmidhuber


Using the data from an old Kaggle competition, “Bag of Words Meets Bags of Popcorn”.

import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re, time
from nltk.corpus import stopwords
from collections import defaultdict
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import namedtuple

Preprocessing

The data is formatted as .tsv

  • Remove stopwords
  • Convert words to lower case

def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''

    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    # (requires the NLTK stopword list: nltk.download('stopwords'))
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]

    text = " ".join(text)

    # Clean the text: drop HTML line breaks and anything that is not a-z
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()

    # Return the cleaned review as a single string
    return text

Data clean
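
The step that produces the cleaned review lists is not shown here. As a rough sketch of what it could look like: the snippet below reads a tab-separated file and cleans each review. The two sample rows, the column names (`id`, `sentiment`, `review`) and the `simple_clean` helper are assumptions made for illustration; `simple_clean` is a stand-in that skips stopword removal so the example runs without NLTK.

```python
import csv
import io
import re

# Two made-up rows standing in for the competition's .tsv file
# (assumed columns: id, sentiment, review).
sample_tsv = (
    "id\tsentiment\treview\n"
    "1\t1\tA great film.<br />Loved it!\n"
    "2\t0\tDull and far too long.\n"
)


def simple_clean(text):
    """Hypothetical stand-in for clean_text, without stopword removal."""
    text = text.lower()
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
rows = list(reader)

train_clean = [simple_clean(row["review"]) for row in rows]
train_labels = [int(row["sentiment"]) for row in rows]
print(train_clean)
print(train_labels)
```

With the real data, `io.StringIO(sample_tsv)` would be replaced by an open file handle, and the same loop would be run once for the training file and once for the test file.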

Tokenize

# Tokenize the reviews
all_reviews = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_reviews)
print("Fitting is complete.")

train_seq = tokenizer.texts_to_sequences(train_clean)
print("train_seq is complete.")

test_seq = tokenizer.texts_to_sequences(test_clean)
print("test_seq is complete")

word_index = tokenizer.word_index

NB: punctuation is useful!

[“The”, “cat”, “went”, “to”, “the”, “zoo”, “.”] --> [1, 2, 3, 4, 1, 5, 6]
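
The mapping in the example above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the Keras implementation: the function names are made up for illustration, and ids are handed out in order of first appearance. The Keras `Tokenizer` orders its `word_index` by word frequency and, by default, filters punctuation out.

```python
def fit_word_index(token_lists):
    """Give each distinct lower-cased token an integer id,
    in order of first appearance, starting at 1."""
    word_index = {}
    for tokens in token_lists:
        for token in tokens:
            key = token.lower()
            if key not in word_index:
                word_index[key] = len(word_index) + 1
    return word_index


def to_sequence(tokens, word_index):
    """Replace every token with its integer id."""
    return [word_index[token.lower()] for token in tokens]


sentence = ["The", "cat", "went", "to", "the", "zoo", "."]
word_index = fit_word_index([sentence])
print(to_sequence(sentence, word_index))  # [1, 2, 3, 4, 1, 5, 6]
```

Because "The" and "the" are lower-cased to the same key, both map to id 1, and the full stop gets its own id.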

Limiting your vocabulary

Your model should benefit from limiting your vocabulary to more common words, because it has seen each of those words in the text multiple times.
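
One way to limit the vocabulary is to keep only the most frequent words and map everything else to a shared "unknown" id. The sketch below assumes id 0 is reserved for padding and id 1 for out-of-vocabulary words; those conventions and the function names are illustrative choices. In Keras, the `Tokenizer` takes a `num_words` argument for the same purpose.

```python
from collections import Counter


def build_limited_vocab(token_lists, max_words):
    """Map the max_words most frequent tokens to ids starting at 2
    (0 is assumed to be padding, 1 out-of-vocabulary)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def encode(tokens, vocab, oov_id=1):
    """Encode tokens, sending anything outside the vocabulary to oov_id."""
    return [vocab.get(token, oov_id) for token in tokens]


reviews = [
    "good movie good cast".split(),
    "bad movie".split(),
    "good plot twist".split(),
]
vocab = build_limited_vocab(reviews, max_words=2)
print(vocab)
print([encode(review, vocab) for review in reviews])
```

Rare words such as "cast" or "twist" occur once each here, so they all collapse into the out-of-vocabulary id instead of getting embeddings the model has barely seen.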

Reviews with the same length

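Every review fed to the model has to have the same length, so shorter reviews are padded and longer ones are truncated. I limited mine to 200 words to increase the training speed of my model.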
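
Bringing every review to one length can be sketched as follows. The helper name is made up for illustration; it pads on the left and keeps the last `max_length` items, which mirrors the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`).

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Left-pad short sequences with pad_value and keep only the
    last max_length items of long ones."""
    if len(sequence) >= max_length:
        return sequence[-max_length:]
    return [pad_value] * (max_length - len(sequence)) + sequence


print(pad_or_truncate([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

A shorter `max_length` means fewer time steps for the LSTM to unroll, which is where the training speed-up comes from.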

Build Graph with LSTM

def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers,
              dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers. Each layer gets its own cell object:
    # repeating one object with [drop] * num_layers makes every layer
    # share a single cell, which later TF 1.x releases reject or
    # mis-handle when num_layers > 1.
    with tf.name_scope("RNN_layers"):
        def make_cell():
            lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
            return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

        cell = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])

    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(cell,
                                                 embed,
                                                 initial_state=initial_state)

    # Create the fully connected layers
    with tf.name_scope("fully_connected"):

        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()

        # Only the output of the last time step is passed on
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                                                  num_outputs=fc_units,
                                                  activation_fn=tf.sigmoid,
                                                  weights_initializer=weights,
                                                  biases_initializer=biases)
        dense = tf.contrib.layers.dropout(dense, keep_prob)

        # Depending on the iteration, use a second fully connected layer
        if multiple_fc:
            dense = tf.contrib.layers.fully_connected(dense,
                                                      num_outputs=fc_units,
                                                      activation_fn=tf.sigmoid,
                                                      weights_initializer=weights,
                                                      biases_initializer=biases)
            dense = tf.contrib.layers.dropout(dense, keep_prob)

    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense,
                                                        num_outputs=1,
                                                        activation_fn=tf.sigmoid,
                                                        weights_initializer=weights,
                                                        biases_initializer=biases)
        tf.summary.histogram('predictions', predictions)

    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)

    # Train the model
    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)

    # Merge all of the summaries
    merged = tf.summary.merge_all()

    # Export the nodes
    export_nodes = ['inputs', 'labels', 'keep_prob', 'initial_state',
                    'final_state', 'accuracy', 'predictions', 'cost',
                    'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])

    return graph

Ref: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

That article describes several approaches:

Simple LSTM for Sequence Classification

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
 
Epoch 1/3
16750/16750 [==============================] - 107s - loss: 0.5570 - acc: 0.7149
Epoch 2/3
16750/16750 [==============================] - 107s - loss: 0.3530 - acc: 0.8577
Epoch 3/3
16750/16750 [==============================] - 107s - loss: 0.2559 - acc: 0.9019
Accuracy: 86.79%

LSTM For Sequence Classification With Dropout

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
  
Epoch 1/3
16750/16750 [==============================] - 108s - loss: 0.5802 - acc: 0.6898
Epoch 2/3
16750/16750 [==============================] - 108s - loss: 0.4112 - acc: 0.8232
Epoch 3/3
16750/16750 [==============================] - 108s - loss: 0.3825 - acc: 0.8365
Accuracy: 85.56%

LSTM and Convolutional Neural Network For Sequence Classification

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
 
Epoch 1/3
16750/16750 [==============================] - 58s - loss: 0.5186 - acc: 0.7263
Epoch 2/3
16750/16750 [==============================] - 58s - loss: 0.2946 - acc: 0.8825
Epoch 3/3
16750/16750 [==============================] - 58s - loss: 0.2291 - acc: 0.9126
Accuracy: 86.36%

For 1D convolution code, see: http://spaces.ac.cn/archives/4195/
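
The effect of the convolution and pooling layers on sequence length can be worked out by hand. The sketch below applies the usual output-length formulas for a 1D convolution and for max pooling with stride equal to the pool size. The value of `max_review_length` is a hypothetical example, since the listing above does not state it, and the function names are made up for illustration.

```python
import math


def conv1d_output_length(input_length, kernel_size, padding, stride=1):
    """Output length of a 1D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return math.ceil(input_length / stride)
    if padding == "valid":
        return math.ceil((input_length - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


def max_pool1d_output_length(input_length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return (input_length - pool_size) // pool_size + 1


max_review_length = 500  # hypothetical value for illustration
after_conv = conv1d_output_length(max_review_length, kernel_size=3, padding="same")
after_pool = max_pool1d_output_length(after_conv, pool_size=2)
print(after_conv, after_pool)
```

With `padding='same'` the convolution keeps the sequence length, and pooling with `pool_size=2` halves it, so the LSTM sees half as many time steps. That is one plausible reason the epoch time above drops from about 107s to 58s.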
 

 
Combining these models with feature engineering can improve performance further.

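Feature engineering

A lot of effort went into text mining: keyword extraction, sentiment analysis, clustering of word embeddings and so on were all tried, but none of them worked very well.

For text features, the advice is still to find high-frequency words other than stopwords and look for their concrete connection to this housing classification problem.

Now for the headache. We have the data, and we need a way to extract discriminative features from it.

  • For example, the word2vec approach offered on the Kaggle problem's tutorial page is one way of mapping text to numeric features.
  • For example, the keyword extraction from user information mentioned in section 6 is also a form of feature extraction.
  • For example, here we plan to use a feature that works very well in text retrieval systems: the TF-IDF (term frequency-inverse document frequency) vector. Each movie review is ultimately converted into one TF-IDF vector.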

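A brief explanation: TF-IDF is a statistical method for evaluating how important a word (or n-gram) is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. This makes it an effective way to identify the words or phrases that most influence whether a review is positive or negative.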
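
The weighting just described can be written out in a few lines. This is a minimal sketch of the textbook form, tf × log(N / df), using made-up reviews. It is not the scikit-learn implementation, whose `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the movie was great great fun".split(),
    "the movie was dull".split(),
    "the plot was thin".split(),
]
vectors = tf_idf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

"the" appears in every document, so its weight is zero, while "great" is frequent in the first review and absent elsewhere, so it scores highest there.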

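Well, the blogger is going to keep being lazy and use scikit-learn's TF-IDF vectorizer directly; readers who want the details can look at the sklearn TFIDF vectorizer class. A few more words on my processing details: I dropped the stopwords, and I extended the features from single words to a 2-gram (bigram) model. You could add 3-grams and 4-grams as well, but a single machine runs out of memory, so 2-grams will have to do for now.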
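
Dropping stopwords and extending single words to bigrams can be sketched in plain Python as below. The stopword set and function names are made up for illustration. With scikit-learn, the comparable settings on `TfidfVectorizer` are its `stop_words` and `ngram_range=(1, 2)` parameters.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_and_bigram_features(text, stop_words):
    """Lower-case the text, drop stopwords, then emit 1-grams and 2-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return tokens + ngrams(tokens, 2)


stop_words = {"the", "was", "a"}  # tiny hypothetical stopword list
features = unigram_and_bigram_features("The plot was not good", stop_words)
print(features)
```

The bigram "not good" carries sentiment that the separate words "not" and "good" lose. The number of distinct n-grams grows quickly with n, which is why memory becomes the limit for 3-grams and 4-grams.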

End.
