This tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment classification.

from __future__ import absolute_import, division, print_function, unicode_literals
# !pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow_datasets as tfds
import tensorflow as tf

Import matplotlib and create a helper function to plot graphs.

import matplotlib.pyplot as plt

def plot_graphs(history, string):
    # Plot a training metric and its validation counterpart per epoch.
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

1. Set up the input pipeline

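The IMDB large movie review dataset is a binary classification dataset: every review carries either a positive or a negative sentiment label.

Download the dataset using TFDS. The dataset comes with a built-in subword tokenizer.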

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
 as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

Because this is a subword tokenizer, it can be passed any string and the tokenizer will tokenize it.

tokenizer = info.features['text'].encoder
print ('Vocabulary size: {}'.format(tokenizer.vocab_size))
 Vocabulary size: 8185
sample_string = 'TensorFlow is cool.'
tokenized_string = tokenizer.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
original_string = tokenizer.decode(tokenized_string)
print ('The original string: {}'.format(original_string))
assert original_string == sample_string
 Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
 The original string: TensorFlow is cool.

If a string is not in the vocabulary, the tokenizer encodes it by breaking it into substrings.

for ts in tokenized_string:
 print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))
 6307 ----> Ten
 2327 ----> sor
 4043 ----> Fl
 4265 ----> ow
 9 ----> is
 2724 ----> cool
 7975 ----> .
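
The split printed above can be imitated with a greedy longest-match loop. This is only a simplified sketch of the idea, not the algorithm the TFDS encoder actually uses; `greedy_subword_split` and `toy_vocab` are hypothetical names, and the vocabulary is hand-picked to reproduce the seven pieces shown.

```python
def greedy_subword_split(text, vocab):
    """Split text into the longest vocabulary pieces, left to right.

    Falls back to single characters when no vocabulary piece matches.
    """
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No piece matched: emit one character so the loop always advances.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the split printed above.
toy_vocab = {"Ten", "sor", "Fl", "ow", " is ", "cool", "."}
pieces = greedy_subword_split("TensorFlow is cool.", toy_vocab)
print(pieces)

# Joining the pieces recovers the input, as with tokenizer.decode above.
assert "".join(pieces) == "TensorFlow is cool."
```

Because every piece is a substring of the input, concatenating the pieces always restores the original string, which is the property the `assert` in the tutorial code relies on.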
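BUFFER_SIZE = 10000
BATCH_SIZE = 64
# Shuffle the training examples, then pad each batch to its longest review.
# Dataset.output_shapes is available in the TF 2.0 alpha release used here;
# later releases expose the same information through
# tf.compat.v1.data.get_output_shapes(dataset).
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)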
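
Reviews have different lengths, so `padded_batch` pads the sequences in each batch with zeros up to the longest one in that batch. A minimal pure-Python sketch of that behavior, assuming right-padding with 0 (the helper `pad_batch` and the token ids are hypothetical):

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token-id sequences of different lengths.
batch = pad_batch([[6307, 2327], [9], [2724, 7975, 4043]])
print(batch)  # [[6307, 2327, 0], [9, 0, 0], [2724, 7975, 4043]]
```

After padding, every row has the same length, so the batch can be stored as one rectangular tensor.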

2. Create the model

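Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per token. When called, it converts a sequence of token indices into a sequence of vectors. These vectors are trainable; after training on enough data, words with similar meanings often have similar vectors.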

This index lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
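
To see why the two operations are equivalent, here is a small pure-Python sketch. The table values and the helpers `one_hot` and `dense_no_bias` are made up for illustration; a real embedding layer learns its table during training.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dense_no_bias(vector, weights):
    """Multiply a length-V vector by a V x D weight matrix (no bias, no activation)."""
    return [
        sum(v * row[d] for v, row in zip(vector, weights))
        for d in range(len(weights[0]))
    ]


# Hypothetical embedding table: 4 tokens, 2 dimensions each.
embedding_table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

word_index = 2
looked_up = embedding_table[word_index]  # one row read
multiplied = dense_no_bias(one_hot(word_index, len(embedding_table)), embedding_table)

assert looked_up == multiplied
print(looked_up)  # [0.5, 0.6]
```

The lookup reads a single row, while the one-hot product touches every row of the table only to multiply all but one of them by zero. With a vocabulary of 8185 tokens that difference is substantial.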

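A recurrent neural network (RNN) processes sequence input by iterating over its elements. At each timestep the RNN feeds its output back into its own input, so the state computed at one timestep is carried into the next.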
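
The recurrence can be sketched with a single hidden unit. This is a toy illustration, not the LSTM used later in the tutorial; the function name `simple_rnn` and the weights `w_x`, `w_h`, and `bias` are arbitrary assumptions.

```python
import math


def simple_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers."""
    state = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        state = math.tanh(w_x * x + w_h * state + bias)
        states.append(state)
    return states


states = simple_rnn([1.0, 0.0, -1.0])
print(states)
```

Note that the second input is 0.0, yet the second state is nonzero: it carries information forward from the first timestep.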

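The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. It propagates the input forward and backward through the RNN layer and then concatenates the outputs. This helps the RNN learn long-range dependencies.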
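
A rough sketch of the bidirectional idea with a one-unit toy cell follows. The helpers `last_state` and `bidirectional_last_state` and all weights are hypothetical; the point is only that the sequence is read once in each direction, with separate weights per direction, and the two results are concatenated.

```python
import math


def last_state(inputs, w_x, w_h):
    """Run a one-unit recurrent cell and return only its final state."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_x * x + w_h * state)
    return state


def bidirectional_last_state(inputs, forward_weights, backward_weights):
    """Read the sequence in both directions and concatenate the final states."""
    forward = last_state(inputs, *forward_weights)
    backward = last_state(list(reversed(inputs)), *backward_weights)
    return [forward, backward]


# Arbitrary example weights: (w_x, w_h) for each direction.
output = bidirectional_last_state([1.0, 0.0, -1.0], (0.5, 0.8), (0.3, 0.6))
print(output)
```

The concatenated output is twice as wide as a single direction, which is why wrapping an `LSTM(64)` layer yields 128 output features under the default concatenation.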

model = tf.keras.Sequential([
 tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
 tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
 tf.keras.layers.Dense(64, activation='relu'),
 tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the Keras model to configure the training process:
model.compile(loss='binary_crossentropy',
 optimizer='adam',
 metrics=['accuracy'])

3. Train the model

history = model.fit(train_dataset, epochs=10,
 validation_data=test_dataset)
 ...
 Epoch 10/10
 391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
 391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873Test Loss: 0.553319326714
 Test Accuracy: 0.787320017815

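The model above does not mask the padding applied to the sequences. Training on padded sequences and testing on unpadded ones can therefore introduce skew. Ideally the model would learn to ignore the padding, but as you can see below, padding does have a small effect on the output.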
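
To illustrate what masking means, the sketch below averages hypothetical per-token scores with and without skipping padded positions. It is a conceptual toy, not what the model above computes; `mean_score` and the numbers are assumptions. In Keras, `tf.keras.layers.Embedding` accepts a `mask_zero` argument for this purpose, which the model in this tutorial does not set.

```python
def mean_score(scores, token_ids, pad_id=0, use_mask=False):
    """Average per-token scores, optionally skipping padded positions."""
    if use_mask:
        scores = [s for s, t in zip(scores, token_ids) if t != pad_id]
    return sum(scores) / len(scores)


# Hypothetical per-token scores for a 3-token review padded to length 6.
token_ids = [12, 7, 45, 0, 0, 0]
scores = [0.9, 0.6, 0.9, 0.0, 0.0, 0.0]

unmasked = mean_score(scores, token_ids)               # padding drags the average down
masked = mean_score(scores, token_ids, use_mask=True)  # only the 3 real tokens count
print(unmasked, masked)
```

With the mask, the result no longer depends on how much padding was appended.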

If the prediction is >= 0.5, the review is classified as positive; otherwise it is negative.
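
The decision rule can be written as a small helper. `sentiment_label` is a hypothetical name, and the two inputs are prediction values printed elsewhere in this tutorial.

```python
def sentiment_label(prediction, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if prediction >= threshold else "negative"


print(sentiment_label(0.68914342))  # positive
print(sentiment_label(0.00393916))  # negative
```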

def pad_to_size(vec, size):
    # Right-pad the token list with zeros up to `size` (extends the list in place).
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sentence, pad):
    # Encode the sentence, optionally pad it to 64 tokens,
    # and run the model on a batch of one.
    tokenized_sample_pred_text = tokenizer.encode(sentence)
    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)
    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions
# Predict on a sample text without padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
 'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)
 [[ 0.68914342]]
# Predict on a sample text with padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
 'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)
 [[ 0.68634349]]
plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

4. Stack two or more LSTM layers

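Keras recurrent layers have two available modes, controlled by the return_sequences constructor argument:

Return the full sequence of successive outputs for each timestep (return_sequences=True): a 3D tensor of shape (batch_size, timesteps, output_features).
Return only the last output for each input sequence (return_sequences=False, the default): a 2D tensor of shape (batch_size, output_features).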
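
The two output shapes can be illustrated with a toy one-unit recurrent function over nested lists. This is only a sketch: `toy_rnn` and its weights are hypothetical, and real Keras layers operate on tensors.

```python
import math


def toy_rnn(batch, return_sequences=False, w_x=0.5, w_h=0.8):
    """batch: list of sequences of floats. Uses a single hidden unit."""
    outputs = []
    for sequence in batch:
        state = 0.0
        states = []
        for x in sequence:
            state = math.tanh(w_x * x + w_h * state)
            states.append([state])  # output_features == 1
        # Keep every timestep, or only the final one.
        outputs.append(states if return_sequences else states[-1])
    return outputs


batch = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]   # batch_size=2, timesteps=3
full = toy_rnn(batch, return_sequences=True)   # nested shape (2, 3, 1)
last = toy_rnn(batch)                          # nested shape (2, 1)

print(len(full), len(full[0]), len(full[0][0]))  # 2 3 1
print(len(last), len(last[0]))                   # 2 1
```

When layers are stacked, every recurrent layer except the last needs `return_sequences=True`, because the next recurrent layer expects a full sequence (a 3D input) rather than a single vector per example.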

model = tf.keras.Sequential([
 tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
 tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
 64, return_sequences=True)),
 tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
 tf.keras.layers.Dense(64, activation='relu'),
 tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
 optimizer='adam',
 metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
 validation_data=test_dataset)
 ...
 Epoch 10/10
 391/391 [==============================] - 154s 394ms/step - loss: 0.1120 - accuracy: 0.9643 - val_loss: 0.5646 - val_accuracy: 0.8070
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
 391/Unknown - 45s 115ms/step - loss: 0.5646 - accuracy: 0.8070Test Loss: 0.564571284348
 Test Accuracy: 0.80703997612
# Predict on a sample text without padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
 'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)
 [[ 0.00393916]]
# Predict on a sample text with padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
 'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)
 [[ 0.01098633]]
plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

You can also look at the other existing recurrent layers, such as GRU layers.
