Using TensorFlow for Text Processing: Word2Vec Prediction

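Code source: TensorFlow Machine Learning Cookbook (Chinese edition translated by Zeng Yiqiang (曾益强), September 2017), Chapter 7: Natural Language Processing

Code: https://github.com/nfmcclure/tensorflow-cookbook

Data: http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz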

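Problem: load pretrained word embeddings and use them for sentiment analysis, training a logistic regression model on the averaged embeddings to predict whether a movie review is positive or negative.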

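The steps are:

Required packages
Declare model parameters
Read and transform the text dataset; split it into training and test sets
Build the graph
Train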

step1: Required packages

import tensorflow as tf

import matplotlib.pyplot as plt

import numpy as np

import random

import os

import pickle

import string

import requests

import collections

import io

import tarfile

import urllib.request

import text_helpers

from nltk.corpus import stopwords

from tensorflow.python.framework import ops

ops.reset_default_graph()

os.chdir(os.path.dirname(os.path.realpath(__file__)))

# Start a graph session

sess = tf.Session()

step2: Declare model parameters

# Declare model parameters

embedding_size = 200

vocabulary_size = 2000

batch_size = 100

max_words = 100

# Declare stop words

stops = stopwords.words('english')

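step3: Read and transform the text dataset; split it into training and test sets

Reference: the companion post "Using TensorFlow for Text Processing: Helper Functions", which covers the text_helpers module.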

# Load Data

print('Loading Data')

data_folder_name = 'temp'

texts, target = text_helpers.load_movie_data(data_folder_name)

# Normalize text

print('Normalizing Text Data')

texts = text_helpers.normalize_text(texts, stops)
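
The helper `text_helpers.normalize_text` comes from the companion helper-functions post and is not reproduced here. As a rough sketch of the kind of cleanup such a function typically performs (lowercasing, stripping punctuation and digits, dropping stop words, collapsing whitespace), one might write something like the following. The name `normalize_text_sketch` and the tiny stop-word list are illustrative assumptions, not the helper's actual code.

```python
import string


def normalize_text_sketch(texts, stops):
    """Hypothetical stand-in for text_helpers.normalize_text (assumed behavior)."""
    stops = set(stops)
    cleaned = []
    for text in texts:
        text = text.lower()
        # Drop punctuation and digits character by character
        text = ''.join(c for c in text
                       if c not in string.punctuation and c not in string.digits)
        # Remove stop words; split/join also collapses runs of whitespace
        words = [w for w in text.split() if w not in stops]
        cleaned.append(' '.join(words))
    return cleaned


demo = normalize_text_sketch(["The movie was GREAT, 10 out of 10!"],
                             ["the", "was", "of"])
print(demo)
```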

# Texts must contain at least 3 words

target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]

texts = [x for x in texts if len(x.split()) > 2]

# Split up data set into train/test

train_indices = np.random.choice(len(target), round(0.8*len(target)), replace=False)

test_indices = np.array(list(set(range(len(target))) - set(train_indices)))

texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]

texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]

target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])

target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
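
The split above checks `ix in train_indices` against a NumPy array for every row, which is a linear scan each time. A possible alternative, sketched here under the assumption that only the 80/20 proportion and the original row order matter, shuffles the indices once and slices them. The function name `split_train_test` and the `seed` parameter are hypothetical and not part of the original code.

```python
import numpy as np


def split_train_test(texts, target, train_frac=0.8, seed=0):
    """Illustrative helper (not in the original post): shuffle once, then slice."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(target))
    n_train = int(round(train_frac * len(target)))
    # Sort each part so rows keep their original relative order
    train_ix = np.sort(order[:n_train])
    test_ix = np.sort(order[n_train:])
    target = np.asarray(target)
    texts_train = [texts[i] for i in train_ix]
    texts_test = [texts[i] for i in test_ix]
    return texts_train, texts_test, target[train_ix], target[test_ix]


texts_demo = ['review %d' % i for i in range(10)]
target_demo = [i % 2 for i in range(10)]
tr_x, te_x, tr_y, te_y = split_train_test(texts_demo, target_demo)
print(len(tr_x), len(te_x))
```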

# Load dictionary and embedding matrix (the word dictionary saved alongside the CBOW embeddings)

dict_file = os.path.join(data_folder_name, 'movie_vocab.pkl')

# Use a context manager so the file handle is closed after loading
with open(dict_file, 'rb') as f:
    word_dictionary = pickle.load(f)

# Convert texts to lists of indices (map each sentence to word indices using the dictionary)

text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))

text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))

# Pad/crop movie reviews to a fixed length: reviews shorter than max_words (100) are padded with 0, longer ones keep only the first 100 indices

text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])

text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])
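
The pad-then-slice idiom above can be easier to follow on a toy input. The sketch below applies the same idea to two short lists; the function name `pad_or_crop` is made up for illustration.

```python
def pad_or_crop(seq, max_words, pad_value=0):
    """Right-pad with pad_value, then keep only the first max_words entries."""
    return (list(seq) + [pad_value] * max_words)[:max_words]


short_review = pad_or_crop([5, 8, 2], 5)
long_review = pad_or_crop([1, 2, 3, 4, 5, 6, 7], 5)
print(short_review)  # [5, 8, 2, 0, 0]
print(long_review)   # [1, 2, 3, 4, 5]
```

Note that the padding value 0 is itself a valid row index into the embedding matrix, so padded positions still contribute an embedding vector to the document average computed later.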

step4: Build the graph

print('Creating Model')

# Define Embeddings: create the embedding variable that will later be overwritten with the CBOW-trained vectors

embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Define model:

# Create variables for logistic regression

A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))

b = tf.Variable(tf.random_normal(shape=[1,1]))

# Initialize placeholders for the input data

x_data = tf.placeholder(shape=[None, max_words], dtype=tf.int32)

y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Lookup embeddings vectors

embed = tf.nn.embedding_lookup(embeddings, x_data)

# Take average of all word embeddings in each document (axis 1 is the word position)

embed_avg = tf.reduce_mean(embed, 1)
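
To see what the lookup and the average do to the tensor shapes, here is a small NumPy sketch that mirrors the two operations with plain array indexing. The toy sizes (6 words, 4 dimensions) and the random matrix are assumptions chosen for readability, not the post's 2000 x 200 embeddings.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, emb_size = 6, 4                      # toy sizes for illustration
emb = rng.uniform(-1.0, 1.0, size=(vocab_size, emb_size))

# 2 documents, each already padded/cropped to 4 word indices
x = np.array([[1, 3, 0, 0],
              [2, 2, 5, 4]])

embed = emb[x]                   # shape (2, 4, 4): one vector per word position
embed_avg = embed.mean(axis=1)   # shape (2, 4): one vector per document
print(embed.shape, embed_avg.shape)
```

The first document shows that the two padding zeros are averaged in together with the real words, which is a property of this simple averaging scheme.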

# Declare logistic model (sigmoid in loss function)

model_output = tf.add(tf.matmul(embed_avg, A), b)

# Declare loss function (Cross Entropy loss)

# TensorFlow 1.x requires logits and labels to be passed as keyword arguments
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))
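
The loss takes raw logits rather than probabilities because the sigmoid is folded into the loss in a numerically stable form, max(x, 0) - x*z + log(1 + exp(-|x|)), where x is the logit and z the label. The NumPy sketch below compares that form with the textbook cross-entropy on a few hand-picked values; the function names are illustrative.

```python
import numpy as np


def sigmoid_xent(logits, labels):
    """Stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))


def naive_xent(logits, labels):
    """Textbook form: -(z*log(p) + (1-z)*log(1-p)) with p = sigmoid(x)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    z = np.asarray(labels, dtype=float)
    return -(z * np.log(p) + (1 - z) * np.log(1 - p))


logits = np.array([[-2.0], [0.0], [3.0]])
labels = np.array([[0.0], [1.0], [1.0]])
mean_loss = sigmoid_xent(logits, labels).mean()
print(mean_loss)
```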

# Actual Prediction

prediction = tf.round(tf.sigmoid(model_output))

predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)

accuracy = tf.reduce_mean(predictions_correct)
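
Rounding the sigmoid output is the same as thresholding the logit at zero whenever the logit is not exactly zero. The following sketch works through four made-up logits to show how the prediction and accuracy are derived.

```python
import numpy as np

logits = np.array([[-1.2], [0.3], [2.0], [-0.1]])   # made-up model outputs
y_true = np.array([[0.0], [1.0], [0.0], [0.0]])     # made-up labels

prob = 1.0 / (1.0 + np.exp(-logits))   # sigmoid turns logits into probabilities
pred = (logits > 0).astype(float)      # same result as rounding prob here
accuracy = (pred == y_true).mean()     # fraction of correct predictions
print(accuracy)  # 0.75
```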

# Declare optimizer

my_opt = tf.train.AdagradOptimizer(0.005)

train_step = my_opt.minimize(loss)
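
Adagrad scales each parameter's step by the root of its accumulated squared gradients, so frequently updated parameters take smaller steps over time. The sketch below shows the textbook update on a toy quadratic loss. It is not TensorFlow's implementation, and the starting accumulator value of 0.1 is an assumption made for this example.

```python
import numpy as np


def adagrad_step(w, grad, accum, lr=0.005):
    """One textbook Adagrad update: accumulate grad**2, divide the step by its root."""
    accum = accum + grad ** 2
    w = w - lr * grad / np.sqrt(accum)
    return w, accum


w = np.array([1.0, -2.0])
accum = np.full_like(w, 0.1)      # assumed starting accumulator value
for _ in range(3):
    grad = 2.0 * w                # gradient of the toy loss sum(w**2)
    w, accum = adagrad_step(w, grad, accum)
print(w)
```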

step5: Train

# Initialize Variables (tf.initialize_all_variables is deprecated)

init = tf.global_variables_initializer()

sess.run(init)

# Load model embeddings (restore the embedding matrix trained by CBOW)

model_checkpoint_path = os.path.join(data_folder_name,'cbow_movie_embeddings.ckpt')

saver = tf.train.Saver({"embeddings": embeddings})

saver.restore(sess, model_checkpoint_path)

# Start Logistic Regression

print('Starting Model Training')

train_loss = []

test_loss = []

train_acc = []

test_acc = []

i_data = []

for i in range(10000):

    rand_index = np.random.choice(text_data_train.shape[0], size=batch_size)

    rand_x = text_data_train[rand_index]

    rand_y = np.transpose([target_train[rand_index]])

    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})

    # Only record loss and accuracy every 100 generations

    if (i+1)%100==0:

        i_data.append(i+1)

        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})

        train_loss.append(train_loss_temp)

        test_loss_temp = sess.run(loss, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})

        test_loss.append(test_loss_temp)

        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})

        train_acc.append(train_acc_temp)

        test_acc_temp = sess.run(accuracy, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})

        test_acc.append(test_acc_temp)

    if (i+1)%500==0:

        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]

        acc_and_loss = [np.round(x,2) for x in acc_and_loss]

        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
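
Each training step draws a random mini-batch and reshapes the labels into a column so they match the `[None, 1]` placeholder. The sketch below reproduces just that sampling step on toy arrays; the data is made up. Since `np.random.choice` samples with replacement by default, a batch may contain the same row more than once.

```python
import numpy as np

rng = np.random.RandomState(0)
x_all = np.arange(20).reshape(10, 2)    # 10 toy "documents", 2 indices each
y_all = np.arange(10) % 2               # toy 0/1 labels

batch_size = 4
rand_index = rng.choice(x_all.shape[0], size=batch_size)  # replace=True by default
rand_x = x_all[rand_index]                                 # shape (4, 2)
rand_y = np.transpose([y_all[rand_index]])                 # shape (4, 1), a column
print(rand_x.shape, rand_y.shape)
```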

Visualizing the results:

# Plot loss over time

plt.plot(i_data, train_loss, 'k-', label='Train Loss')

plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)

plt.title('Cross Entropy Loss per Generation')

plt.xlabel('Generation')

plt.ylabel('Cross Entropy Loss')

plt.legend(loc='upper right')

plt.show()

# Plot train and test accuracy

plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')

plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)

plt.title('Train and Test Accuracy')

plt.xlabel('Generation')

plt.ylabel('Accuracy')

plt.legend(loc='lower right')

plt.show()
