tensorflow在文本处理中的使用—

代码来源于：tensorflow机器学习实战指南（曾益强译，2017年9月）——第七章：自然语言处理

代码地址：https://github.com/nfmcclure/tensorflow-cookbook

数据：http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

在word2vec的方法中，处理的是单词之间的上下文关系，但是没有考虑单词和单词所在文档之间的关系

word2vec方法的拓展之一就是doc2vec方法，其考虑文档的影响。

算法思想

引入文档嵌套，连同单词嵌套一起帮助判断文档的情感色彩

如何结合？

1.文档嵌套和单词嵌套相加（要求文档嵌套的大小和单词嵌套大小相同）

2.文档嵌套追加在单词嵌套后面（需要增加变量数目）

一般而言，对于小数据集，两种嵌套相加是更好的选择

步骤

数据准备处理工作
计算出文档嵌套和单词嵌套
训练逻辑回归模型

1. 数据准备处理工作

1.1 导入必要的包

1.2 加载数据

1.3 声明模型参数

1.4 归一化影评，确保影评长度大于指定窗口大小

1.5 创建单词词典（无需建立文档字典，每个文档均有唯一的索引值）

import tensorflow as tf

import matplotlib.pyplot as plt

import numpy as np

import random

import os

import pickle

import string

import requests

import collections

import io

import tarfile

import urllib.request

import text_helpers

from nltk.corpus import stopwords

from tensorflow.python.framework import ops

ops.reset_default_graph()

os.chdir(os.path.dirname(os.path.realpath(__file__)))

# Make a saving directory if it doesn't exist

data_folder_name = 'temp'

if not os.path.exists(data_folder_name):

    os.makedirs(data_folder_name)

# Start a graph session

sess = tf.Session()

# Declare model parameters

batch_size = 500

vocabulary_size = 7500

generations = 100000

model_learning_rate = 0.001

embedding_size = 200   # Word embedding size

doc_embedding_size = 100   # Document embedding size

concatenated_size = embedding_size + doc_embedding_size

num_sampled = int(batch_size/2)    # Number of negative examples to sample.

window_size = 3       # How many words to consider to the left.

# Add checkpoints to training

save_embeddings_every = 5000

print_valid_every = 5000

print_loss_every = 100

# Declare stop wordsstops = stopwords.words('english')# We pick a few test words for validation.

valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']

# Later we will have to transform these into indices

# Load the movie review data

print('Loading Data')

texts, target = text_helpers.load_movie_data(data_folder_name)

# Normalize text

print('Normalizing Text Data')

texts = text_helpers.normalize_text(texts, stops)

# Texts must contain at least 3 words

target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > window_size]

texts = [x for x in texts if len(x.split()) > window_size]

assert(len(target)==len(texts))

# Build our data set and dictionaries

print('Creating Dictionary')

word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size)

word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))

text_data = text_helpers.text_to_numbers(texts, word_dictionary)

# Get validation word keys待验证单词集

valid_examples = [word_dictionary[x] for x in valid_words]

2. 计算出文档嵌套和单词嵌套

2.1 定义文档嵌套和单词嵌套

print('Creating Model')

# Define Embeddings:

embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

doc_embeddings = tf.Variable(tf.random_uniform([len(texts), doc_embedding_size], -1.0, 1.0))

# NCE loss parameters

nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, concatenated_size], stddev=1.0 / np.sqrt(concatenated_size)))

nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

2.2 声明doc2vec索引和目标单词索引的占位符（输入索引大小是窗口大小加1，因为有一个额外的文档索引）

# Create data/target placeholders

x_inputs = tf.placeholder(tf.int32, shape=[None, window_size + 1]) # plus 1 for doc index

y_target = tf.placeholder(tf.int32, shape=[None, 1])

valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

2.3 解释产生批量数据时输入多一维文档索引

参考：tensorflow在文本处理中的使用——辅助函数

# 和skip-gram，CBOW一样

>>> rand_sentence=[2520, 1421, 146, 1215, 5, 468, 12, 14, 18, 20]

>>> window_size = 3

>>> window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]

>>> label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]

>>> window_sequences

[[2520, 1421, 146, 1215], [2520, 1421, 146, 1215, 5], [2520, 1421, 146, 1215, 5, 468], [2520, 1421, 146, 1215, 5, 468, 12], [1421, 146, 1215, 5, 468, 12, 14], 
[146, 1215, 5, 468, 12, 14, 18], [1215, 5, 468, 12, 14, 18, 20], [5, 468, 12, 14, 18, 20], [468, 12, 14, 18, 20], [12, 14, 18, 20]]

>>> label_indices

[0, 1, 2, 3, 3, 3, 3, 3, 3, 3]

# 左边窗口词和目标词

>>> batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)]

>>> batch_and_labels

[([2520, 1421, 146], 1215), ([1421, 146, 1215], 5), ([146, 1215, 5], 468), ([1215, 5, 468], 12), ([5, 468, 12], 14), ([468, 12, 14], 18), ([12, 14, 18], 20)]

>>> batch, labels = [list(x) for x in zip(*batch_and_labels)]

>>> batch

[[2520, 1421, 146], [1421, 146, 1215], [146, 1215, 5], [1215, 5, 468], [5, 468, 12], [468, 12, 14], [12, 14, 18]]

>>> labels

[1215, 5, 468, 12, 14, 18, 20]

# 155代表的是rand_sentence_ix，把文档索引加入

>>> batch = [x + [155] for x in batch]

>>> batch

[[2520, 1421, 146, 155], [1421, 146, 1215, 155], [146, 1215, 5, 155], [1215, 5, 468, 155], [5, 468, 12, 155], [468, 12, 14, 155], [12, 14, 18, 155]]

2.4 创建嵌套函数将单词嵌套求和，然后连接文档嵌套

# 创建嵌套函数将单词嵌套求和，然后连接文档嵌套

embed = tf.zeros([batch_size, embedding_size])

for element in range(window_size):

    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])

doc_indices = tf.slice(x_inputs, [0,window_size],[batch_size,1])

doc_embed = tf.nn.embedding_lookup(doc_embeddings,doc_indices)

# concatenate embeddings

final_embed = tf.concat(1, [embed, tf.squeeze(doc_embed)])

函数参考：tf.slice，tf.concat，tf.squeeze

上述计算过程实例展示

In [1]: import tensorflow as tf

In [2]: batch_size=4

In [3]: embedding_size=2

In [4]: window_size=3

In [5]: vocabulary_size=5

In [6]: embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

In [7]: x_inputs = tf.constant([[1,2,3,4],[2,1,3,4],[0,1,2,4],[0,2,1,4]])

In [8]: doc_embedding_size=6

In [9]: doc_embeddings=tf.Variable(tf.random_uniform([5, doc_embedding_size], -1.0, 1.0))

# 输入上述operation

In [17]: init = tf.global_variables_initializer()

In [18]: sess.run(init)

In [19]: sess.run(embeddings)

Out[19]:

array([[ 0.98145223, -0.83131742],

       [-0.52276659, -0.29783845],

       [ 0.46809649,  0.044348  ],

       [ 0.31654406, -0.46324134],

       [ 0.75828886, -0.99072552]], dtype=float32)

In [20]: sess.run(doc_embeddings)

Out[20]:

array([[ 0.9781754 , -0.06880736, -0.2043252 ,  0.26498127, -0.5746839 ,

         0.98072433],

       [-0.38463736,  0.79673767,  0.43830776,  0.85574818,  0.56155062,

        -0.18029761],

       [-0.72975278,  0.39068675, -0.72180915, -0.04128242,  0.00453711,

         0.35815096],

       [ 0.66498852,  0.07083178, -0.42824841, -0.12427211,  0.35502028,

         0.92180991],

       [ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,

         0.52539349]], dtype=float32)

In [22]: sess.run(x_inputs)

Out[22]:

array([[1, 2, 3, 4],

       [2, 1, 3, 4],

       [0, 1, 2, 4],

       [0, 2, 1, 4]], dtype=int32)

In [23]: sess.run(embed)

Out[23]:

array([[ 0.26187396, -0.71673179],

       [ 0.26187396, -0.71673179],

       [ 0.92678213, -1.08480787],

       [ 0.92678213, -1.08480787]], dtype=float32)

In [24]: sess.run(doc_indices)

Out[24]:

array([[4],

       [4],

       [4],

       [4]], dtype=int32)

In [25]: sess.run(doc_embed)

Out[25]:

array([[[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,

          0.52539349]],

       [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,

          0.52539349]],

       [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,

          0.52539349]],

       [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,

          0.52539349]]], dtype=float32)

2.5 声明loss和optimizer

# Get loss from prediction

loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, final_embed, y_target, num_sampled, vocabulary_size))

# Create optimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate)

train_step = optimizer.minimize(loss)

2.6 声明验证单词集的余弦距离

# Cosine similarity between words

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))

normalized_embeddings = embeddings / norm

valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)

similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

2.7 保存嵌套数据&训练

# Create model saving operation

saver = tf.train.Saver({"embeddings": embeddings, "doc_embeddings": doc_embeddings})

#Add variable initializer.

init = tf.initialize_all_variables()

sess.run(init)

# Run the skip gram model.

print('Starting Training')

loss_vec = []

loss_x_vec = []

for i in range(generations):

    batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, window_size, method='doc2vec')

    feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}

    # Run the train step

    sess.run(train_step, feed_dict=feed_dict)

    # Return the loss

    if (i+1) % print_loss_every == 0:

        loss_val = sess.run(loss, feed_dict=feed_dict)

        loss_vec.append(loss_val)

        loss_x_vec.append(i+1)

        print('Loss at step {} : {}'.format(i+1, loss_val))

    # Validation: Print some random words and top 5 related words

    if (i+1) % print_valid_every == 0:

        sim = sess.run(similarity, feed_dict=feed_dict)

        for j in range(len(valid_words)):

            valid_word = word_dictionary_rev[valid_examples[j]]

            top_k = 5 # number of nearest neighbors

            nearest = (-sim[j, :]).argsort()[1:top_k+1]

            log_str = "Nearest to {}:".format(valid_word)

            for k in range(top_k):

                close_word = word_dictionary_rev[nearest[k]]

                log_str = '{} {},'.format(log_str, close_word)

            print(log_str)

    # Save dictionary + embeddings

    if (i+1) % save_embeddings_every == 0:

        # Save vocabulary dictionary

        with open(os.path.join(data_folder_name,'movie_vocab.pkl'), 'wb') as f:

            pickle.dump(word_dictionary, f)

        # Save embeddings

        model_checkpoint_path = os.path.join(os.getcwd(),data_folder_name,'doc2vec_movie_embeddings.ckpt')

        save_path = saver.save(sess, model_checkpoint_path)

        print('Model saved in file: {}'.format(save_path))

结果展示：

3. 训练逻辑回归模型

3.1 参数

# Start logistic model-------------------------

max_words = 20

logistic_batch_size = 500

3.2 分割数据集

# Split dataset into train and test sets

# Need to keep the indices sorted to keep track of document index

train_indices = np.sort(np.random.choice(len(target), round(0.8*len(target)), replace=False))

test_indices = np.sort(np.array(list(set(range(len(target))) - set(train_indices))))

texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]

texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]

target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])

target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])

3.3 以20个单词表示每条影评

# Convert texts to lists of indices

text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))

text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))

# Pad/crop movie reviews to specific length

text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])

text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])

3.4 创建另一个嵌套函数，前面的嵌套函数是训练三个单词窗口和文档索引预测最近的单词，这里类似，不同的是训练20个单词的影评

# Define logistic embedding lookup (needed if we have two different batch sizes)

# Add together element embeddings in window:

log_embed = tf.zeros([logistic_batch_size, embedding_size])

for element in range(max_words):

    log_embed += tf.nn.embedding_lookup(embeddings, log_x_inputs[:, element])

log_doc_indices = tf.slice(log_x_inputs, [0,max_words],[logistic_batch_size,1])

log_doc_embed = tf.nn.embedding_lookup(doc_embeddings,log_doc_indices)

# concatenate embeddings

log_final_embed = tf.concat(1, [log_embed, tf.squeeze(log_doc_embed)])

3.5 构建逻辑回归模型的训练图

# Define Logistic placeholders

log_x_inputs = tf.placeholder(tf.int32, shape=[None, max_words + 1]) # plus 1 for doc index

log_y_target = tf.placeholder(tf.int32, shape=[None, 1])

# Define model:

# Create variables for logistic regression

A = tf.Variable(tf.random_normal(shape=[concatenated_size,1]))

b = tf.Variable(tf.random_normal(shape=[1,1]))

# Declare logistic model (sigmoid in loss function)

model_output = tf.add(tf.matmul(log_final_embed, A), b)

# Declare loss function (Cross Entropy loss)

logistic_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, tf.cast(log_y_target, tf.float32)))

# Actual Prediction

prediction = tf.round(tf.sigmoid(model_output))

predictions_correct = tf.cast(tf.equal(prediction, tf.cast(log_y_target, tf.float32)), tf.float32)

accuracy = tf.reduce_mean(predictions_correct)

# Declare optimizer

logistic_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)

logistic_train_step = logistic_opt.minimize(logistic_loss, var_list=[A, b])

3.6 训练

# Intitialize Variables

init = tf.global_variables_initializer()

sess.run(init)

# Start Logistic Regression

print('Starting Logistic Doc2Vec Model Training')

train_loss = []

test_loss = []

train_acc = []

test_acc = []

i_data = []

for i in range(10000):

    rand_index = np.random.choice(text_data_train.shape[0], size=logistic_batch_size)

    rand_x = text_data_train[rand_index]

    # Append review index at the end of text data

    rand_x_doc_indices = train_indices[rand_index]

    rand_x = np.hstack((rand_x, np.transpose([rand_x_doc_indices])))

    rand_y = np.transpose([target_train[rand_index]])

    feed_dict = {log_x_inputs : rand_x, log_y_target : rand_y}

    sess.run(logistic_train_step, feed_dict=feed_dict)

    # Only record loss and accuracy every 100 generations

    if (i+1)%100==0:

        rand_index_test = np.random.choice(text_data_test.shape[0], size=logistic_batch_size)

        rand_x_test = text_data_test[rand_index_test]

        # Append review index at the end of text data

        rand_x_doc_indices_test = test_indices[rand_index_test]

        rand_x_test = np.hstack((rand_x_test, np.transpose([rand_x_doc_indices_test])))

        rand_y_test = np.transpose([target_test[rand_index_test]])

        test_feed_dict = {log_x_inputs: rand_x_test, log_y_target: rand_y_test}

        i_data.append(i+1)

        train_loss_temp = sess.run(logistic_loss, feed_dict=feed_dict)

        train_loss.append(train_loss_temp)

        test_loss_temp = sess.run(logistic_loss, feed_dict=test_feed_dict)

        test_loss.append(test_loss_temp)

        train_acc_temp = sess.run(accuracy, feed_dict=feed_dict)

        train_acc.append(train_acc_temp)

        test_acc_temp = sess.run(accuracy, feed_dict=test_feed_dict)

        test_acc.append(test_acc_temp)

    if (i+1)%500==0:

        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]

        acc_and_loss = [np.round(x,2) for x in acc_and_loss]

        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

结果：

可视化代码参考：tensorflow在文本处理中的使用——Word2Vec预测

tensorflow在文本处理中的使用——Doc2Vec情感分析的更多相关文章

tensorflow在文本处理中的使用——Word2Vec预测
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——CBOW词嵌入模型
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——skip-gram模型
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——TF-IDF算法
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——词袋
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——辅助函数
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——skip-gram & CBOW原理总结
摘自:http://www.cnblogs.com/pinard/p/7160330.html 先看下列三篇,再理解此篇会更容易些(个人意见) skip-gram,CBOW,Word2Vec 词向量基 ...
[Algorithm & NLP] 文本深度表示模型——word2vec&doc2vec词向量模型
深度学习掀开了机器学习的新篇章,目前深度学习应用于图像和语音已经产生了突破性的研究进展.深度学习一直被人们推崇为一种类似于人脑结构的人工智能算法,那为什么深度学习在语义分析领域仍然没有实质性的进展呢? ...
TensorFlow实现文本情感分析详解
http://c.biancheng.net/view/1938.html 前面我们介绍了如何将卷积网络应用于图像.本节将把相似的想法应用于文本. 文本和图像有什么共同之处?乍一看很少.但是,如果将句 ...

随机推荐

KDD2016,Accepted Papers
RESEARCH TRACK PAPERS - ORAL Title & Authors NetCycle: Collective Evolution Inference in Heterog ...
Linux与Unix shell编程指南(完整高清版).pdf
找到一本很详细的Linux Shell脚本教程,其实里面不光讲了Shell脚本编程,还介绍了系统的各种命令 http://vdisk.weibo.com/s/yVBlEojGMQMpv 本书共分五部分 ...
自学FPAG笔记之 " top_down “
top_town设计:在FPGA中top_down(自顶向上)是十分重要的一种编程方法,优点:使用top_down方法去写代码会使得程序看起来十分简洁,缺点:top_down写的文件会特别多. 例子: ...
JS 寄生继承
寄生构造函数寄生构造函数的用途目的:给String内置对象功能扩充稳妥的构造函数继承对象冒充继承一般继承组合继承原型链继承:借助于中转函数和已有对象寄生式继承:把原型式+工厂式结合而来 ...
IOS 第三方管理库管理 CocoaPods
CocoaPod集成Tips http://www.jianshu.com/p/dcde0668eee9 import导入类失败 http://www.360doc.com/content/15/03 ...
2018-11-21-WPF-解决-ViewBox--不显示线的问题
title author date CreateTime categories WPF 解决 ViewBox 不显示线的问题 lindexi 2018-11-21 09:37:53 +0800 201 ...
Inno Setup 设置开机启动
案1: 将快捷方式添加到“启动”文件夹 [Tasks] Name: "startupicon"; Description: "{cm:CreateQuickLaunchI ...
洛谷 P1447 [NOI2010]能量采集（莫比乌斯反演）
题意:问题可以转化成求$\sum_{i=1}^{n}\sum_{j=1}^{m}(2*gcd(i,j)-1)$ 将2和-1提出来可以得到:$2*\sum_{i=1}^{n}\sum_{j=1}^{m} ...
Directx11学习笔记【八】龙书D3DApp的实现
原文:Directx11学习笔记[八] 龙书D3DApp的实现 directx11龙书中的初始化程序D3DApp跟我们上次写的初始化程序大体一致,只是包含了计时器的内容,而且使用了深度模板缓冲. D3 ...
IDEA-创建WEB项目并部署Tomcat
一.创建简单web项目 1.创建一个web project File -> new Project ->选择project sdk 为1.6(如果没有sdk的同学请先配置)-> Ne ...

tensorflow在文本处理中的使用——Doc2Vec情感分析

tensorflow在文本处理中的使用——Doc2Vec情感分析的更多相关文章

随机推荐

热门专题