Detecting negative movie reviews with an LSTM: naive Bayes reaches only 51% accuracy, while the LSTM reaches 99%

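Basic idea:

Take the first 200 words of each review, build a vocabulary from the corpus, and represent each review by the vocabulary indices of those 200 words. Then train an LSTM to classify the reviews as positive or negative.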
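The encoding step can be sketched in plain Python. This is only an illustration of the idea, not the VocabularyProcessor implementation used later in the post; the helper names `build_vocab` and `encode` are made up for this example, and the maximum length is shrunk from 200 to 6 to keep the output readable.

```python
# Minimal sketch: build a vocabulary and encode each review as a
# fixed-length list of word indices. All names here are illustrative.
MAX_LEN = 6  # the article uses 200; a small value keeps the demo readable


def build_vocab(docs):
    """Give each distinct word an index starting at 1; 0 is kept for padding/unknown."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(doc, vocab, max_len=MAX_LEN):
    """Keep the first max_len words, map them to indices, pad the tail with 0."""
    ids = [vocab.get(word, 0) for word in doc.split()[:max_len]]
    return ids + [0] * (max_len - len(ids))


docs = ["the film is all good", "the film is simply too jumbled to follow"]
vocab = build_vocab(docs)
encoded = [encode(d, vocab) for d in docs]
print(encoded)
```

The first review is shorter than 6 words and gets padded with 0; the second is longer and loses everything after its sixth word.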

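See the comments for a walkthrough of the code. The embedding layer plays much the same role as word2vec: it maps each word index to a dense, low-dimensional vector, which amounts to dimensionality reduction. Not every LSTM needs this layer; for image input such as MNIST, for example, it is absent.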
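What an embedding layer does at lookup time can be shown with a small NumPy sketch. The sizes below are toy values chosen for illustration (the model in this post uses n_words rows and 128 columns), and the table is random here, whereas in the real network it is learned during training.

```python
import numpy as np

# Illustrative sizes only; the article's model uses n_words x 128.
vocab_size, embed_dim = 10, 4

rng = np.random.RandomState(0)
# The embedding "layer" is a table with one row (vector) per word index.
embedding_table = rng.randn(vocab_size, embed_dim)

# A batch of 2 encoded reviews, each 5 indices long (0 is padding).
batch = np.array([[1, 2, 3, 0, 0],
                  [4, 2, 9, 7, 1]])

# Indexing the table with the batch turns (batch, length) into
# (batch, length, embed_dim), the sequence of vectors the LSTM consumes.
embedded = embedding_table[batch]
print(embedded.shape)
```

The same word index always yields the same vector, wherever it appears in the batch.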

import os

import numpy as np
import tensorflow as tf
import tflearn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from tensorflow.contrib.learn.python import learn
from tflearn.data_utils import to_categorical, pad_sequences
 
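MAX_DOCUMENT_LENGTH = 200  # each review is truncated or padded to this many word indices
EMBEDDING_SIZE = 50  # note: unused below; the embedding layer hard-codes output_dim=128

n_words = 0  # vocabulary size, set in main() and read in do_rnn()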
 
def load_one_file(filename):
    # Read one review file into a single string.
    with open(filename) as f:
        return f.read()

def load_files(rootdir, label):
    # Load every regular file under rootdir and give each one the same label.
    x = []
    y = []
    for name in os.listdir(rootdir):
        path = os.path.join(rootdir, name)
        if os.path.isfile(path):
            print("Load file %s" % path)
            y.append(label)
            x.append(load_one_file(path))
    return x, y
 
def load_data():
    # Label 0 = positive review, label 1 = negative review (the class being detected).
    x1, y1 = load_files("../data/movie-review-data/review_polarity/txt_sentoken/pos/", 0)
    x2, y2 = load_files("../data/movie-review-data/review_polarity/txt_sentoken/neg/", 1)
    return x1 + x2, y1 + y2
 
def do_rnn(trainX, testX, trainY, testY):
    global n_words
    # Data preprocessing
    # Sequence padding
    print("GET n_words embedding %d" % n_words)

    trainX = pad_sequences(trainX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
    testX = pad_sequences(testX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    print(trainX[:10])
    print(testX[:10])
    # Network building
    net = tflearn.input_data([None, MAX_DOCUMENT_LENGTH])
    net = tflearn.embedding(net, input_dim=n_words, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
 
    # Training
 
    model = tflearn.DNN(net, tensorboard_verbose=3)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
             batch_size=32,run_id="maidou")
 
def do_NB(x_train, x_test, y_train, y_test):
    gnb = GaussianNB()
    y_predict = gnb.fit(x_train, y_train).predict(x_test)
    score = metrics.accuracy_score(y_test, y_predict)
    print('NB Accuracy: {0:f}'.format(score))
 
def main(unused_argv):
    global n_words
 
    x,y=load_data()
 
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
 
    vp = learn.preprocessing.VocabularyProcessor(max_document_length=MAX_DOCUMENT_LENGTH, min_frequency=1)
    vp.fit(x)
    x_train = np.array(list(vp.transform(x_train)))
    x_test = np.array(list(vp.transform(x_test)))
    n_words=len(vp.vocabulary_)
    print('Total words: %d' % n_words)
 
    do_NB(x_train, x_test, y_train, y_test)
    do_rnn(x_train, x_test, y_train, y_test)
 
if __name__ == '__main__':
  tf.app.run()
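A note on the naive Bayes baseline: 51% is close to chance for a two-class problem. One plausible reason (an interpretation, not something the original post states or measures) is that do_NB passes the raw index sequences to GaussianNB, so the word ID at each position is treated as a continuous feature. A common alternative representation for count-based classifiers is a bag-of-words matrix. The sketch below converts index sequences into such a matrix; `ids_to_counts` is a hypothetical helper written for this example.

```python
import numpy as np


def ids_to_counts(sequences, n_words):
    """Turn padded index sequences into an (n_docs, n_words) word-count matrix."""
    counts = np.zeros((len(sequences), n_words), dtype=np.int64)
    for row, seq in enumerate(sequences):
        for idx in seq:
            if idx != 0:  # skip padding / unknown words
                counts[row, idx] += 1
    return counts


# Two toy "reviews" already encoded as word indices (0 = padding).
x = np.array([[1, 2, 2, 0],
              [3, 1, 0, 0]])
bow = ids_to_counts(x, n_words=4)
print(bow)
```

A matrix like this could be passed to a count-based classifier such as sklearn.naive_bayes.MultinomialNB in place of GaussianNB. Whether that narrows the gap to the LSTM is not something this post measures.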

A sample negative review:

plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what's the deal ?
watch the movie and " sorta " find out . . .
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package .
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly .
they seem to have taken this pretty neat concept , but executed it terribly .
so what are the problems with the movie ?
well , its main problem is that it's simply too jumbled .
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on .
there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained .
now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem .
it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes .
and do they make things entertaining , thrilling or even engaging , in the meantime ?
not really .
the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining .
i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before they are given the secret password to enter your world of understanding .
i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! !
okay , we get it . . . there
are people chasing her and we don't know who they are .
do we really need to see it over and over again ?
how about giving us different scenes offering further insight into all of the strangeness going down in the movie ?
apparently , the studio took this film away from its director and chopped it up themselves , and it shows .
there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge , would make more sense .
the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood .
but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling .
overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it .
oh , and by the way , this is not a horror or teen slasher flick . . . it's
just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids .
it also wrapped production two years ago and has been sitting on the shelves ever since .
whatever . . . skip
it !

A sample positive review:

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
in other words , don't dismiss this film because of its source .
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
the ghetto in question is , of course , whitechapel in 1888 london's east end .
it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
don't worry - it'll all make sense when you see it .
now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
the film , however , is all good .
2 : 00 - r for strong violence/gore , sexuality , language and drug content

Sample data after padding and after converting the labels to categorical (one-hot) form:

trainX, 3 samples:
[[ 1299     6     1  1596    26   354   155     1    62   101   537   252
      5    22  2048   516     4   140   252   119    19     1   147   226
     16    56    19     2   435    37     2    77   648    15     1   164
    222    22   389    12    93    39 19392   235    16   189    83  1299
      6     2  1426   453     7   976  1375    97     1    67     6  2928
     42  2489    58     1   251   225     3    36     4   133   120   305
    138     8  4730   244  1274    70  1018     4    49 14539  2290   947
   3881 22772  1594  1296    67  1812    11  2663  9397  7513  3133     2
   1619  8232   307 16958     4  2015  1329     1 16813  3571     4   869
   3376     5  1019    41     7   518    33   598     7     1  1600     4
  15406  1473    29     2    77   199   812 15956    21    33  1841   315
   1852   371  5280    27   468  2663   343     2   334 11397  1619     5
   1562    47    19     0     3  4239    11   100    10   234   219    10
      0     0     8    30     4   220   144     1   414     4  3226 11120
   3161    92   299   366   725  1010 27520     5  3343    76     7     1
   1205    12 12549  1121     4    44     1  2195  9938     6    23     0
     12  2663  6858     5  1425    19     2  1378]
 [ 1361     1  1647     4     1  4974   130    26 11041  1126   130  1232
      1    57    26     7   269   641     5   205  3325  1053     3  5152
   6318   622     2  5999     4   911   223    14  3772  5166 15739  6635
   2036   633     1  2146   778  2697   327  9589  8311     3  3031    19
     36     1  4974  8164    28     1  3103  4276  6344    27   618     2
   4266     5     1  4203  1427  1199  1083     7   150   192     1  2294
      3 15520   185    52     6     2  3689   572     4  6431 15520  6635
      6   130  1232     2  5020   778   503    12    36  2805     4  1538
   9333  4795  1518     4    25   405  1539 17927  6489  1427  6646    34
  17491    13  3501    99  1232  5309    17    90     2  4074  1232    32
     68 13660   162     5     2  7412   258    83     4   405   460    11
   8238 12857 18618  3890   922  3915     3   146    32     5   488    10
   2125  9736     5     2  2217 16298  3915    81  2529    48  1232   996
      4    54  1053   522    18   157     9   410    24    25     4 23045
    348    24  1535    35  1689     1  5410  1232 23995     3     4   220
      9   340    41  1053     6  1391 18618  9608 16865  1232    24   272
      6   681     7     0   100    20   109   642]
 [   83    59    25 11208     9   371  3442     7     2   546   181    29
    176   158    13   546 25133     3    13  1554  4819    20 25356    12
     36    46  5311     1  1075     4  3442   169    31   134     5    75
     11     1    98    44   104    22  6759    12     2 13377   235  1397
      4     1  1948   826 26697   371  3442  1605    13   260  1364 12771
   4462     7     2   429  1340    29     1   164    63  1142  7782  4587
   1599     6     7     1  1758  1217    12     2   541  8661  1142   168
  10363   541     3     9   588    33     5   826    37     1   546  4553
      4    36   140   300    93    97   361   168     2  8661    28     2
   1988   508     3   102     6  2524     5  7651   100   516  1180    20
   4837    11 13791     5     1    67     8   115   245   529   391   109
      2   821    78   578   198   715     5   103  1218    95     5     1
    415   662   337   415  1605   337  1415     3    10     2   571     6
   2311  2812     3 10809  3442     3  1599   245  2349     5  4402    87
   4339     3    18  1422  6642    12     2 11316  8790     5   819    46
    116   266   193  1599    32  1585     5   141    85     7   546   487
    144    18  1537  3442   124    41     4    13]]

trainY, 3 samples:

[[ 1. 0.] [ 0. 1.] [ 0. 1.]]
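The label conversion shown above can be reproduced with a few lines of NumPy. This is a stand-in written for illustration, not TFLearn's own to_categorical code; the function name `one_hot` is an assumption of this sketch.

```python
import numpy as np


def one_hot(labels, nb_classes):
    """Map integer class labels to one-hot rows."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((len(labels), nb_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


# Labels as assigned in load_data: 0 = positive review, 1 = negative review.
y = one_hot([0, 1, 1], nb_classes=2)
print(y)
```

With this convention the three trainY rows above correspond to one positive review followed by two negative ones.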

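Here MAX_DOCUMENT_LENGTH = 200, so every document is clipped to that length: text beyond the first 200 words is simply truncated and never used. This follows from the VocabularyProcessor signature: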

tf.contrib.learn.preprocessing.VocabularyProcessor (max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

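Parameters:

max_document_length: maximum document length. Longer texts are truncated; shorter ones are padded with 0.
min_frequency: minimum word frequency. Words that occur fewer times than this are left out of the vocabulary.
vocabulary: a CategoricalVocabulary object.
tokenizer_fn: the tokenizer function.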

Code:

from tensorflow.contrib import learn
import numpy as np
max_document_length = 4
x_text = [
    'i love you',
    'me too'
]
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
vocab_processor.fit(x_text)
# Encode a new sentence with the fitted vocabulary; the result is padded with 0 to length 4
print(next(vocab_processor.transform(['i me too'])).tolist())
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x)

Output:

[1, 4, 5, 0]
[[1 2 3 0]
 [4 5 0 0]]

Documentation: http://tflearn.org/data_utils/
