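Basic idea:

Take the first 200 words of each review. Then build a vocabulary and label each review with vocabulary indices (that is, simply number the first 200 words of each review), and use an LSTM to classify the reviews as positive or negative.

See the comments for a walkthrough of the code. The embedding layer plays essentially the same role as word2vec: it maps each word index to a dense, low-dimensional vector, which is a form of dimensionality reduction. Not every LSTM needs one, though: for image classification on MNIST, for example, there is no such layer.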
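To make the embedding remark concrete, here is a minimal pure-Python sketch of what an embedding layer computes: a row lookup into a table that holds one dense vector per vocabulary index. The table is filled with random numbers and the names `make_embedding_table` and `embed` are hypothetical, chosen for illustration only; in the actual model the table is a trainable weight that is learned together with the LSTM.

```python
import random


def make_embedding_table(n_words, dim, seed=0):
    """Return an n_words x dim table of small random floats (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_words)]


def embed(word_ids, table):
    """Map a sequence of word indices to a sequence of dense vectors by row lookup."""
    return [table[i] for i in word_ids]


table = make_embedding_table(n_words=10, dim=4)
review = [3, 7, 7, 0]                 # a toy "review" of 4 word indices
vectors = embed(review, table)
print(len(vectors), len(vectors[0]))  # 4 positions, each a 4-dimensional vector
```

The same word index always maps to the same vector, so the second and third positions above are identical. An MNIST model skips this step because pixel intensities are already numeric inputs.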

    # Written for TensorFlow 1.x (tf.contrib) and tflearn.
    import os

    import numpy as np
    import tensorflow as tf
    import tflearn
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from tensorflow.contrib.learn.python import learn
    from tflearn.data_utils import to_categorical, pad_sequences

    MAX_DOCUMENT_LENGTH = 200  # keep only the first 200 words of each review
    EMBEDDING_SIZE = 50        # defined but unused; the embedding layer below uses 128

    n_words = 0                # vocabulary size, set in main()


    def load_one_file(filename):
        """Read one review file into a single string."""
        with open(filename) as f:
            return f.read()


    def load_files(rootdir, label):
        """Load every file in rootdir and give each one the same label."""
        x = []
        y = []
        for name in os.listdir(rootdir):
            path = os.path.join(rootdir, name)
            if os.path.isfile(path):
                print("Load file %s" % path)
                y.append(label)
                x.append(load_one_file(path))
        return x, y


    def load_data():
        # Labels: 0 = positive review, 1 = negative review.
        x1, y1 = load_files("../data/movie-review-data/review_polarity/txt_sentoken/pos/", 0)
        x2, y2 = load_files("../data/movie-review-data/review_polarity/txt_sentoken/neg/", 1)
        return x1 + x2, y1 + y2


    def do_rnn(trainX, testX, trainY, testY):
        print("GET n_words embedding %d" % n_words)

        # Sequence padding: every row ends up exactly MAX_DOCUMENT_LENGTH long.
        trainX = pad_sequences(trainX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
        testX = pad_sequences(testX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
        # Convert integer labels to one-hot vectors: 0 -> [1, 0], 1 -> [0, 1].
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)

        print(trainX[:10])
        print(testX[:10])

        # Network: word indices -> embedding -> LSTM -> 2-way softmax.
        net = tflearn.input_data([None, MAX_DOCUMENT_LENGTH])
        net = tflearn.embedding(net, input_dim=n_words, output_dim=128)
        net = tflearn.lstm(net, 128, dropout=0.8)
        net = tflearn.fully_connected(net, 2, activation='softmax')
        net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                                 loss='categorical_crossentropy')

        # Training
        model = tflearn.DNN(net, tensorboard_verbose=3)
        model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
                  batch_size=32, run_id="maidou")


    def do_NB(x_train, x_test, y_train, y_test):
        # Baseline: Gaussian naive Bayes on the raw word-index vectors.
        gnb = GaussianNB()
        y_predict = gnb.fit(x_train, y_train).predict(x_test)
        score = metrics.accuracy_score(y_test, y_predict)
        print('NB Accuracy: {0:f}'.format(score))


    def main(unused_argv):
        global n_words

        x, y = load_data()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

        # Build the vocabulary, then turn each review into MAX_DOCUMENT_LENGTH word indices.
        vp = learn.preprocessing.VocabularyProcessor(max_document_length=MAX_DOCUMENT_LENGTH, min_frequency=1)
        vp.fit(x)
        x_train = np.array(list(vp.transform(x_train)))
        x_test = np.array(list(vp.transform(x_test)))
        n_words = len(vp.vocabulary_)
        print('Total words: %d' % n_words)

        do_NB(x_train, x_test, y_train, y_test)
        do_rnn(x_train, x_test, y_train, y_test)


    if __name__ == '__main__':
        tf.app.run()

A sample negative review:

    plot : two teen couples go to a church party , drink and then drive .
    they get into an accident .
    one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
    what's the deal ?
    watch the movie and " sorta " find out . . .
    critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package .
    which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly .
    they seem to have taken this pretty neat concept , but executed it terribly .
    so what are the problems with the movie ?
    well , its main problem is that it's simply too jumbled .
    it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on .
    there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained .
    now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem .
    it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes .
    and do they make things entertaining , thrilling or even engaging , in the meantime ?
    not really .
    the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining .
    i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before they are given the secret password to enter your world of understanding .
    i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! !
    okay , we get it . . . there
    are people chasing her and we don't know who they are .
    do we really need to see it over and over again ?
    how about giving us different scenes offering further insight into all of the strangeness going down in the movie ?
    apparently , the studio took this film away from its director and chopped it up themselves , and it shows .
    there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge , would make more sense .
    the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood .
    but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling .
    overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it .
    oh , and by the way , this is not a horror or teen slasher flick . . . it's
    just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids .
    it also wrapped production two years ago and has been sitting on the shelves ever since .
    whatever . . . skip
    it !

A sample positive review:

    films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
    for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
    to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
    the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
    in other words , don't dismiss this film because of its source .
    if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
    getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
    the ghetto in question is , of course , whitechapel in 1888 london's east end .
    it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
    when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
    abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
    upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
    i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
    in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
    it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
    and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
    don't worry - it'll all make sense when you see it .
    now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
    the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
    oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
    even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
    ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
    i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
    the film , however , is all good .
    2 : 00 - r for strong violence/gore , sexuality , language and drug content

Sample data after padding and one-hot (categorical) encoding:

    padded and one-hot encoded data:
    trainX (3 samples):
    [[ 1299 6 1 1596 26 354 155 1 62 101 537 252
       5 22 2048 516 4 140 252 119 19 1 147 226
       16 56 19 2 435 37 2 77 648 15 1 164
       222 22 389 12 93 39 19392 235 16 189 83 1299
       6 2 1426 453 7 976 1375 97 1 67 6 2928
       42 2489 58 1 251 225 3 36 4 133 120 305
       138 8 4730 244 1274 70 1018 4 49 14539 2290 947
       3881 22772 1594 1296 67 1812 11 2663 9397 7513 3133 2
       1619 8232 307 16958 4 2015 1329 1 16813 3571 4 869
       3376 5 1019 41 7 518 33 598 7 1 1600 4
       15406 1473 29 2 77 199 812 15956 21 33 1841 315
       1852 371 5280 27 468 2663 343 2 334 11397 1619 5
       1562 47 19 0 3 4239 11 100 10 234 219 10
       0 0 8 30 4 220 144 1 414 4 3226 11120
       3161 92 299 366 725 1010 27520 5 3343 76 7 1
       1205 12 12549 1121 4 44 1 2195 9938 6 23 0
       12 2663 6858 5 1425 19 2 1378]
     [ 1361 1 1647 4 1 4974 130 26 11041 1126 130 1232
       1 57 26 7 269 641 5 205 3325 1053 3 5152
       6318 622 2 5999 4 911 223 14 3772 5166 15739 6635
       2036 633 1 2146 778 2697 327 9589 8311 3 3031 19
       36 1 4974 8164 28 1 3103 4276 6344 27 618 2
       4266 5 1 4203 1427 1199 1083 7 150 192 1 2294
       3 15520 185 52 6 2 3689 572 4 6431 15520 6635
       6 130 1232 2 5020 778 503 12 36 2805 4 1538
       9333 4795 1518 4 25 405 1539 17927 6489 1427 6646 34
       17491 13 3501 99 1232 5309 17 90 2 4074 1232 32
       68 13660 162 5 2 7412 258 83 4 405 460 11
       8238 12857 18618 3890 922 3915 3 146 32 5 488 10
       2125 9736 5 2 2217 16298 3915 81 2529 48 1232 996
       4 54 1053 522 18 157 9 410 24 25 4 23045
       348 24 1535 35 1689 1 5410 1232 23995 3 4 220
       9 340 41 1053 6 1391 18618 9608 16865 1232 24 272
       6 681 7 0 100 20 109 642]
     [ 83 59 25 11208 9 371 3442 7 2 546 181 29
       176 158 13 546 25133 3 13 1554 4819 20 25356 12
       36 46 5311 1 1075 4 3442 169 31 134 5 75
       11 1 98 44 104 22 6759 12 2 13377 235 1397
       4 1 1948 826 26697 371 3442 1605 13 260 1364 12771
       4462 7 2 429 1340 29 1 164 63 1142 7782 4587
       1599 6 7 1 1758 1217 12 2 541 8661 1142 168
       10363 541 3 9 588 33 5 826 37 1 546 4553
       4 36 140 300 93 97 361 168 2 8661 28 2
       1988 508 3 102 6 2524 5 7651 100 516 1180 20
       4837 11 13791 5 1 67 8 115 245 529 391 109
       2 821 78 578 198 715 5 103 1218 95 5 1
       415 662 337 415 1605 337 1415 3 10 2 571 6
       2311 2812 3 10809 3442 3 1599 245 2349 5 4402 87
       4339 3 18 1422 6642 12 2 11316 8790 5 819 46
       116 266 193 1599 32 1585 5 141 85 7 546 487
       144 18 1537 3442 124 41 4 13]]
    trainY (3 samples):
    [[ 1. 0.] [ 0. 1.] [ 0. 1.]]
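The two transformations behind this sample can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not tflearn's implementation: `pad_or_truncate` and `one_hot` are hypothetical helper names, and both padding and truncation are assumed to happen at the end of the sequence.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Cut seq to maxlen items, or right-pad it with `value` up to maxlen."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))


def one_hot(labels, nb_classes):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0.0, 1.0]."""
    return [[1.0 if k == label else 0.0 for k in range(nb_classes)]
            for label in labels]


print(pad_or_truncate([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
print(one_hot([0, 1, 1], nb_classes=2))               # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

With the labels used in this post (0 = positive, 1 = negative), the three `trainY` rows above correspond to one positive review followed by two negative ones.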

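Here MAX_DOCUMENT_LENGTH = 200, so every document is cut to that length: anything beyond the first 200 words is simply truncated and never used. This follows from how VocabularyProcessor works:

    tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

Parameters:

max_document_length: the maximum document length. A text longer than this is truncated; a shorter one is padded with 0.
min_frequency: the minimum word frequency. Words that occur fewer times than this are left out of the vocabulary.
vocabulary: a CategoricalVocabulary object.
tokenizer_fn: the tokenizer function.

Example: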

    from tensorflow.contrib import learn
    import numpy as np

    max_document_length = 4
    x_text = [
        'i love you',
        'me too'
    ]
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    vocab_processor.fit(x_text)
    # Words seen during fit get ids 1..5; each row is padded with 0 up to length 4.
    print(next(vocab_processor.transform(['i me too'])).tolist())
    x = np.array(list(vocab_processor.fit_transform(x_text)))
    print(x)

Output:

    [1, 4, 5, 0]
    [[1 2 3 0]
     [4 5 0 0]]
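Since `tf.contrib` only exists in TensorFlow 1.x, the behaviour shown above can also be approximated without TensorFlow. The sketch below is a hypothetical stand-in (`SimpleVocab` is not a library class) built on several assumptions: whitespace tokenization, ids handed out in order of first appearance, id 0 reserved for padding and unknown words, and `min_frequency` applied exactly as described in the parameter list. The real class may differ in detail.

```python
from collections import Counter


class SimpleVocab:
    """Toy stand-in for VocabularyProcessor; id 0 means padding or unknown word."""

    def __init__(self, max_document_length, min_frequency=0):
        self.max_document_length = max_document_length
        self.min_frequency = min_frequency
        self.word_to_id = {}

    def fit(self, documents):
        # Count every word, then assign ids 1, 2, 3, ... in order of first appearance.
        counts = Counter(word for doc in documents for word in doc.split())
        self.word_to_id = {}
        for doc in documents:
            for word in doc.split():
                if counts[word] >= self.min_frequency and word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            ids = [self.word_to_id.get(word, 0) for word in doc.split()]
            ids = ids[:self.max_document_length]                 # truncate long documents
            ids += [0] * (self.max_document_length - len(ids))   # pad short ones with 0
            rows.append(ids)
        return rows


vocab = SimpleVocab(max_document_length=4).fit(['i love you', 'me too'])
print(vocab.transform(['i me too']))                  # [[1, 4, 5, 0]]
print(vocab.transform(['i love you', 'me too']))      # [[1, 2, 3, 0], [4, 5, 0, 0]]
print(vocab.transform(['you love me so very much']))  # [[3, 2, 4, 0]]
```

The last call shows both effects at once: the six-word input is cut to four positions, and the unseen word `so` maps to 0.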

Documentation: http://tflearn.org/data_utils/

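As the title of the original post puts it: on this negative movie review detection task, naive Bayes reaches only about 51% accuracy, while the LSTM can reach 99%.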