Natural Language Processing with Python

Chapter 6.1

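Because of how nltk.FreqDist orders its entries, the code that picks the feature words for the movie review texts needs a small change: the words are sorted explicitly by frequency instead of relying on the order of the distribution's keys.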

import nltk
from nltk.corpus import movie_reviews as mr


def document_features(document, words_features):
    # One boolean feature per candidate word: does the document contain it?
    document_words = set(document)
    features = {}
    for word in words_features:
        features['has(%s)' % word] = (word in document_words)
    return features


def test_doc_classification():
    # Unlike the book's version, the documents are not shuffled, so the
    # test set is simply the first 100 documents in corpus order.
    documents = [(list(mr.words(fileid)), category)
                 for category in mr.categories()
                 for fileid in mr.fileids(categories=category)]
    all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
    # Sort by descending frequency, break ties alphabetically, keep the top 2000.
    words_freq = sorted(all_words_dist.items(), key=lambda x: (-1 * x[1], x[0]))[:2000]
    words_features = [word[0] for word in words_freq]
    featuresets = [(document_features(doc, words_features), c) for (doc, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)


if __name__ == '__main__':
    test_doc_classification()
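
The frequency-ordering step can also be tried in isolation. The sketch below is only an illustration: it uses collections.Counter as a stand-in for nltk.FreqDist (an assumption, so that the snippet runs without NLTK or the corpus), the token list is made up, and top_words is a hypothetical helper name that does not appear in the code above.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the n most frequent lowercased tokens, ties broken alphabetically."""
    dist = Counter(w.lower() for w in tokens)
    # Negating the count sorts by descending frequency; the word itself is
    # the secondary key, so words with equal counts come out alphabetically.
    ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def document_features(document, words_features):
    """One boolean feature per candidate word, as in the post's extractor."""
    document_words = set(document)
    return {'has(%s)' % word: (word in document_words) for word in words_features}


# Made-up tokens, not taken from the movie_reviews corpus.
tokens = ['Great', 'plot', 'great', 'cast', 'dull', 'plot', 'great']
features = top_words(tokens, 3)
print(features)
print(document_features(['dull', 'plot'], features))
```

Because the sort key is fully determined by the count and the word, the selected feature list does not depend on the internal ordering of the frequency distribution.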

Running test_doc_classification() gives the following output; the accuracy is 0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
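
Each ratio in the listing compares how likely a feature value is under one label versus the other. As a rough illustration, the sketch below computes such a ratio from document counts using add-0.5 smoothing. The counts are invented, smoothed_prob and likelihood_ratio are hypothetical helper names, and the smoothing is a simplification that may not match NLTK's estimator in every detail.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing.

    bins is the number of possible feature values (True/False here);
    gamma is the pseudo-count added to each of them (an assumption).
    """
    return (count + gamma) / (total + bins * gamma)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature=True | label a) / P(feature=True | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Invented counts: the word occurs in 40 of 900 'pos' training documents
# and in 4 of 900 'neg' training documents.
ratio = likelihood_ratio(40, 900, 4, 900)
print('pos : neg = %.1f : 1.0' % ratio)
```

Read this way, a line such as "has(outstanding) = True pos : neg = 10.4 : 1.0" says that the feature being True is about ten times more likely in a positive review than in a negative one.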
