Document Classification

Natural Language Processing with Python

Chapter 6.1

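Because of an ordering issue with nltk.FreqDist, the code that picks the feature words from the movie review text has been changed slightly.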

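    import nltk
    from nltk.corpus import movie_reviews as mr


    def document_features(document, words_features):
        """Map a document to boolean features of the form has(word)."""
        document_words = set(document)  # a set gives fast membership tests
        features = {}
        for word in words_features:
            features['has(%s)' % word] = (word in document_words)
        return features


    def test_doc_classification():
        # One (word list, category) pair per review.
        documents = [(list(mr.words(fileid)), category)
                     for category in mr.categories()
                     for fileid in mr.fileids(categories=category)]
        all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
        # Sort explicitly by descending count, then alphabetically, and keep
        # the 2000 most frequent words as feature words.
        words_freq = sorted(all_words_dist.items(),
                            key=lambda x: (-x[1], x[0]))[:2000]
        words_features = [word for (word, count) in words_freq]
        featuresets = [(document_features(doc, words_features), c)
                       for (doc, c) in documents]
        # The first 100 feature sets are held out for testing. Unlike the
        # book's version, documents is not shuffled, so these are the first
        # 100 reviews in corpus order.
        train_set, test_set = featuresets[100:], featuresets[:100]
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        print(nltk.classify.accuracy(classifier, test_set))
        classifier.show_most_informative_features(5)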

Running test_doc_classification() gives the following output, with an accuracy of 0.86:

    0.86
    Most Informative Features
        has(outstanding) = True    pos : neg = 10.4 : 1.0
             has(seagal) = True    neg : pos =  8.7 : 1.0
              has(mulan) = True    pos : neg =  8.1 : 1.0
        has(wonderfully) = True    pos : neg =  6.3 : 1.0
              has(damon) = True    pos : neg =  5.7 : 1.0
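
The explicit sort used to pick the feature words can be tried on its own. The following is only a minimal sketch: it assumes collections.Counter as a stand-in for nltk.FreqDist (in NLTK 3, FreqDist is a Counter subclass), and top_words and the toy word list are illustrative names that do not appear in the book. Sorting on the key (-count, word) puts the most frequent words first and breaks ties alphabetically, so the selected words do not depend on the order in which the distribution happens to store them.

```python
from collections import Counter


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically."""
    # Negating the count sorts by descending frequency; the word itself
    # is the secondary key, so equal counts come out in a fixed order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy data: "film" and "plot" tie at 2, "bad" and "good" tie at 1.
counts = Counter("plot film good film bad plot".split())
print(top_words(counts, 3))  # ['film', 'plot', 'bad']
```

Passing an nltk.FreqDist instead of the Counter should work the same way, since only items() is used.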
