Document Classification
Natural Language Processing with Python
Chapter 6.1
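Because of how nltk.FreqDist orders its entries, the code that selects the feature words for the movie review texts is slightly modified from the book's version: the words are sorted explicitly by descending frequency, with ties broken alphabetically.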
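The explicit sort can be illustrated in isolation. The sketch below is only an illustration, not part of the original listing: it uses collections.Counter as a stand-in for nltk.FreqDist so that it runs without the corpus, and the helper name top_words and the toy token list are made up for the example.

```python
from collections import Counter


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically.

    Hypothetical helper for illustration; `counts` is any mapping of
    word -> frequency (a Counter here, standing in for nltk.FreqDist).
    """
    # Negating the count sorts by descending frequency; the word itself
    # is the secondary key, so equal counts come out in alphabetical order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy tokens, invented for the example.
tokens = ["good", "bad", "film", "good", "bad", "plot", "film", "good"]
counts = Counter(tokens)
print(top_words(counts, 3))
```

Because the ordering depends only on the counts and the words themselves, the same 2000 feature words are selected on every run.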
import nltk
from nltk.corpus import movie_reviews as mr


def document_features(document, words_features):
    # Map each candidate word to a boolean: does it occur in this document?
    document_words = set(document)
    features = {}
    for word in words_features:
        features['has(%s)' % word] = (word in document_words)
    return features


def test_doc_classification():
    # One (word list, category) pair per review.
    documents = [(list(mr.words(fileid)), category)
                 for category in mr.categories()
                 for fileid in mr.fileids(categories=category)]
    all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
    # Sort by descending count, then alphabetically, and keep the top 2000 words.
    words_freq = sorted(all_words_dist.items(), key=lambda x: (-1 * x[1], x[0]))[:2000]
    words_features = [word[0] for word in words_freq]
    featuresets = [(document_features(doc, words_features), c) for (doc, c) in documents]
    # Hold out the first 100 feature sets for testing; train on the rest.
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)
The result is shown below; the accuracy is 0.86:
0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
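To make the has(...) feature names in this output concrete, here is a small sketch of the same feature extractor applied to a toy document. The token list and the two-word vocabulary are invented for illustration; the real listing uses the 2000 most frequent corpus words instead.

```python
def document_features(document, words_features):
    """Build boolean 'has(word)' features, as in the listing above."""
    document_words = set(document)  # set lookup keeps membership tests fast
    return {'has(%s)' % word: (word in document_words)
            for word in words_features}


# Hypothetical inputs, not taken from the movie_reviews corpus.
toy_document = ["a", "wonderfully", "acted", "film"]
toy_vocabulary = ["wonderfully", "seagal"]

features = document_features(toy_document, toy_vocabulary)
print(features)
```

Each review is reduced to one such dictionary, and a line like "has(seagal) = True neg : pos = 8.7 : 1.0" reports how much more often that feature value appears in one category than in the other in the training set.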