Document Classification
Natural Language Processing with Python
Chapter 6.1
由于nltk.FreqDist的排序问题,获取电影文本特征词的代码有些微改动。
import nltk
from nltk.corpus import movie_reviews as mr def document_features(document,words_features):
document_words=set(document)
features={}
for word in words_features:
features['has(%s)' %word] = (word in document_words)
return features def test_doc_classification():
documents=[(list(mr.words(fileid)),category)
for category in mr.categories()
for fileid in mr.fileids(categories=category)]
all_words_dist=nltk.FreqDist(w.lower() for w in mr.words())
words_freq =sorted(all_words_dist.items(), key=lambda x: (-1*x[1], x[0]))[:2000]
words_features=[word[0] for word in words_freq] featuresets=[(document_features(doc,words_features),c) for (doc,c) in
documents] train_set, test_set= featuresets[100:],featuresets[:100]
classifier=nltk.NaiveBayesClassifier.train(train_set) print nltk.classify.accuracy(classifier,test_set) classifier.show_most_informative_features(5)
结果如下,accuracy为0.86:
0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
Document Classification的更多相关文章
- Support Vector Machines for classification
Support Vector Machines for classification To whet your appetite for support vector machines, here’s ...
- Classification of text documents: using a MLComp dataset
注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html ...
- [Tensorflow] RNN - 04. Work with CNN for Text Classification
Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...
- 论文列表——text classification
https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...
- Link-based Classification相关数据集
Link-based Classification相关数据集 Datasets Document Classification Datasets: CiteSeer: The CiteSeer dat ...
- #论文阅读# Universial language model fine-tuing for text classification
论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...
- Text Classification
Text Classification For purpose of word embedding extrinsic evaluation, especially downstream task. ...
- Machine Learning Algorithms Study Notes(2)--Supervised Learning
Machine Learning Algorithms Study Notes 高雪松 @雪松Cedro Microsoft MVP 本系列文章是Andrew Ng 在斯坦福的机器学习课程 CS 22 ...
- Similarity-based Learning
Similarity-based approaches to machine learning come from the idea that the best way to make a predi ...
随机推荐
- VS2010中安装AjaxControlToolkit
原文地址:http://www.asp.net/ajaxlibrary/act.ashx 第一步 下载Ajax Control Toolkit 进入网址http://ajaxcontroltoolki ...
- linux 文件系统操作()
1. 用Xshell 客户端连上远程主机. 2.ll 或 ls 查看当前目录下的文件或目录, cd / 切换到根目录, cd **切换到某个目录(或者叫进入某个文件夹) 3.文件的压缩命令:zip - ...
- hover带有动画效果的导航
html,body{overflow-x:hidden;} ul,li{list-style: none;} .nav{width:100%; height: 26px; overflow: hidd ...
- Sanatorium
Sanatorium time limit per test 1 second memory limit per test 256 megabytes input standard input out ...
- linux视频学习7(ssh, linux启动过程分析,加解压缩,java网络编程)
回顾数据库mysql的备份和恢复: show databases; user spdb1; show tables; 在mysql/bin目录下 执行备份: ./mysqldump -u root - ...
- div.2/Bellovin<最长上升子序列>
题意: 序列arr[i--n];输出以a[i]为结尾的最长上升子序列.1<=n<=100000; 思路: O(n*log(n)),求最长上升子序列. #include<cstdio& ...
- LoadRunner参数化
在场景中,每一个vuser能够按照取唯一值的策略,是unique one , 出现84800错误有以下2种(自我实验中得出) 1.vuser的个数大于参数给定的个数 2.vuser初始时间不够,在可通 ...
- CodeForces 83B Doctor
二分查找. #include<cstdio> #include<cstring> #include<cmath> #include<vector> #i ...
- Nginx反向代理导致PHP获取不到正确的HTTP_HOST,SERVER_NAME,客户端IP的解决方法
今天第一次陪Nginx负载均衡,发现PHP无法获取HTTP_HOST 贴上的Nginx配置 upstream abc.com { server 10.141.8.55:8005; server 10. ...
- 解读QML之二
QML文档 QML文档是用QML语法组成的字符串.一个文档定义了一个QML对象类型.文档以”.qml”最为后缀,可以保存在本地和网络上,可以使用代码生成.一 个在文档中定义的对象类型的实例,也可以使用 ...