Document Classification
Natural Language Processing with Python
Chapter 6.1
Because of how nltk.FreqDist orders its entries, the code that selects the feature words for the movie reviews is slightly modified.
import nltk
from nltk.corpus import movie_reviews as mr


def document_features(document, words_features):
    # Map each candidate word to whether it occurs in the document.
    document_words = set(document)
    features = {}
    for word in words_features:
        features['has(%s)' % word] = (word in document_words)
    return features


def test_doc_classification():
    # One (word list, category) pair per review.
    documents = [(list(mr.words(fileid)), category)
                 for category in mr.categories()
                 for fileid in mr.fileids(categories=category)]
    all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
    # Sort by descending count, then alphabetically, and keep the top 2000.
    words_freq = sorted(all_words_dist.items(), key=lambda x: (-1 * x[1], x[0]))[:2000]
    words_features = [word[0] for word in words_freq]
    featuresets = [(document_features(doc, words_features), c) for (doc, c) in documents]
    # The first 100 feature sets are held out for testing.
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)
The output is shown below; the accuracy is 0.86:
0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
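
The two steps that carry the change, ranking words by descending count with an alphabetical tie-break and turning a document into has(word) features, can be tried without downloading the corpus. The following is only a minimal sketch: collections.Counter stands in for nltk.FreqDist, and the helper name top_words and the toy token list are hypothetical, not part of the original code.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the n most frequent tokens; ties are broken alphabetically."""
    # Counter is used here as a stand-in for nltk.FreqDist (assumption).
    counts = Counter(t.lower() for t in tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def document_features(document, words_features):
    """Map each candidate word to whether it occurs in the document."""
    document_words = set(document)
    return {'has(%s)' % w: (w in document_words) for w in words_features}


# Hypothetical toy data, only to show the ordering and the feature dict.
toy_tokens = ["good", "bad", "good", "plot", "bad", "fine"]
vocab = top_words(toy_tokens, 3)
print(vocab)
print(document_features(["a", "good", "plot"], vocab))
```

Because the sort key is the pair (-count, word), words with equal counts always come out in the same order, so the selected vocabulary does not depend on how the frequency table happens to iterate.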