Natural Language Processing with Python

Chapter 6.1

由于nltk.FreqDist的排序问题,获取电影文本特征词的代码有些微改动。

 import nltk
from nltk.corpus import movie_reviews as mr def document_features(document,words_features):
document_words=set(document)
features={}
for word in words_features:
features['has(%s)' %word] = (word in document_words)
return features def test_doc_classification():
documents=[(list(mr.words(fileid)),category)
for category in mr.categories()
for fileid in mr.fileids(categories=category)]
all_words_dist=nltk.FreqDist(w.lower() for w in mr.words())
words_freq =sorted(all_words_dist.items(), key=lambda x: (-1*x[1], x[0]))[:2000]
words_features=[word[0] for word in words_freq] featuresets=[(document_features(doc,words_features),c) for (doc,c) in
documents] train_set, test_set= featuresets[100:],featuresets[:100]
classifier=nltk.NaiveBayesClassifier.train(train_set) print nltk.classify.accuracy(classifier,test_set) classifier.show_most_informative_features(5)

结果如下,accuracy为0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0

Document Classification的更多相关文章

  1. Support Vector Machines for classification

    Support Vector Machines for classification To whet your appetite for support vector machines, here’s ...

  2. Classification of text documents: using a MLComp dataset

    注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html ...

  3. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. Link-based Classification相关数据集

    Link-based Classification相关数据集 Datasets Document Classification Datasets: CiteSeer: The CiteSeer dat ...

  6. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  7. Text Classification

    Text Classification For purpose of word embedding extrinsic evaluation, especially downstream task. ...

  8. Machine Learning Algorithms Study Notes(2)--Supervised Learning

    Machine Learning Algorithms Study Notes 高雪松 @雪松Cedro Microsoft MVP 本系列文章是Andrew Ng 在斯坦福的机器学习课程 CS 22 ...

  9. Similarity-based Learning

    Similarity-based approaches to machine learning come from the idea that the best way to make a predi ...

随机推荐

  1. linux 内核协议栈收报流程(一)ixgbe网卡驱动

    首先模块加载insmod ixgbe.ko module_init(ixgbe_init_module); module_init(ixgbe_init_module); { int ret; pr_ ...

  2. js中Object.__proto__===Function.prototype

    参考:http://stackoverflow.com/questions/650764/how-does-proto-differ-from-constructor-prototype http:/ ...

  3. linux视频学习3(linux安装,shell,tcp/ip协议,网络配置)

    linux系统的安装: 1.linux系统的安装方式三种: 1.独立安装linux系统. 2.虚拟机安装linux系统. a.安装虚拟机,基本是一路点下去. b.安装linux. c.linux 安装 ...

  4. isinstance使用方法

    #!/usr/bin/python2.7    def displayNumType(num):    print num, 'is',    if isinstance(num,(int, long ...

  5. CDockablepane风格设置

    屏蔽掉pane右上角的几个按钮 即将CDockablePane右上角的三个按钮屏蔽. 1            去掉关闭按钮 在CDockablePane的派生类中,重写方法CanBeClosed即可 ...

  6. pycharm快捷键、常用设置、配置管理

    http://blog.csdn.net/pipisorry/article/details/39909057 pycharm学习技巧 Learning tips /pythoncharm/help/ ...

  7. 数据格式处理(数字,日期),java处理,jsp的fmt处理

    java 格式处理  public static String formatTosepara(float data) {DecimalFormat df = new DecimalFormat(&qu ...

  8. mplayer最全的命令

    前段时间做过qt内嵌mplayer的一个小程序,感觉mplayer还行不过不支持打开图片感觉有点无力.话不多说上代码: QString path="d:/1.mkv"; QWidg ...

  9. CodeForces 678D Iterated Linear Function

    简单矩阵快速幂. #include<cstdio> #include<cstring> #include<cmath> #include<algorithm& ...

  10. Android MediaScanner 详尽分析

    [Innost]: http://blog.csdn.net/Innost/article/details/6083467 ====================================== ...