注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

运行结果为:

Loading 20 newsgroups training set...
20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
13180 documents
20 categories
Extracting features from the dataset using a sparse vectorizer
done in 139.231000s
n_samples: 13180, n_features: 130274
Loading 20 newsgroups test set...
done in 0.000000s
Predicting the labels of the test set...
5648 documents
20 categories
Extracting features from the dataset using the same vectorizer
done in 7.082000s
n_samples: 5648, n_features: 130274
Testbenching a linear classifier...
parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
done in 22.012000s
Percentage of non zeros coef: 30.074190
Predicting the outcomes of the testing set
done in 0.172000s
Classification report on test set for classifier:
SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
penalty='l2', power_t=0.5, random_state=None, shuffle=True,
verbose=0, warm_start=False) precision recall f1-score support alt.atheism 0.95 0.93 0.94 245
comp.graphics 0.85 0.91 0.88 298
comp.os.ms-windows.misc 0.88 0.88 0.88 292
comp.sys.ibm.pc.hardware 0.82 0.80 0.81 301
comp.sys.mac.hardware 0.90 0.92 0.91 256
comp.windows.x 0.92 0.88 0.90 297
misc.forsale 0.87 0.89 0.88 290
rec.autos 0.93 0.94 0.94 324
rec.motorcycles 0.97 0.97 0.97 294
rec.sport.baseball 0.97 0.97 0.97 315
rec.sport.hockey 0.98 0.99 0.99 302
sci.crypt 0.97 0.96 0.96 297
sci.electronics 0.87 0.89 0.88 313
sci.med 0.97 0.97 0.97 277
sci.space 0.97 0.97 0.97 305
soc.religion.christian 0.95 0.96 0.95 293
talk.politics.guns 0.94 0.94 0.94 246
talk.politics.mideast 0.97 0.99 0.98 296
talk.politics.misc 0.96 0.92 0.94 236
talk.religion.misc 0.89 0.84 0.86 171 avg / total 0.93 0.93 0.93 5648 Confusion matrix:
[[227 0 0 0 0 0 0 0 0 0 0 1 2 1 1 1 0 1
0 11]
[ 0 271 3 8 2 5 2 0 0 1 0 0 3 1 1 0 0 1
0 0]
[ 0 7 256 14 5 6 1 0 0 0 0 0 2 0 1 0 0 0
0 0]
[ 1 8 12 240 9 3 12 2 0 0 0 1 12 0 0 1 0 0
0 0]
[ 0 1 3 6 235 2 4 0 0 0 0 1 3 0 1 0 0 0
0 0]
[ 0 17 9 4 0 260 0 0 1 1 0 0 2 0 2 0 1 0
0 0]
[ 0 1 3 7 3 0 257 7 2 0 0 1 8 0 1 0 0 0
0 0]
[ 0 0 0 2 1 0 5 305 2 3 0 0 4 1 0 0 1 0
0 0]
[ 0 0 0 0 1 0 3 3 285 0 0 0 1 0 0 1 0 0
0 0]
[ 0 0 0 0 0 0 3 2 0 305 2 1 1 0 0 0 0 0
1 0]
[ 0 0 0 0 0 0 1 0 1 0 300 0 0 0 0 0 0 0
0 0]
[ 0 0 1 1 0 2 0 1 0 0 0 284 0 1 1 0 2 2
1 1]
[ 0 2 2 10 2 2 6 5 1 0 1 1 279 1 1 0 0 0
0 0]
[ 0 3 0 0 1 1 1 0 0 0 0 0 0 269 0 1 1 0
0 0]
[ 0 5 0 0 1 0 0 0 0 0 2 0 1 0 295 0 0 0
1 0]
[ 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 282 1 0
0 3]
[ 0 0 1 0 0 0 0 0 1 3 0 0 1 0 0 1 232 1
5 1]
[ 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 293
0 0]
[ 0 2 0 0 0 0 2 0 0 1 0 1 0 1 0 0 7 4
216 2]
[ 11 0 0 0 0 0 0 0 0 0 0 1 0 2 0 9 2 1
2 143]]
Testbenching a MultinomialNB classifier...
parameters: {'alpha': 0.01}
done in 0.608000s
Percentage of non zeros coef: 100.000000
Predicting the outcomes of the testing set
done in 0.203000s
Classification report on test set for classifier:
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True) precision recall f1-score support alt.atheism 0.90 0.92 0.91 245
comp.graphics 0.81 0.89 0.85 298
comp.os.ms-windows.misc 0.87 0.83 0.85 292
comp.sys.ibm.pc.hardware 0.82 0.83 0.83 301
comp.sys.mac.hardware 0.90 0.92 0.91 256
comp.windows.x 0.90 0.89 0.89 297
misc.forsale 0.90 0.84 0.87 290
rec.autos 0.93 0.94 0.93 324
rec.motorcycles 0.98 0.97 0.97 294
rec.sport.baseball 0.97 0.97 0.97 315
rec.sport.hockey 0.97 0.99 0.98 302
sci.crypt 0.95 0.95 0.95 297
sci.electronics 0.90 0.86 0.88 313
sci.med 0.97 0.96 0.97 277
sci.space 0.95 0.97 0.96 305
soc.religion.christian 0.91 0.97 0.94 293
talk.politics.guns 0.89 0.96 0.93 246
talk.politics.mideast 0.95 0.98 0.97 296
talk.politics.misc 0.93 0.87 0.90 236
talk.religion.misc 0.92 0.74 0.82 171 avg / total 0.92 0.92 0.92 5648 Confusion matrix:
[[226 0 0 0 0 0 0 0 0 1 0 0 0 0 2 7 0 0
0 9]
[ 1 266 7 4 1 6 2 2 0 0 0 3 4 1 1 0 0 0
0 0]
[ 0 11 243 22 4 7 1 0 0 0 0 1 2 0 0 0 0 0
1 0]
[ 0 7 12 250 8 4 9 0 0 1 1 0 9 0 0 0 0 0
0 0]
[ 0 3 3 5 235 2 3 1 0 0 0 2 1 0 1 0 0 0
0 0]
[ 0 19 5 3 2 263 0 0 0 0 0 1 0 1 1 0 2 0
0 0]
[ 0 1 4 9 3 1 243 9 2 3 1 0 8 0 0 0 2 2
2 0]
[ 0 0 0 1 1 0 5 304 1 2 0 0 3 2 3 1 1 0
0 0]
[ 0 0 0 0 0 2 2 3 285 0 0 0 1 0 0 0 0 0
0 1]
[ 0 1 0 0 0 1 1 3 0 304 5 0 0 0 0 0 0 0
0 0]
[ 0 0 0 0 0 0 0 0 1 2 299 0 0 0 0 0 0 0
0 0]
[ 0 2 2 1 0 1 2 0 0 0 0 283 1 0 0 0 2 1
2 0]
[ 0 11 1 9 3 1 3 5 1 0 1 4 270 1 3 0 0 0
0 0]
[ 0 2 0 1 1 1 0 0 0 0 0 1 0 266 2 1 0 0
2 0]
[ 0 2 0 0 1 0 0 0 0 0 0 2 1 1 296 0 1 1
0 0]
[ 3 1 0 0 0 0 0 0 0 0 1 0 0 2 0 283 0 1
2 0]
[ 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 237 1
3 1]
[ 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 3 0 291
0 0]
[ 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1 1 17 6
206 0]
[ 18 1 0 0 0 0 0 0 0 1 0 0 0 0 0 14 4 2
4 127]]

步骤为:

一、preprocessing

1.加载训练集(training set)

2.训练集特征提取,用TfidfVectorizer,得到训练集上的x_train和y_train

3.加载测试集(test set)

4.测试集特征提取,用TfidfVectorizer,得到测试集上的x_train和y_train

二、定义Benchmark classifiers

5.训练,clf = clf_class(**params).fit(X_train, y_train)

6.测试,pred = clf.predict(X_test)

7.测试集上分类报告,print(classification_report(y_test, pred,target_names=news_test.target_names))

8.confusion matrix,cm = confusion_matrix(y_test, pred)

三、训练

9.调用两个分类器,SGDClassifier和MultinomialNB

Classification of text documents: using a MLComp dataset的更多相关文章

  1. Clustering text documents using k-means

    源代码的链接为http://scikit-learn.org/stable/auto_examples/text/document_clustering.html Loading 20 newsgro ...

  2. scikit-learn:4.2.3. Text feature extraction

    http://scikit-learn.org/stable/modules/feature_extraction.html 4.2节内容太多,因此将文本特征提取单独作为一块. 1.the bag o ...

  3. Python scikit-learn机器学习工具包学习笔记

    feature_selection模块 Univariate feature selection:单变量的特征选择 单变量特征选择的原理是分别单独的计算每个变量的某个统计指标,根据该指标来判断哪些指标 ...

  4. 特征选择 (feature_selection)

    目录 特征选择 (feature_selection) Filter 1. 移除低方差的特征 (Removing features with low variance) 2. 单变量特征选择 (Uni ...

  5. sklearn—特征工程

    sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频) https://study.163.com/course/introduction.htm?courseId=1005269003& ...

  6. scikit-learn:3.3. Model evaluation: quantifying the quality of predictions

    參考:http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter 三种方法评估模型的预測质量: Est ...

  7. [Scikit-learn] 1.1 Generalized Linear Models - Comparing various online solvers

    数据集分割 一.Online learning for 手写识别 From: Comparing various online solvers An example showing how diffe ...

  8. [Scikit-learn] Yield miniBatch for online learning.

    From: Out-of-core classification of text documents Code:  """ ======================= ...

  9. sklearn中的模型评估-构建评估函数

    1.介绍 有三种不同的方法来评估一个模型的预测质量: estimator的score方法:sklearn中的estimator都具有一个score方法,它提供了一个缺省的评估法则来解决问题. Scor ...

随机推荐

  1. 转 : 如何用sys as sysdba权限连接数据库进行EXP/IMP

    使用sys as sysdba权限进行EXP/IMP与其它用户稍有不同,详细内容如下(摘自metalink) Applies to: Oracle Server - Enterprise Editio ...

  2. Spring Boot 系列教程6-全局异常处理

    @ControllerAdvice源码 package org.springframework.web.bind.annotation; import java.lang.annotation.Ann ...

  3. C、C++、java的区别

    经常听到有人在抱怨这个语言哪里哪里 不好,那个语言又是如何的优秀.对于这样 的牢骚,我只是一笑而过. 就我而言,语言只是工具,没有好坏之 分.只要你采用相应的语言,完成对应的工 作,那你的目标就完成了 ...

  4. 事务(JDBC、Hibernate、Spring)

    如果不用spring管理事务,我们自己写代码来操作事务.那么这个代码怎么写要看底层怎么访问数据库了. 当采用原生JDBC访问数据库时,操作事务需要使用java.sql.Connection的API.开 ...

  5. sql trim()函数去掉两头空格

    1.sql trim()函数去掉两头空格 sql语法中没有直接去除两头空格的函数,但有ltrim()去除左空格rtrim()去除右空格. 合起来用就是sql的trim()函数,即select ltri ...

  6. 4D(DRG、DLG、DOM、DEM)数据 概念

    抱歉原文链接未知 Technorati 标签: DRG,DLG,DOM,DEM 以下有不同的说法,但是意思都很相近. 一. DOM (数字正射影像图):利用数字高程模型对扫描处理的数字化的航空相片.遥 ...

  7. 一个设置 material design icon的插件工具

    一个设置 material design icon的插件工具 github地址:https://github.com/konifar/android-material-design-icon-gene ...

  8. 《Peering Inside the PE: A Tour of the Win32 Portable Executable File Format》阅读笔记二

    Common Sections The .text section is where all general-purpose code emitted by the compiler or assem ...

  9. opencv2 矩阵方式 resize图像缩放代码(转载)

    http://blog.sina.com.cn/s/blog_74a459380101r0yx.html opencv2 矩阵方式 resize图像缩放代码(转载) (2014-05-16 09:55 ...

  10. jq中的移除事件

    unbind([type],[data]); 第一个参数是事件类型,第二个参数是将要移出的函数 说明: 1,如果没有参数,则删除所有绑定的事件 2,如果只提供了事件类型作为参数,则只删除该类型的绑定事 ...