Text classification with Naïve Bayes

In this section we will try to classify newsgroup messages using a dataset that can be retrieved from within scikit-learn. This dataset consists of around 19,000 newsgroup messages from 20 different topics ranging from politics and religion to sports and science.

As usual, we first start by importing our pylab environment:

  1. %pylab inline

Our dataset can be obtained by importing the fetch_20newsgroups function from the sklearn.datasets module. We have to specify whether we want to import a part or all of the set of instances (we will import all of them).

  1. from sklearn.datasets import fetch_20newsgroups
  2. news = fetch_20newsgroups(subset='all')

If we look at the properties of the dataset, we will find that we have the usual ones: DESCR, data, target, and target_names. The difference now is that data holds a list of text contents, instead of a numpy matrix:

  1. print(type(news.data), type(news.target), type(news.target_names))
  2. print(news.target_names)
  3. print(len(news.data))
  4. print(len(news.target))
  1. <class 'list'> <class 'numpy.ndarray'> <class 'list'>
  2. ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
  3. 18846
  4. 18846

If you look at, say, the first instance, you will see the content of a newsgroup message, and you can get its corresponding category:

  1. print(news.data[0])
  2. print(news.target[0], news.target_names[news.target[0]])
  1. From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
  2. Subject: Pens fans reactions
  3. Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
  4. Lines: 12
  5. NNTP-Posting-Host: po4.andrew.cmu.edu
  6.  
  7. I am sure some bashers of Pens fans are pretty confused about the lack
  8. of any kind of posts about the recent Pens massacre of the Devils. Actually,
  9. I am bit puzzled too and a bit relieved. However, I am going to put an end
  10. to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
  11. are killing those Devils worse than I thought. Jagr just showed you why
  12. he is much better than his regular season stats. He is also a lot
  13. fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
  14. fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
  15. regular season game. PENS RULE!!!
  16.  
  17. 10 rec.sport.hockey

Preprocessing the data

Our machine learning algorithms can work only on numeric data, so our next step will be to convert our text-based dataset to a numeric dataset. Currently we only have one feature, the text content of the message; we need some function that transforms a text into a meaningful set of numeric features.

Intuitively, one could look at which words (or, more precisely, tokens, including numbers and punctuation signs) are used in each of the text categories, and try to characterize each category with the frequency distribution of those words. The sklearn.feature_extraction.text module has some useful utilities to build numeric feature vectors from text documents.

Before starting the transformation, we will have to partition our data into training and testing sets. The loaded data is already in a random order, so we only have to split the data into, for example, 75 percent for training and the remaining 25 percent for testing.
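The split code itself is not shown in the original post; the following is a minimal sketch of the split just described, where the names X_train, X_test, y_train, and y_test are assumptions that the rest of this post relies on:

    # hold out the last 25 percent of the (already shuffled) messages for testing
    SPLIT_PERC = 0.75
    split_size = int(len(news.data) * SPLIT_PERC)
    X_train = news.data[:split_size]
    X_test = news.data[split_size:]
    y_train = news.target[:split_size]
    y_test = news.target[split_size:]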

If you look inside the sklearn.feature_extraction.text module, you will find three different classes that can transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer.

The difference between them resides in the calculations they perform to obtain the numeric features.

  • CountVectorizer basically creates a dictionary of words from the corpus, and then transforms each instance into a vector that counts how many times each word appears in the document.
  • HashingVectorizer, instead of building and maintaining the dictionary in memory, implements a hashing function that maps tokens to feature indices, and then counts occurrences in the same way.
  • TfidfVectorizer works like CountVectorizer, but uses a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a statistic that measures the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document compared with the whole set of documents. You can see it as a way to normalize the results and avoid words that are so frequent they cannot be used to characterize instances. (A small sketch comparing these vectorizers follows this list.)
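As a quick illustration (not part of the original text; the tiny corpus below is made up for this sketch), you can compare what CountVectorizer and TfidfVectorizer produce for the same documents:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    toy_corpus = ['the pens beat the devils',
                  'the devils lost again',
                  'graphics cards render images']

    count_vect = CountVectorizer()
    print(count_vect.fit_transform(toy_corpus).toarray())  # raw term counts per document
    print(count_vect.get_feature_names())                  # the vocabulary that was learned

    tfidf_vect = TfidfVectorizer()
    print(tfidf_vect.fit_transform(toy_corpus).toarray())  # TF-IDF weights instead of raw counts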

Training a Naïve Bayes classifier

We will create a Naïve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the sklearn.naive_bayes module. scikit-learn has a very useful class called Pipeline (available in the sklearn.pipeline module) that eases the construction of a compound classifier, which consists of several vectorizers and classifiers.

We will create three different classifiers by combining MultinomialNB with the three different text vectorizers just mentioned, and compare which one performs better using the default parameters:

  1. from sklearn.naive_bayes import MultinomialNB
  2. from sklearn.pipeline import Pipeline
  3. from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
  4. clf_1 = Pipeline([
  5. ('vect', CountVectorizer()),
  6. ('clf', MultinomialNB()),
  7. ])
  8. clf_2 = Pipeline([
  9. ('vect', HashingVectorizer(non_negative=True)),
  10. ('clf', MultinomialNB()),
  11. ])
  12. clf_3 = Pipeline([
  13. ('vect', TfidfVectorizer()),
  14. ('clf', MultinomialNB()),
  15. ])

We will define a function that takes a classifier and performs K-fold cross-validation over the specified X and y values:

  1. from sklearn.cross_validation import cross_val_score, KFold
  2. from scipy.stats import sem
  3. def evaluate_cross_validation(clf, X, y, K):
  4. # create a k-fold cross-validation iterator of K folds
  5. cv = KFold(len(y), K, shuffle=True, random_state=0)
  6. # by default the score used is the one returned by score method of the estimator (accuracy)
  7. scores = cross_val_score(clf, X, y, cv=cv)
  8. print(scores)
  9. print(("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores)))

Then we will perform a five-fold cross-validation using each one of the classifiers:

  1. clfs = [clf_1, clf_2, clf_3]
  2. for clf in clfs:
  3. evaluate_cross_validation(clf, news.data, news.target, 5)
  1. [ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]
  2. Mean score: 0.853 (+/-0.003)
  3. [ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]
  4. Mean score: 0.770 (+/-0.005)
  5. [ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]
  6. Mean score: 0.850 (+/-0.004)

As you can see, CountVectorizer and TfidfVectorizer performed similarly, and both performed much better than HashingVectorizer.

Let's continue with TfidfVectorizer; we could try to improve the results by parsing the text documents into tokens with a different regular expression:

  1. clf_4 = Pipeline([
  2. ('vect', TfidfVectorizer(
  3. token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
  4. )),
  5. ('clf', MultinomialNB()),
  6. ])

(The book's code uses the ur"..." string prefix, which is Python 2 syntax; under Python 3 it raises SyntaxError: invalid syntax, so a plain raw string r"..." is used here instead.)

The default regular expression, r"\b\w\w+\b", considers alphanumeric characters and the underscore. Perhaps also considering the hyphen and the dot could improve the tokenization, so that tokens such as Wi-Fi and site.com are kept together. The new regular expression could be: r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b". If you have questions about how to define regular expressions, please refer to the Python re module documentation. Let's try our new classifier:

  1. evaluate_cross_validation(clf_4, news.data, news.target, 5)

We have a slight improvement from 0.84 to 0.85.
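The gain comes from tokens such as wi-fi and site.com no longer being split into pieces. Here is a quick check of the two patterns (not from the original text; the sample sentence is made up, and the lowercasing mimics what the vectorizer does by default):

    import re

    text = "check site.com for wi-fi drivers".lower()
    default_tokens = re.findall(r"\b\w\w+\b", text)
    custom_tokens = re.findall(r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b", text)
    print(default_tokens)  # ['check', 'site', 'com', 'for', 'wi', 'fi', 'drivers']
    print(custom_tokens)   # ['check', 'site.com', 'for', 'wi-fi', 'drivers']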

Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as too frequent words, or words we do not a priori expect to provide information about the particular topic.

We will define a function to load the stop words from a text file as follows:

  1. def get_stop_words():
  2. result = set()
  3. for line in open('stopwords_en.txt', 'r').readlines():
  4. result.add(line.strip())
  5. return result

And create a new classifer with this new parameter as follows:

  1. clf_5 = Pipeline([
  2. ('vect', TfidfVectorizer(
  3. stop_words=get_stop_words(),
  4. token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
  5. )),
  6. ('clf', MultinomialNB()),
  7. ])
  8. evaluate_cross_validation(clf_5, news.data, news.target, 5)

The preceding code shows another improvement from 0.85 to 0.87.

Let's keep this vectorizer and start looking at the MultinomialNB parameters. This classifier has few parameters to tweak; the most important is the alpha parameter, which is a smoothing parameter: it adds a pseudo-count to every word so that words never seen in a category during training do not receive zero probability. Let's set it to a lower value; instead of setting alpha to 1.0 (the default value), we will set it to 0.01:

  1. clf_6 = Pipeline([
  2. ('vect', TfidfVectorizer(
  3. stop_words=get_stop_words(),
  4. token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
  5. )),
  6. ('clf', MultinomialNB(alpha=0.01)),
  7. ])
  8.  
  9. evaluate_cross_validation(clf_6, X_train, y_train, 5)

The results had an important boost from 0.89 to 0.92, which is pretty good. At this point, we could continue doing trials with different values of alpha or making new modifications to the vectorizer.
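One systematic way to run such trials (a sketch, not from the original post; note that in newer scikit-learn versions GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search) is to grid-search the alpha value of the clf step inside the pipeline:

    from sklearn.grid_search import GridSearchCV

    # try several smoothing values for the MultinomialNB step of the pipeline
    params = {'clf__alpha': [0.001, 0.01, 0.1, 1.0]}
    gs = GridSearchCV(clf_5, params, cv=5)
    gs.fit(X_train, y_train)
    print(gs.best_params_, gs.best_score_)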

Evaluating the performance

If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.

We will define a helper function, train_and_evaluate (see the train_and_evaluate function at http://www.cnblogs.com/iamxyq/p/5912048.html), that will train the model on the entire training set and evaluate its accuracy on both the training and the testing sets.
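The function is not reproduced in this post; a minimal sketch of such a helper, assuming it simply fits the pipeline and reports accuracy on both sets plus a per-class report on the test set, could look like this:

    from sklearn import metrics

    def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
        # fit the whole pipeline (vectorizer + classifier) on the training set
        clf.fit(X_train, y_train)
        print("Accuracy on training set:", clf.score(X_train, y_train))
        print("Accuracy on testing set:", clf.score(X_test, y_test))
        # detailed per-class metrics and confusion matrix on the test set
        y_pred = clf.predict(X_test)
        print(metrics.classification_report(y_test, y_pred, target_names=news.target_names))
        print(metrics.confusion_matrix(y_test, y_pred))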

We will evaluate our best classifier.

  1. train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)

If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:

  1. print(len(clf_7.named_steps['vect'].get_feature_names()))

Let's print the feature names.

  1. clf_7.named_steps['vect'].get_feature_names()

The following table presents an extract of the results:

You can see that some words are semantically very similar, for example, sand and sands, sanctuaries and sanctuary. Perhaps if the plurals and the singulars were counted in the same bucket, we would represent the documents better. This is a very common task, which can be solved using stemming, a technique that relates two words having the same lexical root.
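One common way to plug stemming into scikit-learn (a sketch not shown in the original text, assuming NLTK and its SnowballStemmer are available) is to override the vectorizer's analyzer so that every token is stemmed before counting:

    import nltk.stem
    from sklearn.feature_extraction.text import TfidfVectorizer

    english_stemmer = nltk.stem.SnowballStemmer('english')

    class StemmedTfidfVectorizer(TfidfVectorizer):
        def build_analyzer(self):
            # reuse the standard preprocessing and tokenization, then stem each token
            analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
            return lambda doc: (english_stemmer.stem(token) for token in analyzer(doc))

    # 'sanctuaries' and 'sanctuary' now map to the same stem, so they share a bucket
    print(english_stemmer.stem('sanctuaries') == english_stemmer.stem('sanctuary'))  # True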
