Machine Learning: Text Feature Extraction with the Bag-of-Words Model

Suppose we have a piece of text: "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends." How do we extract features from it?

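A simple approach is the bag-of-words model. Selected words from the text are put into a "bag", the number of times each word in the bag occurs in the text is counted (ignoring grammar and word order), and the counts are represented as a vector.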
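
To make the idea concrete, here is a minimal hand-rolled sketch using only the standard library. The regular expression and the variable names are illustrative assumptions, not part of any library API; the pattern simply keeps lowercase runs of two or more word characters.

```python
import re
from collections import Counter

text = ("I have a cat, his name is Huzihu. Huzihu is really cute and friendly. "
        "We are good friends.")

# Lowercase the text and keep runs of two or more word characters
# (an illustrative tokenization rule chosen for this sketch).
tokens = re.findall(r"\b\w\w+\b", text.lower())

# The "bag": a sorted vocabulary with one vector position per distinct word.
vocabulary = sorted(set(tokens))

# Count occurrences; word order and grammar play no role.
counts = Counter(tokens)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Each position of `vector` corresponds to the word at the same position in `vocabulary`, so the vector alone no longer says anything about where a word stood in the sentence.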

Word counts can be computed with scikit-learn's CountVectorizer:

text1="I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends." 

from sklearn.feature_extraction.text import CountVectorizer

CV=CountVectorizer()

words=CV.fit_transform([text1]) # note: the input must be a list of strings, so the single text is wrapped in a list

print(words)

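CountVectorizer first builds a vocabulary, a dictionary whose keys are the words in the text and whose values are their column indices. It then converts the text into a term-count matrix. Printing the result shows the sparse matrix as (document index, term index) pairs followed by the count: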

  (0, 3)        1

  (0, 4)        1

  (0, 0)        1

  (0, 11)       1

  (0, 2)        1

  (0, 10)       1

  (0, 7)        2

  (0, 8)        2

  (0, 9)        1

  (0, 6)        1

  (0, 1)        1

  (0, 5)        1

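(0, 7)        2  means that in document 0 the term with index 7 ("huzihu" in the run shown here) appears 2 times. The indices are zero-based, and the exact index-to-word mapping can be checked with CV.vocabulary_.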
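
The sparse printout is hard to read because it shows indices instead of words. The following sketch pairs every term with its count by inverting `vocabulary_`; the names `cv`, `index_to_term`, and `term_counts` are chosen for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

text1 = ("I have a cat, his name is Huzihu. Huzihu is really cute and friendly. "
         "We are good friends.")

cv = CountVectorizer()
matrix = cv.fit_transform([text1])

# vocabulary_ maps each term to its zero-based column index; invert it.
index_to_term = {index: term for term, index in cv.vocabulary_.items()}

# Row 0 of the dense array holds the counts for the only document.
row = matrix.toarray()[0]
term_counts = {index_to_term[i]: int(count) for i, count in enumerate(row)}

print(term_counts)
```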

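Note: CountVectorizer converts all text to lowercase and then tokenizes it. Tokenization is the process of splitting a sentence into tokens, that is, meaningful sequences of characters. Tokens are mostly words, but they can also be shorter pieces such as punctuation marks or affixes. By default, CountVectorizer uses a regular expression that extracts sequences of two or more word characters and treats whitespace and punctuation as separators, which is why single-character words such as "I" and "a" are dropped. (Source: http://lib.csdn.net/article/machinelearning/42813)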
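
The preprocessing described in this note can be observed without fitting anything. The sketch below uses `build_analyzer()`, which returns the callable that CountVectorizer applies to each document (lowercasing plus tokenization with the default settings); the sample sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer bundles lowercasing and tokenization as configured
# on the vectorizer (defaults here).
analyzer = CountVectorizer().build_analyzer()

tokens_a = analyzer("I have a cat, his name is Huzihu.")
tokens_b = analyzer("We're old.")

print(tokens_a)
print(tokens_b)
```

Note how the apostrophe acts as a separator, so "We're" yields two tokens. This is where the feature 're' in the later three-document vocabulary comes from.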

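Text features are usually extracted for document classification, which requires knowing how similar the documents are to one another. One way to compare them is to compute the Euclidean distance between their feature vectors.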
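
As a reminder of what that distance is, here is a small NumPy sketch: the square root of the sum of squared element-wise differences. The helper name `euclidean` and the two toy count vectors are assumptions made for this example.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance: square root of the summed squared differences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Two toy term-count vectors over the same four-word vocabulary.
doc_a = [1, 0, 2, 1]
doc_b = [0, 1, 2, 3]

print(euclidean(doc_a, doc_b))
```

The smaller the value, the more alike the two count vectors are; identical vectors have distance 0.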

Let's add two more texts and see how similar the three texts are to one another.

Text 2: "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

Text 3: "We all need to make plans for the future, otherwise we will regret when we're old."

text1="I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

text2="My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

text3= "We all need to make plans for the future, otherwise we will regret when we're old."

corpus=[text1,text2,text3] # put the three documents into a corpus

from sklearn.feature_extraction.text import CountVectorizer

CV=CountVectorizer()

words=CV.fit_transform(corpus)

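words_frequency=words.todense()  # todense() converts the sparse result into a dense matrix

print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2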

print(words_frequency)

The output is the list of feature names, followed by the matrix whose rows are the term-count vectors of the three texts:

['all', 'and', 'are', 'cat', 'cousin', 'cute', 'dog', 'eating', 'for', 'friendly', 'friends', 'future', 'good', 'has', 'have', 'he', 'his', 'huzihu', 'is', 'likes', 'make', 'my', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 're', 'really', 'regret', 'sleeping', 'the', 'to', 'we', 'when', 'will']

[[0 1 1 ..., 1 0 0]

 [0 1 0 ..., 0 0 0]

 [1 0 0 ..., 3 1 1]]

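Looking at the first column of the matrix, the first two values are 0 and the last is 1. This means that "all" does not appear in the first two texts and appears once in the third.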

Next, we can use euclidean_distances from sklearn to compute the distances between the feature vectors of the three texts.

from sklearn.metrics.pairwise import euclidean_distances

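for i,j in ([0,1],[0,2],[1,2]):

    dist=euclidean_distances(words[i],words[j])  # compare the sparse rows directly; newer scikit-learn versions reject the np.matrix returned by todense()

    print("The Euclidean distance between the feature vectors of text {} and text {} is: {}".format(i+1,j+1,dist))

The output is:

The Euclidean distance between the feature vectors of text 1 and text 2 is: [[ 5.19615242]]

The Euclidean distance between the feature vectors of text 1 and text 3 is: [[ 6.08276253]]

The Euclidean distance between the feature vectors of text 2 and text 3 is: [[ 6.164414]]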

As you can see, text 1 and text 2 are the most similar.

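Now consider which words should go into the bag. Some words carry little useful information, for example: the, be, you, he. These are called stop words. A text contains a very large number of distinct words, and every word in the bag is one dimension, so we want to reduce the dimensionality as much as possible and remove this noise to make computation and model fitting work better.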

Stop words can be removed by passing stop_words="english" when creating the CountVectorizer instance.
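
A short sketch of that built-in option, applied to the same three texts. The variable names are illustrative; scikit-learn's built-in English list differs from the NLTK list, so the remaining vocabulary need not match an NLTK-based run.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. "
    "We are good friends.",
    "My cousin has a cute dog. He likes sleeping and eating. "
    "He is friendly to others.",
    "We all need to make plans for the future, otherwise we will regret "
    "when we're old.",
]

# Baseline: every token of two or more characters becomes a feature.
baseline = CountVectorizer().fit(corpus)

# Same corpus, but with scikit-learn's built-in English stop word list.
cv = CountVectorizer(stop_words="english").fit(corpus)
features = sorted(cv.vocabulary_)

print(len(baseline.vocabulary_), len(features))
print(features)
```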

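Alternatively, you can install NLTK (Natural Language Toolkit) and use its stop word list.

Let's try it with NLTK (install it first with pip install nltk, and download the stop word corpus once with nltk.download("stopwords")):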

text1="I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

text2="My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

text3= "We all need to make plans for the future, otherwise we will regret when we're old."

corpus=[text1,text2,text3]

from nltk.corpus import stopwords

noise=stopwords.words("english")

from sklearn.feature_extraction.text import CountVectorizer

CV=CountVectorizer(stop_words=noise)

words=CV.fit_transform(corpus)

words_frequency=words.todense()

print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

print(words_frequency)

Output:

['cat', 'cousin', 'cute', 'dog', 'eating', 'friendly', 'friends', 'future', 'good', 'huzihu', 'likes', 'make', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 'really', 'regret', 'sleeping']

[[1 0 1 ..., 1 0 0]

 [0 1 1 ..., 0 0 1]

 [0 0 0 ..., 0 1 0]]

As you can see, the bag now contains fewer words. Checking words_frequency.shape shows that the dimension of the feature vectors has dropped from 37 to 21.

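There is another case to consider. The words friendly and friends in the texts have similar meanings and could be treated as one word. Because they were counted as two separate features above, text classification may be skewed. The solution is to apply stemming to the words and put the stems into the bag.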

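Below we use SnowballStemmer from NLTK to extract the stems (note: the words must first be extracted from the text with a regular expression, that is, tokenized, and only then stemmed, so we set CountVectorizer's tokenizer parameter to a function of our own):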

text1="I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

text2="My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

text3= "We all need to make plans for the future, otherwise we will regret when we're old."

corpus=[text1,text2,text3]

from nltk import RegexpTokenizer

from nltk.stem.snowball import SnowballStemmer

def stemming(tokens):

    stemmer=SnowballStemmer("english")  # Snowball stemmer for English

    stemmed=[stemmer.stem(each) for each in tokens]  # stem every token

    return stemmed

def tokenize(text):

    tokenizer=RegexpTokenizer(r'\w+')  # regular expression rule: runs of word characters

    tokens=tokenizer.tokenize(text)

    stems=stemming(tokens)

    return stems

from nltk.corpus import stopwords

noise=stopwords.words("english")

from sklearn.feature_extraction.text import CountVectorizer

CV=CountVectorizer(stop_words=noise,tokenizer=tokenize,lowercase=False)

words=CV.fit_transform(corpus)

words_frequency=words.todense()

print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

print(words_frequency)

Output:

['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']

[[1 0 1 ..., 1 0 0]

 [0 1 1 ..., 0 0 1]

 [0 0 0 ..., 0 1 0]]

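As you can see, friendly and friends both become friend after stemming. The word others becomes other after stemming, and since other is a stop word it is removed, so the bag-of-words feature vectors now have 19 dimensions.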
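
The individual stems mentioned here can be checked directly by calling the stemmer on single words. This is a small sketch; the word list is chosen to match the examples in this post.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Map a few words from the three texts to their stems.
words = ["friendly", "friends", "others", "sleeping", "eating", "likes"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Stems are not always dictionary words (the vocabulary above contains 'realli' and 'futur'); they only need to be consistent so that related word forms collapse onto the same feature.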

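Word inflection also needs attention. Examples include singular and plural ("foot" and "feet"), past tense and present participle ("understood" and "understanding"), and base form and past participle as used in the passive ("eat" and "eaten"). Each of these pairs should be treated as a single feature. The solution is lemmatization. It is not demonstrated here, but it can be done with WordNetLemmatizer from NLTK (from nltk.stem.wordnet import WordNetLemmatizer).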

For the difference between stemming and lemmatization, see: https://www.neilx.com/blog/?p=1425.

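Finally, long and short texts do not carry the same amount of information. A long text generally contains more keywords than a short one, so the counts need to be normalized: the number of times each word occurs is divided by the total number of words in that text. This is called the term frequency (note: the counts used earlier are absolute frequencies, whereas the term frequency here is a relative frequency).

Second, when classifying documents, a word that appears in every text tells us little that is useful for classification. Words that appear in many documents and words that appear in few should therefore be treated differently, because they are not equally important. The solution is to weight words by their inverse document frequency (IDF). IDF weights a word according to how many documents it appears in: the more documents contain the word, the lower its weight, and the fewer documents contain it, the higher its weight. The combination of the two is called TF-IDF (term frequency, inverse document frequency). It can be computed with sklearn's TfidfVectorizer.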
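
The following sketch reproduces TfidfVectorizer's default output by hand, to show what the weighting does. It relies on scikit-learn's documented defaults: raw counts as the term frequency, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. Note that this differs slightly from the length-normalized term frequency described above. The three toy documents are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Raw term counts, shape (n_documents, n_terms).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed IDF (scikit-learn default): ln((1 + n) / (1 + df)) + 1.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit Euclidean length.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with the library result computed with default settings.
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf, reference))
```

Terms that occur in two of the three documents (such as "the" or "sat") get a smaller IDF than terms unique to one document, so they contribute less to the distance between documents.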

Below, we replace CountVectorizer with TfidfVectorizer (keeping the stemming and stop word removal used earlier) and compute the similarity between the three texts again:

text1="I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

text2="My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

text3= "We all need to make plans for the future, otherwise we will regret when we're old."

corpus=[text1,text2,text3]

from nltk import RegexpTokenizer

from nltk.stem.snowball import SnowballStemmer

def stemming(tokens):

    stemmer=SnowballStemmer("english")  # Snowball stemmer for English

    stemmed=[stemmer.stem(each) for each in tokens]  # stem every token

    return stemmed

def tokenize(text):

    tokenizer=RegexpTokenizer(r'\w+')  # regular expression rule: runs of word characters

    tokens=tokenizer.tokenize(text)

    stems=stemming(tokens)

    return stems

from nltk.corpus import stopwords

noise=stopwords.words("english")

from sklearn.feature_extraction.text import TfidfVectorizer

CV=TfidfVectorizer(stop_words=noise,tokenizer=tokenize,lowercase=False)

words=CV.fit_transform(corpus)

words_frequency=words.todense()

print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

print(words_frequency)

from sklearn.metrics.pairwise import euclidean_distances

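for i,j in ([0,1],[0,2],[1,2]):

    dist=euclidean_distances(words[i],words[j])  # compare the sparse rows directly; newer scikit-learn versions reject the np.matrix returned by todense()

    print("The Euclidean distance between the feature vectors of text {} and text {} is: {}".format(i+1,j+1,dist))

Output:

['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']

[[ 0.30300252  0.          0.23044123 ...,  0.30300252  0.          0.        ]

 [ 0.          0.40301621  0.30650422 ...,  0.          0.          0.40301621]

 [ 0.          0.          0.         ...,  0.          0.37796447  0.        ]]

The Euclidean distance between the feature vectors of text 1 and text 2 is: [[ 1.25547312]]

The Euclidean distance between the feature vectors of text 1 and text 3 is: [[ 1.41421356]]

The Euclidean distance between the feature vectors of text 2 and text 3 is: [[ 1.41421356]]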

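As you can see, the feature values are no longer raw word counts but normalized TF-IDF weights. Text 1 and text 2 are still the closest pair. Text 3 shares no remaining terms with either of them, so both of its distances are 1.41421356, which is the square root of 2, the distance between two orthogonal unit-length vectors (TfidfVectorizer scales each row to unit length by default). Although the test texts are very short, the result shows that after these refinements the distances depend on shared content words rather than on common function words.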

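Drawbacks of the bag-of-words model:

1. It cannot capture word order or the relationships between words. For example, "Humans like cats." and "Cats like humans." have identical feature vectors.

2. It cannot capture negation. For example, "I will not eat noodles today." and "I will eat noodles today." have opposite meanings, yet their feature vectors are very similar.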

Some of these problems can be partly addressed with an N-gram model (pass the ngram_range parameter when creating the CountVectorizer instance in sklearn).
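
As a minimal sketch of that idea, the example below counts unigrams only and then unigrams plus bigrams for the word-order example given above. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Humans like cats.", "Cats like humans."]

# Unigrams only: both sentences contain the same three words.
unigrams = CountVectorizer().fit_transform(sentences).toarray()

# Unigrams and bigrams: pairs of adjacent words keep some word order.
bigram_cv = CountVectorizer(ngram_range=(1, 2))
with_bigrams = bigram_cv.fit_transform(sentences).toarray()

print((unigrams[0] == unigrams[1]).all())          # same vectors?
print(sorted(bigram_cv.vocabulary_))
print((with_bigrams[0] == with_bigrams[1]).all())  # same vectors?
```

With bigrams such as "humans like" and "cats like" in the vocabulary, the two sentences no longer map to the same vector, at the cost of a larger feature space.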
