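So far we have covered how to preprocess text: (1) using the NLTK text-processing library to split the sentences of a text into N-grams, with regular expressions to strip unwanted parts of a sentence; (2) removing stop words; and (3) general normalization such as case folding and stemming. In this section we look at how to count the words in a text and use those counts to gauge how important a word is within a particular document or across the whole corpus. The point of counting is to give each word or character a quantitative measure: once a word has an importance value, we can build other tasks on top of it, such as keyword search or judging whether a word is positive or negative. Earlier we represented the words of English text with one-hot encoding, but that approach has an obvious drawback: when the corpus is large, the one-hot vectors become very high-dimensional, which makes them expensive to compute with. We therefore want a better way to encode words, one that also reflects how important a word is to a text.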
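To make the dimensionality problem concrete, here is a minimal sketch of one-hot encoding on a single short sentence. The sentence and the whitespace tokenization are assumptions chosen for illustration, not the setup used in the earlier section.

```python
# Minimal one-hot sketch (illustrative only; whitespace tokenization is an assumption)
sentence = "the faster ray gets to the bus stop the faster ray gets to school"
tokens = sentence.split()
vocab = sorted(set(tokens))  # one vector slot per distinct word


def one_hot(word, vocab):
    """Return a list with a single 1 at the word's position in the vocabulary."""
    return [1 if word == v else 0 for v in vocab]


one_hot_rows = [one_hot(tok, vocab) for tok in tokens]
rows, cols = len(one_hot_rows), len(vocab)
print(rows, cols)  # number of tokens, number of distinct words
```

Every token costs a full row whose width equals the vocabulary size, so the table grows with both the length of the text and the number of distinct words.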


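The bag-of-words model (Bag of Words) counts the words in a text; put simply, it records how many times, or how frequently, each word occurs in a document. Why is this useful? The assumption is that a word that occurs often enough is important to the text, or carries what the text is trying to say. Note that this is only an assumption: some words occur very frequently yet convey almost nothing, such as "the" in English or "的" in Chinese. That is why the previous section covered normalization techniques: we need to remove uninformative words or characters so that what remains represents the meaning of the text as accurately as possible. The code below shows how to count the words in a text: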

# Count how many times each word occurs in a text (its frequency)
# Working assumption: the more often a word occurs, the better it reflects what the text is about
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter

sentence = """The faster Ray get to the bus stop, the faster and the faster Ray, can get to the school."""
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())
print("The tokens are: ", tokens)

bag_of_words = Counter(tokens)
print("The frequency of each word: ", bag_of_words)


The tokens are:  ['the', 'faster', 'ray', 'get', 'to', 'the', 'bus', 'stop', ',', 'the', 'faster', 'and', 'the', 'faster', 'ray', ',', 'can', 'get', 'to', 'the', 'school', '.']
The frequency of each word: Counter({'the': 5, 'faster': 3, 'ray': 2, 'get': 2, 'to': 2, ',': 2, 'bus': 1, 'stop': 1, 'and': 1, 'can': 1, 'school': 1, '.': 1})


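One concept to note here: the Python dictionary above (a Counter) gives the number of times each word occurs, which is the basis of what is usually called term frequency (TF). It also confirms the earlier point that a frequent word is not necessarily a useful one: in the bag of words above, "the" occurs most often but tells us nothing, so we can discard it. Among the remaining words, "faster" is the most frequent, followed by "ray" (which ties with "get" and "to" at two occurrences). Next, let's compute the frequency of "ray":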

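print(bag_of_words.most_common(4))   # most_common() on a Counter returns the 4 most frequent tokens

times_ray_appears = bag_of_words['ray']   # look up the count of "ray" in the bag of words
num_unique_words = len(bag_of_words)      # number of distinct tokens in the bag
# Note: this divides by the number of distinct tokens; TF is more commonly
# normalized by the total token count, len(tokens)
tf = times_ray_appears / num_unique_words
print(round(tf, 4))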
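For comparison, the sketch below computes the frequency of "ray" in two ways: divided by the total number of tokens, which is the more common definition of TF, and divided by the number of distinct tokens, as in the snippet above. The regex tokenizer is an assumption, a dependency-free stand-in for TreebankWordTokenizer that happens to split this particular sentence the same way.

```python
from collections import Counter
import re

sentence = ("The faster Ray get to the bus stop, the faster and the faster Ray, "
            "can get to the school.")

# Assumption: a simple regex tokenizer stands in for TreebankWordTokenizer
tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
counts = Counter(tokens)

tf_by_total = counts["ray"] / len(tokens)    # share of all tokens in the sentence
tf_by_unique = counts["ray"] / len(counts)   # normalization by distinct tokens
print(round(tf_by_total, 4), round(tf_by_unique, 4))
```

Dividing by the total token count keeps the values of one document summing to 1, which makes documents of different lengths easier to compare.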


'A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react\nagainst the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites\noften have a bridle to guide the face of the kite at the correct angle so the wind can lift it.\nA kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane\nfor launch, the tether meets the wing at a single point. A kite may have fixed or moving\nanchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing\nsets; even in technical kiting, though, a wing in the system is still often called the kite.\nThe lift that sustains the kite in flight is generated when air flows around the kite’s\nsurface, producing low pressure above and high pressure below the wings. The\ninteraction with the wind also generates horizontal drag along the direction of the wind.\nThe resultant force vector from the lift and drag force components is opposed by the\ntension of one or more of the lines or tethers to which the kite is attached. The anchor\npoint of the kite line may be static or moving (such as the towing of a kite by a running\nperson, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).\nThe same principles of fluid flow apply in liquids and kites are also used under water.\nA hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting\nsurface is called a kytoon.\nKites have a long and varied history and many different types are flown individually and\nat festivals worldwide. Kites may be flown for recreation, art or other practical uses.\nSport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites\nare multi-line steerable kites designed to generate large forces which can be used to\npower activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a\nnew trend snow kiting. Even Man-lifting kites have been made.'


from collections import Counter
from nltk.tokenize import TreebankWordTokenizer
import nltk

# kite_text is assumed to hold the kite passage shown above
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())
token_counts = Counter(tokens)

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)
print(kite_counts)

# Vectorizing: one TF value per distinct token, most frequent first
document_vector = []
doc_length = len(tokens)
for key, value in kite_counts.most_common():
    document_vector.append(round(value / doc_length, 4))
print(document_vector)


[0.0727, 0.0636, 0.0364, 0.0227, 0.0182, 0.0182, 0.0136, 0.0136, 0.0136, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045]
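A bare list of numbers like the one above no longer says which token each value belongs to. The sketch below keeps the token next to its TF. It uses a hypothetical mini-document and a hand-written stop-word set instead of the kite passage and the NLTK stop-word list, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical mini-document and stop-word set, for illustration only
tokens = "a kite is a tethered craft and a kite consists of wings".split()
stop_words = {"a", "is", "and", "of"}
content = [t for t in tokens if t not in stop_words]

counts = Counter(content)
# Pair each token with its TF so the vector stays interpretable
tf_pairs = [(tok, round(n / len(content), 4)) for tok, n in counts.most_common()]
print(tf_pairs)
```

Because the values are ordered by frequency rather than by a fixed vocabulary, vectors built this way from two different documents are not directly comparable.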


  • compute the TF of every token, rather than only the raw word counts of each document;
  • make sure all the vectors have the same dimensionality.


docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")

doc_tokens = []
for doc in docs:
    doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
print("Document tokens: ", doc_tokens)
print("Length of the first document: ", len(doc_tokens[0]))

all_doc_tokens = sum(doc_tokens, [])
print("Total tokens in the three documents: ", len(all_doc_tokens))

lexicon = sorted(set(all_doc_tokens))
print("Lexicon: ", lexicon)
print("Lexicon size: ", len(lexicon))


Document tokens:  [[',', '.', 'and', 'faster', 'faster', 'faster', 'get', 'got', 'harry', 'harry', 'home', 'store', 'the', 'the', 'the', 'to', 'would'], ['.', 'and', 'faster', 'hairy', 'harry', 'is', 'jill', 'than'], ['.', 'as', 'as', 'hairy', 'harry', 'is', 'jill', 'not']]
Length of the first document:  17
Total tokens in the three documents:  33
Lexicon:  [',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'store', 'than', 'the', 'to', 'would']
Lexicon size:  18


import copy
from collections import OrderedDict

# Template vector with one zero-valued slot per lexicon entry
zero_vector = OrderedDict((token, 0) for token in lexicon)
print(zero_vector, '\n\n')

doc_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)   # fresh copy so every document starts from zeros
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        # Normalized by lexicon size here; dividing by len(tokens) gives the per-document TF
        vec[key] = round(value / len(lexicon), 4)
    doc_vectors.append(vec)

for i, doc_vec in enumerate(doc_vectors):
    print("{} : {}".format(i + 1, doc_vec), '\n')


OrderedDict([(',', 0), ('.', 0), ('and', 0), ('as', 0), ('faster', 0), ('get', 0), ('got', 0), ('hairy', 0), ('harry', 0), ('home', 0), ('is', 0), ('jill', 0), ('not', 0), ('store', 0), ('than', 0), ('the', 0), ('to', 0), ('would', 0)]) 

1 : OrderedDict([(',', 0.0556), ('.', 0.0556), ('and', 0.0556), ('as', 0), ('faster', 0.1667), ('get', 0.0556), ('got', 0.0556), ('hairy', 0), ('harry', 0.1111), ('home', 0.0556), ('is', 0), ('jill', 0), ('not', 0), ('store', 0.0556), ('than', 0), ('the', 0.1667), ('to', 0.0556), ('would', 0.0556)]) 

2 : OrderedDict([(',', 0), ('.', 0.0556), ('and', 0.0556), ('as', 0), ('faster', 0.0556), ('get', 0), ('got', 0), ('hairy', 0.0556), ('harry', 0.0556), ('home', 0), ('is', 0.0556), ('jill', 0.0556), ('not', 0), ('store', 0), ('than', 0.0556), ('the', 0), ('to', 0), ('would', 0)]) 

3 : OrderedDict([(',', 0), ('.', 0.0556), ('and', 0), ('as', 0.1111), ('faster', 0), ('get', 0), ('got', 0), ('hairy', 0.0556), ('harry', 0.0556), ('home', 0), ('is', 0.0556), ('jill', 0.0556), ('not', 0.0556), ('store', 0), ('than', 0), ('the', 0), ('to', 0), ('would', 0)]) 

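In this process we first create a zero vector and then write the computed values into it, which gives us vectors of the same dimensionality. As the output shows, some tokens have a TF of 0, because the documents do not all contain the same words: a word present in one may be absent from another. That is how text is vectorized and how a bag-of-words model is built. This section covered how to represent the words of a text as vectors and gave a first look at TF. The idea of TF is simple, but it is tightly bound to the TF-IDF computation that comes later, so it needs to be stated clearly. The next two sections (3-2 and 3-3) look at the role Zipf's Law plays in NLP, which leads into TF-IDF, and briefly introduce how topic modeling works.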
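The zero-vector-then-fill procedure described above can be wrapped in a single helper. The sketch below is one possible arrangement, not the article's own code: the names tokenize and tf_vectors are hypothetical, the regex tokenizer is a dependency-free stand-in for TreebankWordTokenizer, and each count is divided by the length of its own document instead of by the lexicon size.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: regex tokenizer used in place of TreebankWordTokenizer
    return re.findall(r"\w+|[^\w\s]", text.lower())


def tf_vectors(docs):
    """Return (lexicon, vectors); every vector has one slot per lexicon entry."""
    tokenized = [tokenize(d) for d in docs]
    lexicon = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # a Counter returns 0 for tokens it has not seen
        # Normalize by this document's own token count
        vectors.append([counts[tok] / len(toks) for tok in lexicon])
    return lexicon, vectors


docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]
lexicon, vectors = tf_vectors(docs)
print(len(lexicon), [len(v) for v in vectors])
```

Since every vector is indexed by the same sorted lexicon, position i refers to the same token in all three documents, which is what makes the vectors comparable.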

