Python Text Processing: NLTK Basics

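Natural language processing converts natural language into data a computer can handle; computers work with vectors and matrices.

NLTK is a natural language processing library that ships with corpora, part-of-speech tagging, classification, tokenization, and more.

There are also simpler wrappers, such as TextBlob.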

import nltk
nltk.download()  # download corpora and other resources

# Built-in corpora
from nltk.corpus import brown
brown.categories()
len(brown.sents())  # number of sentences
len(brown.words())  # number of words

I. A simple text preprocessing pipeline

1. Tokenization: split a long sentence into small, meaningful units.

sentence = "hello world"
nltk.word_tokenize(sentence)

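NLTK's word tokenizer does not work for Chinese. English words are separated by spaces, whereas Chinese has no such delimiter, and splitting it into single characters is not meaningful. For example, 今天天气不错！ should be segmented as 今天/天气/不错/！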

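Chinese word segmentation follows two main approaches:

1. Heuristic: for example, longest match against a dictionary used as the lexicon. The dictionary contains 今天 but nothing as long as 今天天, so 今天 is taken as one word.

2. Machine learning / statistical methods: HMM, CRF (e.g. Stanford CoreNLP).

For Chinese word segmentation there is jieba (结巴).

Once the text is segmented, pass the tokens on to NLTK.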
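The longest-match heuristic described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `forward_max_match`, the `max_len` parameter, and the three-word lexicon are made up for the example, and real segmenters such as jieba are considerably more sophisticated.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
lexicon = {"今天", "天气", "不错"}
segmented = forward_max_match("今天天气不错！", lexicon)
print("/".join(segmented))
```

Because 今天天 is not in the lexicon, the candidate shrinks to 今天, which is the behaviour the heuristic relies on.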

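Text from social networks contains emoticons, URLs, #hashtags, and @mentions, so it needs to be preprocessed with regular expressions before tokenizing.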
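As a rough sketch of such preprocessing, the pattern below keeps URLs, @mentions, #hashtags, and a few emoticons as single tokens. The pattern, the name `tokenize_social`, and the sample text are assumptions made for this example; the pattern is deliberately small and far from exhaustive.

```python
import re

# Illustrative patterns, not an exhaustive treatment of social-media text.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+          # URLs
    | [@#]\w+             # @mentions and #hashtags
    | [:;=][-o]?[)(DPp]   # a few western-style emoticons
    | \w+(?:'\w+)?        # ordinary words, optionally with an apostrophe
    | [^\w\s]             # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize_social(text):
    """Split text into tokens, keeping URLs, mentions, hashtags and emoticons whole."""
    return TOKEN_PATTERN.findall(text)


tweet = "@alice loved it :) #nlp http://example.com/x"
tokens = tokenize_social(tweet)
print(tokens)
```

The order of the alternatives matters: the more specific patterns (URL, mention, emoticon) come before the generic word and symbol patterns, so they get the first chance to match.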

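2. Part-of-speech tagging: nltk.pos_tag(text)  # text is the list of tokens; the tag is the part of speech a word has in the sentence, e.g. adj, adv, det (words like the, a)

3. Stemming: reduce a word to its stem, e.g. walking to walk.

Lemmatization (given a POS tag): normalize a word form according to its part of speech, e.g. is, am, are become be, and went becomes go.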
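To make the difference concrete, here is a toy sketch: stemming as blind suffix stripping, lemmatization as a lookup that depends on the part of speech. Neither function is NLTK's algorithm; `toy_stem`, `toy_lemmatize`, and `LEMMA_TABLE` are hypothetical names, and the rules are far cruder than a real stemmer or lemmatizer.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech); "v" marks a verb.
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("went", "v"): "go",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)


stems = [toy_stem(w) for w in ["walking", "walked", "walks", "went"]]
lemmas = [toy_lemmatize(w, "v") for w in ["is", "are", "went", "walk"]]
print(stems)
print(lemmas)
```

Suffix stripping cannot relate went to go, which is why irregular forms need lemmatization backed by a dictionary.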

4. Stop words: words such as he and the that carry little meaning are simply removed.

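from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
filtered = [word for word in word_list if word not in stop_words]  # word_list: the tokenized words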

[Figure 1: the preprocessing pipeline]

[Figure 2: example sentence "life is like a box of chocolate"]

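II. Vectorization

Classic NLP applications of NLTK: 1. sentiment analysis; 2. text similarity; 3. text classification (the most widely used, e.g. news classification).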

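1. Sentiment analysis:

The simplest approach is a sentiment dictionary.

The dictionary gives each word a polarity score, e.g. like: 1, good: 2, bad: -2, terrible: -3. Score every word in a sentence, add the scores up, and read the sentiment from the sign of the total.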

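sentiment_dictionary = {}
for line in open('*.txt'):  # '*.txt' stands for the dictionary file, one "word<TAB>score" per line
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)
# words: the tokenized sentence; a word in the dictionary contributes its score, any other word 0
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)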
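For a version that runs without a dictionary file, the sketch below hard-codes the four example scores mentioned earlier. The helper name `sentence_score` and the two test sentences are assumptions for illustration; a real lexicon would be far larger.

```python
# Scores taken from the example in the text; a real lexicon would be loaded from a file.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}


def sentence_score(sentence):
    """Sum the scores of all words; words missing from the dictionary count as 0."""
    words = sentence.lower().split()
    return sum(sentiment_dictionary.get(word, 0) for word in words)


positive = sentence_score("this is a good book")           # only "good" scores
negative = sentence_score("bad plot and terrible acting")  # "bad" and "terrible" score
print(positive, negative)
```

A positive total suggests positive sentiment and a negative total the opposite; a total of 0 means the dictionary had nothing to say about the sentence.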

# Some criticism is disguised (a hater posing as a fan), which a dictionary misses, so combine it with ML
from nltk.classify import NaiveBayesClassifier
# A small, hand-made training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'
def preprocess(s):
    # Split the sentence into words with split(); tokenize is not needed for such a simple example.
    # The result has the form {fname: fval}. True is the simplest fval; a more advanced
    # setup could derive fval from word2vec.
    return {word: True for word in s.lower().split()}
# Training: over the word list (this, is, terrible, good, awesome, bad, book),
# s1 corresponds to the vector (1, 1, 0, 1, 0, 0, 1)

training_data = [(preprocess(s1), 'pos'),
                 (preprocess(s2), 'pos'),
                 (preprocess(s3), 'neg'),
                 (preprocess(s4), 'neg')]
model = NaiveBayesClassifier.train(training_data)
print(model.classify(preprocess('this is a good book')))

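2. Text similarity

Turn each text into a vector of the same length and measure similarity with cosine similarity.

NLTK's FreqDist counts how often each word occurs.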
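A minimal sketch of this idea follows. It uses `collections.Counter` in place of FreqDist so that it runs with the standard library alone; the helper names `to_vector` and `cosine_similarity` and the two sample sentences are assumptions for the example.

```python
import math
from collections import Counter


def to_vector(tokens, vocab):
    """Count vector of `tokens` over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


doc_a = "this is a good book".split()
doc_b = "this is a bad book".split()
vocab = sorted(set(doc_a) | set(doc_b))  # shared vocabulary fixes the vector length
vec_a = to_vector(doc_a, vocab)
vec_b = to_vector(doc_b, vocab)
similarity = cosine_similarity(vec_a, vec_b)
print(vocab, vec_a, vec_b, round(similarity, 3))
```

The two sentences share four of their five words, so the similarity comes out at about 0.8 even though their sentiment is opposite. Raw counts measure word overlap, not meaning.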

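3. Text classification

TF-IDF

TF, Term Frequency: how frequently a term occurs in a document.

TF(t) = (number of times t occurs in the document) / (total number of terms in the document)

IDF, Inverse Document Frequency: measures how important a term is; words such as is and the are not important.

It raises the weight of rare terms.

IDF(t) = log_e(total number of documents / number of documents containing t)

TF-IDF = TF × IDF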
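The formulas above translate directly into plain Python. The sketch below is an illustration only: the function names and the three tiny documents are assumptions, and returning 0 for a term that appears in no document is a convention chosen here, not part of the formula.

```python
import math

documents = [
    "this is sentence one".split(),
    "this is sentence two".split(),
    "this is sentence three".split(),
]


def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # convention chosen here for unseen terms
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


common = tf_idf("this", documents[0], documents)  # term found in every document
rare = tf_idf("one", documents[0], documents)     # term found in one document only
print(common, rare)
```

A term that occurs in every document gets IDF = log(3/3) = 0 and therefore a TF-IDF of 0, while a term unique to one document gets the highest weight.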

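from nltk.text import TextCollection
# First, put all the documents into a TextCollection.
# Each document is given as a list of tokens; the class then does the counting and the calculations.
documents = ['this is sentence one'.split(),
    'this is sentence two'.split(),
    'this is sentence three'.split()]
corpus = TextCollection(documents)
# The tf-idf can then be computed directly
# (term: a term in a sentence, text: the tokenized sentence)
print(corpus.tf_idf('this', 'this is sentence four'.split()))
# 'this' occurs in all three documents, so IDF = log(3/3) = 0 and the result is 0.0
# Likewise, how do we get a vector of a standard size to represent any sentence?
# Build the vocabulary from the collection
standard_vocab = sorted({word for doc in documents for word in doc})
# For each new sentence
new_sentence = 'this is sentence five'.split()
# Loop over every word in the vocabulary:
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
# This yields one long vector (length = size of the vocabulary)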