The Road to Machine Learning: Text Feature Extraction with Python NLTK

Git: https://github.com/linyi0604/MachineLearning

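This post extracts text features in two ways: with the bag-of-words method, and with the tools provided by the NLTK natural language processing package.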

 from sklearn.feature_extraction.text import CountVectorizer

 import nltk

 # nltk.download("punkt")

 # nltk.download('averaged_perceptron_tagger')

 '''

 Text feature extraction using the bag-of-words method and the NLTK natural language processing package, respectively

 '''

 sent1 = "The cat is walking in the bedroom."

 sent2 = "A dog was running across the kitchen."

 # Use the bag-of-words method to convert the text into feature vectors

 count_vec = CountVectorizer()

 sentences = [sent1, sent2]

 # Print the resulting feature vectors

 # print(count_vec.fit_transform(sentences).toarray())

 '''

 [[0 1 1 0 1 1 0 0 2 1 0]

  [1 0 0 1 0 0 1 1 1 0 1]]

 '''

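 # Print the meaning of each feature (the vocabulary term behind each column)

 # print(count_vec.get_feature_names())  # in scikit-learn 1.0+, use get_feature_names_out() instead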

 '''

 ['across', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']

 '''
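
The matrix and feature names above can be reproduced by hand, which also shows why 'a' is missing from the vocabulary. The following is a minimal sketch, assuming CountVectorizer's default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters). The helper name `tokenize` is hypothetical and is not part of scikit-learn.

```python
import re
from collections import Counter

sentences = [
    "The cat is walking in the bedroom.",
    "A dog was running across the kitchen.",
]


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens of 2+ word
    # characters, mirroring CountVectorizer's default behavior.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(s) for s in sentences]

# The vocabulary is the sorted set of all distinct tokens across sentences.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Each sentence becomes a vector of per-term counts, one column per term.
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```

The single-character token 'a' is dropped by the two-character minimum, and 'The' and 'the' are merged by lowercasing, which is why the column for 'the' holds 2 in the first row.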

 # Analyze the text with NLTK

 # Tokenize and normalize each sentence: aren't is split into are and n't, and I'm into I and 'm

 tokens1 = nltk.word_tokenize(sent1)

 tokens2 = nltk.word_tokenize(sent2)

 # print(tokens1)

 # print(tokens2)

 '''

 ['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']

 ['A', 'dog', 'was', 'running', 'across', 'the', 'kitchen', '.']

 '''
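
The contraction splitting mentioned in the comment above can be illustrated without NLTK. This is only a rough regex approximation, not NLTK's actual tokenizer; the function name `rough_tokenize` and its three rules are assumptions made for illustration.

```python
import re


def rough_tokenize(text):
    # Hypothetical approximation of contraction-aware tokenization.
    # 1) Split off the n't clitic: "aren't" -> "are n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # 2) Split off 'm, 's, 'd, 're, 've, 'll: "I'm" -> "I 'm"
    text = re.sub(r"(\w)('(?:m|s|d|re|ve|ll))\b", r"\1 \2", text)
    # 3) Surround common punctuation with spaces so it becomes its own token
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()


print(rough_tokenize("I'm sure they aren't home."))
print(rough_tokenize("The cat is walking in the bedroom."))
```

For real text, `nltk.word_tokenize` handles many more cases (quotes, abbreviations, ellipses) than these three rules do.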

 # Build the vocabulary, sorted in ASCII order

 vocab_1 = sorted(set(tokens1))

 vocab_2 = sorted(set(tokens2))

 # print(vocab_1)

 # print(vocab_2)

 '''

 ['.', 'The', 'bedroom', 'cat', 'in', 'is', 'the', 'walking']

 ['.', 'A', 'across', 'dog', 'kitchen', 'running', 'the', 'was']

 '''
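
Because the sort is by character code, punctuation comes first and uppercase letters precede lowercase ones, so 'The' and 'the' are kept as two separate vocabulary entries. The sketch below starts from the token list printed earlier and shows one possible way to fold case and drop punctuation. The name `vocab_folded` is hypothetical, and whether to fold case depends on the task.

```python
# Token list copied from the word_tokenize output shown earlier.
tokens1 = ['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']

# Case-sensitive vocabulary: '.' < 'T' < lowercase letters in ASCII order.
vocab_cased = sorted(set(tokens1))

# Case-folded vocabulary without punctuation (an illustrative variant).
vocab_folded = sorted({t.lower() for t in tokens1 if t.isalpha()})

print(vocab_cased)
print(vocab_folded)
```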

 # Initialize the stemmer, which reduces each word to its stem

 stemmer = nltk.stem.PorterStemmer()

 stem_1 = [stemmer.stem(t) for t in tokens1]

 stem_2 = [stemmer.stem(t) for t in tokens2]

 # print(stem_1)

 # print(stem_2)

 '''

 ['the', 'cat', 'is', 'walk', 'in', 'the', 'bedroom', '.']

 ['A', 'dog', 'wa', 'run', 'across', 'the', 'kitchen', '.']

 '''

 # Use the part-of-speech tagger to label each token with its part of speech

 pos_tag_1 = nltk.tag.pos_tag(tokens1)

 pos_tag_2 = nltk.tag.pos_tag(tokens2)

 # print(pos_tag_1)

 # print(pos_tag_2)

 '''

 [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]

 [('A', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('running', 'VBG'), ('across', 'IN'), ('the', 'DT'), ('kitchen', 'NN'), ('.', '.')]

 '''
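
The tagged pairs can be turned into simple features, for example by counting tags or selecting words by tag prefix. The sketch below is one possible approach. It hardcodes the tagger output printed above so that it runs without NLTK, and the variable names `tag_counts`, `nouns`, and `verbs` are hypothetical.

```python
from collections import Counter

# Tagger output for the first sentence, copied from the result shown above.
pos_tag_1 = [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'),
             ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]

# Count how often each part-of-speech tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pos_tag_1)

# Penn Treebank noun tags start with 'NN' and verb tags start with 'VB'.
nouns = [word for word, tag in pos_tag_1 if tag.startswith('NN')]
verbs = [word for word, tag in pos_tag_1 if tag.startswith('VB')]

print(tag_counts)
print(nouns)
print(verbs)
```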
