自然语言15_Part of Speech Tagging with NLTK
sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频教程)
https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share
https://www.pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/?completed=/stemming-nltk-tutorial/
# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016 @author: daxiong
"""
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer #训练数据
train_text=state_union.raw("2005-GWBush.txt")
#测试数据
sample_text=state_union.raw("2006-GWBush.txt")
'''
Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain.
The pre-packaged models may therefore be unsuitable:
use PunktSentenceTokenizer(text) to learn parameters from the given text
'''
#我们现在训练punkttokenizer(分句器)
custom_sent_tokenizer=PunktSentenceTokenizer(train_text)
#训练后,我们可以使用punkttokenizer(分句器)
tokenized=custom_sent_tokenizer.tokenize(sample_text) '''
nltk.pos_tag(["fire"]) #pos_tag(列表)
Out[19]: [('fire', 'NN')]
''' #文本词性标记函数
def process_content():
try:
for i in tokenized[0:5]:
words=nltk.word_tokenize(i)
tagged=nltk.pos_tag(words)
print(tagged)
except Exception as e:
print(str(e)) process_content()
One of the more powerful aspects of the NLTK module is the Part of Speech tagging that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs...etc. Even more impressive, it also labels by tense, and more. Here's a list of the tags, what they mean, and some examples:
POS tag list: CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-abverb where, when
How might we use this? While we're at it, we're going to cover a new sentence tokenizer, called the PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use. First, let's get some imports out of the way that we're going to use:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
Now, let's create our training and testing data:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
One is a State of the Union address from 2005, and the other is from 2006 from past President George W. Bush.
Next, we can train the Punkt tokenizer like:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
Then we can actually tokenize, using:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
Now we can finish up this part of speech tagging script by creating a function that will run through and tag all of the parts of speech per sentence like so:
def process_content():
try:
for i in tokenized[:5]:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
print(tagged) except Exception as e:
print(str(e)) process_content()
The output should be a list of tuples, where the first element in the tuple is the word, and the second is the part of speech tag. It should look like:
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'NNP'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'DT'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')] [('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NNS'), (',', ','), ('distinguished', 'VBD'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'NN'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBN'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')] [('Tonight', 'NNP'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'NN'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'NN'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')] [('(', 'NN'), ('Applause', 'NNP'), ('.', '.'), (')', ':')] [('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
At this point, we can begin to derive meaning, but there is still some work to do. The next topic that we're going to cover is chunking, which is where we group words, based on their parts of speech, into hopefully meaningful groups.
自然语言15_Part of Speech Tagging with NLTK的更多相关文章
- 自然语言15.1_Part of Speech Tagging 词性标注
QQ:231469242 欢迎喜欢nltk朋友交流 https://en.wikipedia.org/wiki/Part-of-speech_tagging In corpus linguistics ...
- 自然语言12_Tokenizing Words and Sentences with NLTK
https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ # -*- coding: utf-8 -*- ...
- 词性标注 parts of speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging ...
- 自然语言处理NLP程序包(NLTK/spaCy)使用总结
NLTK和SpaCy是NLP的Python应用,提供了一些现成的处理工具和数据接口.下面介绍它们的一些常用功能和特性,便于对NLP研究的组成形式有一个基本的了解. NLTK Natural Langu ...
- 自然语言27_Converting words to Features with NLTK
https://www.pythonprogramming.net/words-as-features-nltk-tutorial/ Converting words to Features with ...
- 自然语言18.1_Named Entity Recognition with NLTK
QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...
- Part of Speech Tagging
Natural Language Processing with Python Charpter 6.1 suffix_fdist处代码稍微改动. import nltk from nltk.corp ...
- 自然语言14_Stemming words with NLTK
https://www.pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/ # -*- ...
- python and 我爱自然语言处理
曾经因为NLTK的 缘故开始学习Python,之后渐渐成为我工作中的第一辅助脚本语言,虽然开发语言是C/C++,但平时的很多文本数据处理任务都交给了Python.离 开腾讯创业后,第一个作品课程图谱也 ...
随机推荐
- nginx中获取真实ip
nginx反向代理配置时,一般会添加下面的配置: proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; ...
- Beta项目冲刺 --第五天
忙里偷得半日闲-- 队伍:F4 成员:031302301 毕容甲 031302302 蔡逸轩 031302430 肖阳 031302418 黄彦宁 会议内容: 1.站立式会议照片: 2.项目燃尽图 3 ...
- 1117Mysql prepare预处理语句
转自http://www.jb51.net/article/81378.htm 综述:一般用来拼凑SQL然后执行 MySQL 5.1对服务器一方的预制语句提供支持.如果您使用合适的客户端编程界面,则这 ...
- android -- 之PopupWindow的使用
LayoutInflater inflater = (LayoutInflater) getSystemService(LAYOUT_INFLATER_SERVICE); View contentVi ...
- 开发错误记录5:Failed to resolve: com
今在导入项目时报:Failed to resolve: com.android.support:appcompat-v7:23.1.1包! 一.按F12查看包引用情况 v7包版本不一样,环境中只有co ...
- 【转】HTTP协议详解
原文地址:http://www.cnblogs.com/EricaMIN1987_IT/p/3837436.html 一.概念 协议是指计算机通信网络中两台计算机之间进行通信所必须共同遵守的规定或规则 ...
- [转]响应式WEB设计学习(2)—视频能够做成响应式吗
原文地址:http://www.jb51.net/web/70361.html 上集回顾: 昨天讲了页面如何根据不同的设备尺寸做出响应.主要是利用了@media命令以及尺寸百分比化这两招. 上集补充: ...
- mysql之路【第三篇】
1,查看表的结构 desc 表名; 查看表的详细结构 show create table; show create table \G; (加上G格式化输出), 2,修改表 2. ...
- 2010-2014总结 ____V_V____ hello-world
.caret,.dropup>.btn>.caret{border-top-color:#000!important}.label{border:1px solid #000}.table ...
- mysql-模拟全连接处理
方案:通过union连接查询出所有需要的特殊标签,然后在通过left join与union中的结果集做多表比较. sql select t.`code`,a.`count` as count_a,b. ...