在分析英文文本时,我们可能会关心文本当中每个词语的词性和在句中起到的作用。识别文本中各个单词词性的过程,可以称为词性标注。

英语主要的八种词性分别为:

1、名词(noun)

2、代词(pronoun)

3、动词(verb)

4、形容词(adjective)

5、副词(adverb)

6、介词(preposition)

7、连词(conjunction)

8、感叹词(interjection)

其他还包括数词(numeral)和冠词(article)等。

在使用第三方工具(如NLTK)进行词性标注时,返回的结果信息量可能比上述八种词性要丰富一些。比如NLTK,其所标注的词性可以参考Penn Treebank Project给出的pos tagset,如下图:

举例来说,我们使用NLTK对一段英文进行词性标注:

这段英文摘自19年3月13日华盛顿邮报有关加拿大停飞波音737客机相关报道,段落的原文为:

After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals.

我们对该段落进行断句,然后对每句话进行分词,再对每个词语进行词性标注,然后循环打印每句话中每个词的词性标注结果,具体代码如下:

 import nltk
passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""
sentences = nltk.sent_tokenize( passage )
for sent in sentences:
tokens = nltk.word_tokenize( sent )
posTags = nltk.pos_tag( tokens )
print( posTags )

代码的print()函数打印的内容如下:

[('After', 'IN'), ('the', 'DT'), ('Lion', 'NNP'), ('Air', 'NNP'), ('crash', 'NN'), (',', ','), ('questions', 'NNS'), ('were', 'VBD'), ('raised', 'VBN'), (',', ','), ('so', 'IN'), ('Boeing', 'NNP'), ('sent', 'VBD'), ('further', 'JJ'), ('instructions', 'NNS'), ('that', 'IN'), ('it', 'PRP'), ('said', 'VBD'), ('pilots', 'NNS'), ('should', 'MD'), ('know', 'VB'), (',', ','), ('”', 'FW'), ('he', 'PRP'), ('said', 'VBD'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('the', 'DT'), ('Associated', 'NNP'), ('Press', 'NNP'), ('.', '.')]
[('“Those', 'JJ'), ('relate', 'NN'), ('to', 'TO'), ('the', 'DT'), ('specific', 'JJ'), ('behavior', 'NN'), ('of', 'IN'), ('this', 'DT'), ('specific', 'JJ'), ('type', 'NN'), ('of', 'IN'), ('aircraft', 'NN'), ('.', '.')]
[('As', 'IN'), ('a', 'DT'), ('result', 'NN'), (',', ','), ('training', 'NN'), ('was', 'VBD'), ('given', 'VBN'), ('by', 'IN'), ('Boeing', 'NNP'), (',', ','), ('and', 'CC'), ('our', 'PRP$'), ('pilots', 'NNS'), ('have', 'VBP'), ('taken', 'VBN'), ('it', 'PRP'), ('and', 'CC'), ('put', 'VB'), ('it', 'PRP'), ('into', 'IN'), ('our', 'PRP$'), ('manuals', 'NNS'), ('.', '.')]

如何看懂上面的输出结果:段落中的每句话为一个list,每句话中的每个词及其词性表示为一个tuple,左边为单词本身,右边为词性缩写,这些缩写的具体含义可以查找Penn Treebank Pos Tags表格。

我们对代码稍微修改一下,以便使结果呈现更清楚一些,而不至于看的太费力,如下:

 import nltk
passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""
sentences = nltk.sent_tokenize( passage )
for sent in sentences:
tokens = nltk.word_tokenize( sent )
posTags = nltk.pos_tag( tokens )
for tag in posTags:
print( "{}({}) ".format( tag[0], tag[1] ), end = "" )

输出结果如下(标注的词性以括号形式紧跟在每个单词右侧):

After(IN) the(DT) Lion(NNP) Air(NNP) crash(NN) ,(,) questions(NNS) were(VBD) raised(VBN) ,(,) so(IN) Boeing(NNP) sent(VBD) further(JJ) instructions(NNS) that(IN) it(PRP) said(VBD) pilots(NNS) should(MD) know(VB) ,(,) ”(FW) he(PRP) said(VBD) ,(,) according(VBG) to(TO) the(DT) Associated(NNP) Press(NNP) .(.) “Those(JJ) relate(NN) to(TO) the(DT) specific(JJ) behavior(NN) of(IN) this(DT) specific(JJ) type(NN) of(IN) aircraft(NN) .(.) As(IN) a(DT) result(NN) ,(,) training(NN) was(VBD) given(VBN) by(IN) Boeing(NNP) ,(,) and(CC) our(PRP$) pilots(NNS) have(VBP) taken(VBN) it(PRP) and(CC) put(VB) it(PRP) into(IN) our(PRP$) manuals(NNS) .(.) 

参考文献:

1、https://en.wikipedia.org/wiki/Part_of_speech

2、https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

3、https://www.washingtonpost.com/local/trafficandcommuting/canada-grounds-boeing-737-max-8-leaving-us-as-last-major-user-of-plane/2019/03/13/25ac2414-459d-11e9-90f0-0ccfeec87a61_story.html?utm_term=.f359a714d4d8

POS Tagging 标签类型查询表(Penn Treebank Project)的更多相关文章

  1. html <input>标签类型属性type(file、text、radio、hidden等)详细介绍

    html <input>标签类型属性type(file.text.radio.hidden等)详细介绍 转载请注明:文章转载自:[169IT-最新最全的IT资讯] html <inp ...

  2. 第五篇、HTML标签类型

    <!--1.块级标签 独占一行,可以设置高度和宽度 如:div p h ul li  -----display: none(隐藏标签) block(让行内标签变块级标签) inline(让块级标 ...

  3. POS tagging的解釋

    轉錄文章~~ 什么是词性标注(POS tagging) Tue, 04/13/2010 - 10:36 — Fuller 词性标注也叫词类标注,POS tagging是part-of-speech t ...

  4. 什么是词性标注(POS tagging)

    词性标注也叫词类标注,POS tagging是part-of-speech tagging的缩写. 维基百科对POS Tagging的定义: In corpus linguistics, part-o ...

  5. CSS标签类型和样式表继承与优先级

    标签类型 块级标签 什么是块级标签:在html中<div>. <p>.h1~h6.<form>.<ul> 和 <li>就是块级元素 块级标签 ...

  6. Penn Treebank

    NLP中常用的PTB语料库,全名Penn Treebank.Penn Treebank是一个项目的名称,项目目的是对语料进行标注,标注内容包括词性标注以及句法分析. 语料来源为:1989年华尔街日报语 ...

  7. 创建标签的两种方法insertAdjacentHTML 和 createElement 创建标签 setAttribute 赋予标签类型 appendChild 插入标签

    1. 建立字符串和insertAdjacentHTML('beforeEnd', ) 2. 通过createElement 创建标签  setAttribute 赋予标签类型 appendChild ...

  8. NFC 标签类型

    NFC 标签类型 Type 1:Type 1 Tag is based on ISO/IEC 14443A. This tag type is read and re-write capable. T ...

  9. 伪类的格式重点:父标签层级 & 当前标签类型

    伪类的格式重点:父标签层级 & 当前标签类型 通过例子说明: css1: span:nth-of-type(2){color: red;} css2: span :nth-of-type(2) ...

随机推荐

  1. ECMAScript 6 入门 ----Generator 函数

    本文转自:阮一峰老师的ECMAScript 6 入门,有时间可以看下评论! Generator 函数 简介 基本概念 Generator函数是ES6提供的一种异步编程解决方案,语法行为与传统函数完全不 ...

  2. log4j的配置与使用

    配置log4j的步骤如下: 1.导入jar包 如log4j-1.2.15.jar 2.在src下添加log4j.properties 使用时把下面内容中的注释去掉: //日志级别及位置 log4j.r ...

  3. multiWriter.go

    package blog4go import ( "errors" "fmt" ) var ( // ErrFilePathNotFound 文件路径找不到 E ...

  4. Java RESTful 框架的性能比较

    来源:鸟窝, colobu.com/2015/11/17/Jax-RS-Performance-Comparison/ 如有好文章投稿,请点击 → 这里了解详情 在微服务流行的今天,我们会从纵向和横向 ...

  5. BZOJ_4653_[Noi2016]区间_线段树+离散化+双指针

    BZOJ_4653_[Noi2016]区间_线段树+离散化+双指针 Description 在数轴上有 n个闭区间 [l1,r1],[l2,r2],...,[ln,rn].现在要从中选出 m 个区间, ...

  6. BZOJ_4269_再见Xor_线性基

    BZOJ_4269_再见Xor_线性基 Description 给定N个数,你可以在这些数中任意选一些数出来,每个数可以选任意多次,试求出你能选出的数的异或和的最大值和严格次大值. Input 第一行 ...

  7. 英国毕业原版-《伯明翰大学毕业证书》UoB一模一样原件

    ☞伯明翰大学毕业证书[微/Q:865121257◆WeChat:CC6669834]UC毕业证书/联系人Alice[查看点击百度快照查看][留信网学历认证&博士&硕士&海归&a ...

  8. hystrix隔离策略(4)

    hystrix提供了两种隔离策略:线程池隔离和信号量隔离.hystrix默认采用线程池隔离. 1.线程池隔离 不同服务通过使用不同线程池,彼此间将不受影响,达到隔离效果. 例如: 我们可以通过andT ...

  9. java 理解如何实现图片验证码 傻瓜都能看懂。

    先代码后解释: 只要把代码复制到你的项目中就可以了. 代码: 验证码工具类: package cn.happy.util.imagesVerTion; /** * Author: SamGroves ...

  10. 【坑】解决CentOS 7.1版本以上安装好zabbix 3.4 无法重启zabbix-server的问题

    1. 问题所在 报错信息:zabbix_server[]: segfault at ip 00007f78842b4bd0 sp 00007fff1995a818 error ] 2. 产生原因 Ce ...