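Python natural language processing (2): accessing text corpora and lexical resources

I. Accessing text corpora

A text corpus is a large body of text. It usually contains many individual texts, but for convenience we join them end to end and treat them as a single text.

1. The Gutenberg corpus

nltk includes a small selection of texts from the Project Gutenberg electronic text archive. To use the corpus, load the nltk package in the Python interpreter and call nltk.corpus.gutenberg.fileids(). For example: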

 >>> import nltk

 >>> nltk.corpus.gutenberg.fileids()

 ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt'

 , 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-a

 lice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.t

 xt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', '

 shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'w

 hitman-leaves.txt']

 >>>

The result lists the texts from this corpus that ship with nltk. Any of them can be processed further.

1) Counting words. Example:

 >>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')

 >>> len(emma)

 192427

 >>>

2) Building a concordance for a text. Example:

 >>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

 >>> emma.concordance("surprise")

 Displaying 1 of 1 matches:

  that Emma could not but feel some surprise , and a little displeasure , on he

 >>>

3) Getting the raw characters, words, and sentences of a text. Example:

 >>> from nltk.corpus import gutenberg

 >>> for fileid in gutenberg.fileids():

 ...     raw = gutenberg.raw(fileid)

 ...     num_chars = len(raw)

 ...     words = gutenberg.words(fileid)

 ...     num_words = len(words)

 ...     sents = gutenberg.sents(fileid)

 ...     num_sents = len(sents)

 ...     vocab = set([w.lower() for w in gutenberg.words(fileid)])

 ...     num_vocab = len(vocab)

 ...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))

 ...

 887071 192427 7752 austen-emma.txt

 466292 98171 3747 austen-persuasion.txt

 673022 141576 4999 austen-sense.txt

 4332554 1010654 30103 bible-kjv.txt

 38153 8354 438 blake-poems.txt

 249439 55563 2863 bryant-stories.txt

 84663 18963 1054 burgess-busterbrown.txt

 144395 34110 1703 carroll-alice.txt

 457450 96996 4779 chesterton-ball.txt

 406629 86063 3806 chesterton-brown.txt

 320525 69213 3742 chesterton-thursday.txt

 935158 210663 10230 edgeworth-parents.txt

 1242990 260819 10059 melville-moby_dick.txt

 468220 96825 1851 milton-paradise.txt

 112310 25833 2163 shakespeare-caesar.txt

 162881 37360 3106 shakespeare-hamlet.txt

 100351 23140 1907 shakespeare-macbeth.txt

 711215 154883 4250 whitman-leaves.txt

 >>> raw[:1000]

 "[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses fo

 r my Body let us write, (for we are one,)\nThat should I after return,\nOr, long

 , long hence, in other spheres,\nThere to some group of mates the chants resumin

 g,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd

 smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and

  now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BO

 OK I.  INSCRIPTIONS]\n\n}  One's-Self I Sing\n\nOne's-self I sing, a simple sepa

 rate person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology

  from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for th

 e Muse, I say\n    the Form complete is worthier far,\nThe Female equally with t

 he Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for

 freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n}  As

  I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"

 >>>

 >>> words

 ['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]

 >>> sents

 [['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'], ['Come',

 ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body', 'let', 'u

 s', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should', 'I', '

 after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',', 'in', 'othe

 r', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates', 'the', 'chant

 s', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil', ',', 'trees', '

 ,', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with', 'pleas', "'", 'd'

 , 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and', 'ever', 'yet', 'the', '

 verses', 'owning', '--', 'as', ',', 'first', ',', 'I', 'here', 'and', 'now', 'Si

 gning', 'for', 'Soul', 'and', 'Body', ',', 'set', 'to', 'them', 'my', 'name', ',

 '], ...]

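raw gives the unprocessed contents of the file as a single string, so len(raw) counts characters, including spaces and line breaks. words gives the word tokens and sents gives the sentences; as the output shows, each sentence is stored as a list of words. After the loop finishes, raw, words and sents still refer to the last file, whitman-leaves.txt. Besides words(), raw() and sents(), most nltk corpus readers offer a variety of other access methods.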
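
The three counts printed above can be combined into simple per-text statistics. The following is a minimal sketch in plain Python, assuming a tiny hand-made sample in place of a real corpus file; the function name text_stats and the sample data are illustrative and not part of nltk.

```python
def text_stats(raw, words, sents):
    """Return (characters per word, words per sentence, uses per vocabulary item)."""
    vocab = {w.lower() for w in words}          # case-folded vocabulary, as in the loop above
    avg_word_len = len(raw) / len(words)        # counts spaces too, so it overestimates
    avg_sent_len = len(words) / len(sents)
    lexical_diversity = len(words) / len(vocab)
    return avg_word_len, avg_sent_len, lexical_diversity

# Hypothetical stand-in for gutenberg.raw(), .sents() and .words() of one file
raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for s in sents for w in s]

stats = text_stats(raw, words, sents)
print(stats)
```

Applied to the numbers printed for austen-emma.txt, the same arithmetic gives roughly 887071 / 192427 ≈ 4.6 characters per word and 192427 / 7752 ≈ 24.8 words per sentence. Because the character count includes whitespace, the true average word length is somewhat lower than the first figure.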

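2. Web and chat text

Project Gutenberg contains thousands of books, but they are fairly formal and represent established literature. nltk also ships a small collection of web text, including content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews. The texts can be accessed as follows:

 >>> from nltk.corpus import webtext

 >>> for fileid in webtext.fileids():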

 ...     print("%s   %s ..." % (fileid, webtext.raw(fileid)[:65]))

 ...

 firefox.txt   Cookie Manager: "Don't allow sites that set removed cookies to se

 ...

 grail.txt   SCENE 1: [wind] [clop clop clop]

 KING ARTHUR: Whoa there!  [clop ...

 overheard.txt   White guy: So, do you have any plans for this evening?

 Asian girl ...

 pirates.txt   PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr

 ...

 singles.txt   25 SEXY MALE, seeks attrac older single lady, for discreet encoun

 ...

 wine.txt   Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

 >>>

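3. The instant messaging chat corpus

This corpus was originally collected by the US Naval Postgraduate School for research on the automatic detection of Internet predators. It contains over 10,000 posts, organized into 15 files; each file holds several hundred posts collected on a given date from an age-specific chatroom. The file name encodes the date, the chatroom, and the number of posts.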
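
In nltk this corpus is exposed as nltk.corpus.nps_chat, and its posts() method takes a file name such as 10-19-20s_706posts.xml. The sketch below does not touch the corpus itself; it only shows how such a file name can be unpacked into date, chatroom and post count. The regular expression and the helper parse_chat_fileid are assumptions made for illustration, based on that one documented file name.

```python
import re

# Assumed layout: <month>-<day>-<chatroom>_<count>posts.xml
FILEID_RE = re.compile(r"(\d+)-(\d+)-(\w+?)_(\d+)posts\.xml")

def parse_chat_fileid(fileid):
    """Split a chat corpus file name into its date, chatroom and post count."""
    m = FILEID_RE.fullmatch(fileid)
    if m is None:
        raise ValueError("unexpected file name: %r" % fileid)
    month, day, room, count = m.groups()
    return {"month": int(month), "day": int(day),
            "chatroom": room, "posts": int(count)}

info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```

Here the file name describes 706 posts collected on October 19 from the chatroom for people in their twenties.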

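4. The Brown corpus

The Brown corpus was the first million-word electronic corpus of English. It contains text from 500 different sources, categorized by genre, such as news and editorial. It is mainly used to study systematic differences between genres, a kind of linguistic inquiry known as stylistics. The corpus can be accessed as a list of words or as a list of sentences.

1) Reading by category or by file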

 >>> from nltk.corpus import brown

 >>> brown.categories()

 ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',

  'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',

  'science_fiction']

 >>> brown.words(categories='news')

 ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

 >>> brown.words(fileids=['cg22'])

 ['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

 >>> brown.sents(categories=['news', 'editorial', 'reviews', ])

 [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investiga

 tion', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no

 ', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['T

 he', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the',

  'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'o

 f', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks',

 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which

 ', 'the', 'election', 'was', 'conducted', '.'], ...]

 >>>

2) Comparing the use of modal verbs across genres

 >>> from nltk.corpus import brown

 >>> news_text = brown.words(categories='news')

 >>> fdist=nltk.FreqDist([w.lower() for w in news_text])

 >>> modals = ['can', 'could', 'may', 'might', 'must', 'will']

 >>> for m in modals:

 ...     print("%s:%d" %(m, fdist[m]))

 ...

 can:94

 could:87

 may:93

 might:38

 must:53

 will:389

 >>>
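
The session above counts modals in a single genre. To compare genres, the counts have to be kept per genre; in nltk that is what ConditionalFreqDist is for. The following plain-Python sketch mimics the idea with collections.Counter. The (genre, word) pairs are hypothetical toy data, not taken from the Brown corpus, and modal_counts is an illustrative helper rather than an nltk function.

```python
from collections import Counter, defaultdict

def modal_counts(genre_word_pairs, modals):
    """Count each modal verb separately for every genre."""
    table = defaultdict(Counter)
    wanted = set(modals)
    for genre, word in genre_word_pairs:
        w = word.lower()                 # fold case, as the session above does
        if w in wanted:
            table[genre][w] += 1
    return table

# Hypothetical toy data standing in for (genre, word) pairs from a corpus
pairs = [("news", "Will"), ("news", "can"), ("news", "will"),
         ("romance", "could"), ("romance", "Could"), ("romance", "will")]
modals = ["can", "could", "may", "might", "must", "will"]

table = modal_counts(pairs, modals)
for genre in sorted(table):
    print(genre, [table[genre][m] for m in modals])
```

One row per genre and one column per modal makes differences between genres easy to read off, which is the point of the comparison.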

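5. The Reuters corpus

The Reuters corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and split into two sets, training and test; the split exists so that algorithms that automatically detect the topic of a document can be trained and tested. Unlike the Brown corpus, categories in the Reuters corpus overlap, because a news story often covers several topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For example:

 >>> from nltk.corpus import reuters

 >>> reuters.categories()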

 ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...]

 >>> reuters.categories('training/9865')

 ['barley', 'corn', 'grain', 'wheat']

 >>> reuters.categories(['training/9865', 'training/9880'])

 ['barley', 'corn', 'grain', 'money-fx', 'wheat']

 >>> reuters.fileids('barley')

 ['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/158

 75', ...]

 >>> reuters.fileids(['barley', 'corn'])

 ['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/152

 87', 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]

 >>>

 >>> reuters.words('training/9865')[:14]

 ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', '

 operators', 'have', 'requested', 'licences', 'to', 'export']

 >>> reuters.words(['training/9865', 'training/9880'])

 ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

 >>> reuters.words(categories='barley')

 ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

 >>> reuters.words(categories=['barley', 'corn'])

 ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

 >>>
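
Because categories overlap, the lookup works in both directions: from files to the union of their categories, and from categories to the union of their files. The sketch below illustrates this with a small dictionary in plain Python. The entry for training/9865 is copied from the session above; the entry for training/9880 is an assumption that is merely consistent with the combined result shown there. The function names are illustrative.

```python
from collections import defaultdict

# File -> categories (second entry is an assumption, see the text above)
doc_categories = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "training/9880": ["money-fx"],
}

def categories_of(fileids):
    """Union of the categories of the given files, sorted."""
    return sorted({c for f in fileids for c in doc_categories[f]})

def fileids_of(categories):
    """Union of the files that belong to any of the given categories, sorted."""
    index = defaultdict(set)             # inverted index: category -> files
    for f, cats in doc_categories.items():
        for c in cats:
            index[c].add(f)
    return sorted({f for c in categories for f in index[c]})

both = categories_of(["training/9865", "training/9880"])
grain_files = fileids_of(["barley", "corn"])
print(both)
print(grain_files)
```

A file that carries both barley and corn appears only once in the second result, which is why a set is used before sorting.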

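6. The Inaugural Address corpus

The corpus is really a collection of texts, each one a presidential address; the listing below has 56 of them. A notable property of the collection is its time dimension.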

 >>> from nltk.corpus import inaugural

 >>> inaugural.fileids()

 ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson

 .txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monro

 e.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.t

 xt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt

 ', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt

 ', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1

 885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.tx

 t', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt

 ', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt'

 , '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosev

 elt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961

 -Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Car

 ter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.t

 xt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']

 >>> [fileid[:4] for fileid in inaugural.fileids()]

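 ['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825',

  '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865',

  '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905',

  '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945',

  '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985',

  '1989', '1993', '1997', '2001', '2005', '2009']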

 >>>

Note that the year of each text appears in its file name. To get the year out of the file name, extract the first four characters with fileid[:4].

 >>> import nltk

 >>> cfd=nltk.ConditionalFreqDist(

 ... (target, fileid[:4]

 ... )

 ... for fileid in inaugural.fileids()

 ... for w in inaugural.words(fileid)

 ... for target in ['america', 'citizen']

 ... if w.lower().startswith(target))

 >>> cfd.plot()

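The example above shows how the words america and citizen are used over time. Every word in the Inaugural Address corpus that starts with america or citizen (after lowercasing) is counted. Counts are kept separately for each address and plotted, so that trends in usage over time can be observed. The counts are not normalized by document length.

7. Annotated text corpora

Many text corpora contain linguistic annotations: part-of-speech tags, named entities, syntactic structures, semantic roles, and so on. nltk provides convenient ways to access several of these corpora, and it has data packages containing corpora and corpus samples that can be downloaded free of charge for teaching and research.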
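
Since the counts are raw, a long address can look like heavier usage than a short one. One way to make addresses comparable is to express each count per 1,000 words. The following is a minimal sketch of that normalization, assuming two hypothetical miniature "speeches"; relative_counts is an illustrative helper and not part of nltk.

```python
from collections import Counter

def relative_counts(speeches, targets):
    """Map (target, year) to occurrences per 1,000 words of that speech."""
    rates = {}
    for year, words in speeches.items():
        hits = Counter()
        for w in words:
            for t in targets:
                if w.lower().startswith(t):   # same prefix test as in the session above
                    hits[t] += 1
        for t in targets:
            rates[(t, year)] = 1000.0 * hits[t] / len(words)
    return rates

# Hypothetical toy data: word lists keyed by year
speeches = {
    "1789": "Fellow citizens of America".split(),
    "2009": "My fellow citizens I stand here today humbled by Americans".split(),
}

rates = relative_counts(speeches, ["america", "citizen"])
for key in sorted(rates):
    print(key, rates[key])
```

Both toy speeches contain each target once, yet the shorter one ends up with the higher rate, which is exactly the effect the raw plot hides.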

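8. Corpora in other languages

nltk also comes with corpora in many languages. One example is udhr, which contains the Universal Declaration of Human Rights in over 300 languages. The fileids of this corpus include information about the character encoding used in each file, such as UTF8 or Latin1. The example below uses a conditional frequency distribution to examine differences in word length across several language versions of the declaration: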

 >>> from nltk.corpus import udhr

 >>> languages=['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut'

 , 'Hungarian_Magyar', 'Ibibio_Efik']

 >>>

 >>> cfd=nltk.ConditionalFreqDist(

 ... (lang, len(word))

 ... for lang in languages

 ... for word in udhr.words(lang+'-Latin1'))

 >>> cfd.plot(cumulative=True)

 >>>
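
With cumulative=True the plot shows, for each word length, how many words are that long or shorter. The sketch below computes the same kind of curve as fractions for one short word list in plain Python. The sample sentence and the helper cumulative_length_share are illustrative assumptions, not nltk code.

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_share(words):
    """Return [(length, fraction of words with at most that length)], by length."""
    counts = Counter(len(w) for w in words)
    lengths = sorted(counts)
    running = accumulate(counts[n] for n in lengths)   # running totals
    total = len(words)
    return [(n, c / total) for n, c in zip(lengths, running)]

# Hypothetical sample standing in for udhr.words(...) of one language
sample = "all human beings are born free and equal".split()

curve = cumulative_length_share(sample)
print(curve)
```

A language whose curve rises quickly uses mostly short words; a curve that rises slowly indicates many long words.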

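9. Basic corpus functions defined in nltk

Example	Description
fileids()	the files of the corpus
fileids([categories])	the files of the corpus corresponding to these categories
categories()	the categories of the corpus
categories([fileids])	the categories of the corpus corresponding to these files
raw()	the raw content of the corpus
raw(fileids=[f1, f2, f3])	the raw content of the specified files
raw(categories=[c1, c2])	the raw content of the specified categories
words()	the words of the whole corpus
words(fileids=[f1,f2,f3])	the words of the specified files
words(categories=[c1,c2])	the words of the specified categories
sents()	the sentences of the whole corpus
sents(fileids=[f1,f2,f3])	the sentences of the specified files
sents(categories=[c1,c2])	the sentences of the specified categories
abspath(fileid)	the location of the given file on disk
encoding(fileid)	the encoding of the file (if known)
open(fileid)	open a stream for reading the given corpus file
root()	the path to the root of the locally installed corpus
readme()	the contents of the README file of the corpus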

10. Loading your own corpus

 >>> from nltk.corpus import PlaintextCorpusReader

 >>> corpus_root = r"E:\corpora"   # local directory holding the texts; the nltk data itself is stored on drive D:

 >>> wordlists = PlaintextCorpusReader(corpus_root, '.*')

 >>> wordlists.fileids()   # list the files; aaaaaaaaaaa.txt is a custom file

 ['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt', 'auste

 n-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess

 -busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown

 .txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'luo.txt', 'melville-

 moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-ha

 mlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

 >>>

Once your own corpus has been loaded, the functions described above can be used on its texts.
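
The second argument of PlaintextCorpusReader is a regular expression that selects files, and '.*' matches everything, which is why README shows up in the listing. The sketch below is a plain-Python stand-in written for illustration, not nltk's own code; it assumes the pattern has to match the whole relative path, and the helper find_fileids and the temporary files are hypothetical.

```python
import os
import re
import tempfile

def find_fileids(root, pattern):
    """Return sorted relative paths under root that fully match the regex pattern."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")        # use forward slashes on every platform
            if re.fullmatch(pattern, rel):
                found.append(rel)
    return sorted(found)

# Build a throwaway corpus directory with three small files
with tempfile.TemporaryDirectory() as root:
    for name in ("README", "luo.txt", "austen-emma.txt"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write("sample text\n")
    everything = find_fileids(root, r".*")
    only_txt = find_fileids(root, r".*\.txt")

print(everything)
print(only_txt)
```

A narrower pattern such as r'.*\.txt' keeps helper files like README out of the corpus.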
