Natural Language Processing (1): NLTK and Python

Preface: My current project is a search engine, which got me curious about natural language processing; on top of that, I have long wanted to learn Python but never had the chance or the time. A few days ago, while browsing Amazon for books, I happened upon *Natural Language Processing with Python* and immediately felt it could help me get started with both NLP and Python at the same time. So I will be working through this book for a while and writing up these notes.

1. NLTK Overview

The NLTK modules and what they do:

Language processing task | NLTK module | Functionality
Accessing corpora | nltk.corpus | standardized interfaces to corpora and lexicons
String processing | nltk.tokenize, nltk.stem | tokenizers, sentence splitters, stemmers
Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT
Classification | nltk.classify, nltk.cluster | decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking | nltk.chunk | regular expressions, n-grams, named entities
Parsing | nltk.parse | chart, feature-based, unification, probabilistic, dependency
Semantic interpretation | nltk.sem, nltk.inference | lambda calculus, first-order logic, model checking
Evaluation metrics | nltk.metrics | precision, recall, agreement coefficients
Probability and estimation | nltk.probability | frequency distributions, smoothed probability distributions
Applications | nltk.app, nltk.chat | graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork | nltk.toolbox | manipulate data in SIL Toolbox format
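
To get a feel for these modules, here is a minimal sketch using nltk.tokenize and nltk.stem (the sentence is my own toy example, and word_tokenize assumes the punkt data package has been downloaded):

  from nltk.tokenize import word_tokenize
  from nltk.stem import PorterStemmer

  # Split a raw string into word tokens, then reduce each token to its stem.
  tokens = word_tokenize("The whales were swimming monstrously fast.")
  stemmer = PorterStemmer()
  print([stemmer.stem(t) for t in tokens])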

2. Installing NLTK

My Python version is 2.7.5, and the NLTK version is 2.0.4.

  DESCRIPTION
  The Natural Language Toolkit (NLTK) is an open source Python library
  for Natural Language Processing. A free online book is available.
  (If you use the library for academic research, please cite the book.)

  Steven Bird, Ewan Klein, and Edward Loper (2009).
  Natural Language Processing with Python. O'Reilly Media Inc.
  http://nltk.org/book

  @version: 2.0.4

The installation steps are the same as at http://www.nltk.org/install.html:

1. Install Setuptools: http://pypi.python.org/pypi/setuptools

  The package, setuptools-5.7.tar.gz, is at the very bottom of the page.

2. Install Pip: run sudo easy_install pip (this must be run as root).

3. Install Numpy (optional): run sudo pip install -U numpy

4. Install NLTK: run sudo pip install -U nltk

5. Start Python and enter the following commands:

  :chapter2 rcf$ python
  Python 2.7.5 (default, Mar ..., ...)
  [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0...)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import nltk
  >>> nltk.download()

When the NLTK Downloader window appears, download the nltk_data packages.

Alternatively, you can download the data packages directly from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml and drop them into the Download Directory; that is what I did.
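
If the GUI is inconvenient (for example, on a remote machine), the same data can be fetched non-interactively; a minimal sketch using the 'book' collection identifier, which bundles exactly the data the NLTK book uses:

  import nltk

  # Fetch the corpora and models used by the NLTK book, without opening the GUI.
  nltk.download('book')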

Finally, run the following in the Python interpreter; output like this confirms that the installation succeeded:

  >>> from nltk.book import *
  *** Introductory Examples for the NLTK Book ***
  Loading text1, ..., text9 and sent1, ..., sent9
  Type the name of the text or sentence to view it.
  Type: 'texts()' or 'sents()' to list the materials.
  text1: Moby Dick by Herman Melville 1851
  text2: Sense and Sensibility by Jane Austen 1811
  text3: The Book of Genesis
  text4: Inaugural Address Corpus
  text5: Chat Corpus
  text6: Monty Python and the Holy Grail
  text7: Wall Street Journal
  text8: Personals Corpus
  text9: The Man Who Was Thursday by G . K . Chesterton 1908

3. First Steps with NLTK

Now we get to the real subject. Since I have never studied Python, learning NLTK doubles as my way of learning Python. These first steps mainly use the sample texts that ship with NLTK, listed in the output above; they all live in nltk.book.

3.1 Searching Text

concordance: search for "monstrous" in text1, showing each occurrence in context

  >>> text1.concordance("monstrous")
  Building index...
  Displaying 11 of 11 matches:
  ong the former , one was of a most monstrous size . ... This came towards us ,
  ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
  ll over with a heathenish array of monstrous clubs and spears . Some were thick
  d as you gazed , and wondered what monstrous cannibal and savage could ever hav
  that has survived the flood ; most monstrous and most mountainous ! That Himmal
  they might scout at Moby Dick as a monstrous fable , or still worse and more de
  th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
  ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
  ere to enter upon those still more monstrous stories of them which are to be fo
  ght have been rummaged out of this monstrous cabinet there is no telling . But
  of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

similar: find words in text1 that occur in similar contexts to "monstrous"

  >>> text1.similar("monstrous")
  Building word-context index...
  abundant candid careful christian contemptible curious delightfully
  determined doleful domineering exasperate fearless few gamesome
  horrible impalpable imperial lamentable lazy loving

dispersion_plot: draw a dispersion plot showing where given words occur in the text, i.e. their positional offsets (this requires matplotlib)

  >>> text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
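
A natural companion to similar() is common_contexts(), which lists the contexts shared by two or more words; a small sketch (the word pair is my own choice, picking "curious" from the similar() output above):

  >>> text1.common_contexts(["monstrous", "curious"])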

3.2 Counting Vocabulary

len: get a length; this can count the tokens in a whole text or the characters in a single word

  >>> len(text1)       # number of tokens in text1
  260819
  >>> len(set(text1))  # number of distinct tokens in text1
  19317
  >>> len(text1[0])    # length of the first token of text1
  1
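
These two counts combine into the lexical-diversity measure from the NLTK book (the share of distinct words in the text); a minimal sketch, assuming `from nltk.book import *` has already been run as above:

  def lexical_diversity(text):
      # Ratio of distinct tokens to total tokens; float() avoids
      # integer division under Python 2.
      return len(set(text)) / float(len(text))

  print(lexical_diversity(text1))  # about 0.074 for text1, from the counts above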

sorted: sort a list of words

  >>> sent1
  ['Call', 'me', 'Ishmael', '.']
  >>> sorted(sent1)
  ['.', 'Call', 'Ishmael', 'me']

3.3 Frequency Distributions

nltk.probability.FreqDist

  >>> fdist1 = FreqDist(text1)  # frequency distribution over text1
  >>> fdist1                    # text1 has 19317 distinct samples and 260819 outcomes in total
  <FreqDist with 19317 samples and 260819 outcomes>
  >>> keys = fdist1.keys()
  >>> keys[:50]                 # the 50 most frequent samples of text1
  [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
  >>> fdist1.items()[:5]        # (sample, count) pairs; ',' alone occurs 18713 times out of 260819 tokens
  [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024)]
  >>> fdist1.hapaxes()[:50]     # samples that occur only once in text1; the rest (mostly numeral tokens) omitted
  ['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)', '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', ...]
  >>> fdist1['!\'"']
  1

  >>> fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the 50 most frequent words of text1
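
FreqDist counts any kind of sample, not just words; the NLTK book's word-length example makes a nice second illustration. A minimal sketch:

  from nltk.book import text1
  from nltk.probability import FreqDist

  # Distribution over token lengths instead of tokens.
  fdist_len = FreqDist([len(w) for w in text1])
  print(fdist_len.max())                  # the most common word length (3, per the book)
  print(fdist_len.freq(fdist_len.max()))  # its share of all tokens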

3.4 Fine-Grained Selection of Words

  >>> long_words = [w for w in set(text1) if len(w) > 15]  # words of text1 longer than 15 characters
  >>> sorted(long_words)
  ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
  >>> fdist1 = FreqDist(text1)  # words of text1 longer than 7 characters that occur more than 7 times
  >>> sorted([w for w in set(text1) if len(w) > 7 and fdist1[w] > 7])
  ['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful', 'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered', 'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed', 'entitled', 'especially', 'everything', 'excellent', 'experience', 'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic', 'happened', 'horrible', 'important', 'impossible', 'included', 'individual', 'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional', 'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing', 'position', 'possibly', 'probably', 'question', 'regularly', 'remember', 'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere', 'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger', 'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking', 'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever', 'wonderful', 'yesterday', 'yourself']

3.5 Collocations and Bigrams

bigrams() turns a word list into its bigrams, and collocations() prints the text's most significant collocations:

  >>> bigrams(['more','is','said','than','done'])
  [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
  >>> text1.collocations()
  Building collocations list
  Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
  whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
  years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
  mate; white whale; ivory leg; one hand
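
collocations() is a convenience wrapper; the nltk.collocations module from the table in section 1 exposes the scoring measures directly. A sketch ranking bigrams by pointwise mutual information (the frequency threshold of 5 is an arbitrary choice of mine):

  from nltk.book import text1
  from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

  # Collect bigram and unigram counts over the tokens of text1.
  finder = BigramCollocationFinder.from_words(text1)
  finder.apply_freq_filter(5)  # drop bigrams seen fewer than 5 times
  # Score the remaining bigrams by pointwise mutual information, keep the top 10.
  bigram_measures = BigramAssocMeasures()
  print(finder.nbest(bigram_measures.pmi, 10))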

3.6 Functions Defined for NLTK Frequency Distributions

Example | Description
fdist = FreqDist(samples) | create a frequency distribution containing the given samples
fdist.inc(sample) | increment the count for this sample
fdist['monstrous'] | count of the number of times a given sample occurred
fdist.freq('monstrous') | frequency of a given sample
fdist.N() | total number of samples
fdist.keys() | the samples sorted in order of decreasing frequency
for sample in fdist: | iterate over the samples, in order of decreasing frequency
fdist.max() | sample with the greatest count
fdist.tabulate() | tabulate the frequency distribution
fdist.plot() | graphical plot of the frequency distribution
fdist.plot(cumulative=True) | cumulative plot of the frequency distribution
fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2
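
A quick sketch exercising several of these on text1 (note that fdist.inc() is NLTK 2 API; in NLTK 3 you would write fdist[sample] += 1 instead):

  from nltk.book import text1
  from nltk.probability import FreqDist

  fdist = FreqDist(text1)
  print(fdist.N())            # total number of samples: 260819
  print(fdist['whale'])       # how many times 'whale' occurs
  print(fdist.freq('whale'))  # the same, as a fraction of N
  print(fdist.max())          # the most frequent sample: ','
  fdist.tabulate(10)          # tabulate the 10 most frequent samples
  fdist.inc('whale')          # bump the count for 'whale' by one (NLTK 2 only)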

Finally, let's look at the class behind text1. type() shows a variable's type, and help() lists a class's attributes and methods. Whenever you need to look up the available methods later, help() is the tool to reach for.

  >>> type(text1)
  <class 'nltk.text.Text'>
  >>> help('nltk.text.Text')
  Help on class Text in nltk.text:

  nltk.text.Text = class Text(__builtin__.object)
  | A wrapper around a sequence of simple (string) tokens, which is
  | intended to support initial exploration of texts (via the
  | interactive console). Its methods perform a variety of analyses
  | on the text's contexts (e.g., counting, concordancing, collocation
  | discovery), and display the results. If you wish to write a
  | program which makes use of these analyses, then you should bypass
  | the ``Text`` class, and use the appropriate analysis function or
  | class directly instead.
  |
  | A ``Text`` is typically initialized from a given document or
  | corpus. E.g.:
  |
  | >>> import nltk.corpus
  | >>> from nltk.text import Text
  | >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
  |
  | Methods defined here:
  |
  | __getitem__(self, i)
  |
  | __init__(self, tokens, name=None)
  | Create a Text object.
  |
  | :param tokens: The source text.
  | :type tokens: sequence of str
  |
  | __len__(self)
  |
  | __repr__(self)
  | :return: A string representation of this FreqDist.
  | :rtype: string
  |
  | collocations(self, num=20, window_size=2)
  | Print collocations derived from the text, ignoring stopwords.
  |
  | :seealso: find_collocations
  | :param num: The maximum number of collocations to print.
  | :type num: int
  | :param window_size: The number of tokens spanned by a collocation (default=2)
  | :type window_size: int
  |
  | common_contexts(self, words, num=20)
  | Find contexts where the specified words appear; list
  | most frequent common contexts first.
  |
  | :param word: The word used to seed the similarity search
  | :type word: str
  | :param num: The number of words to generate (default=20)
  | :type num: int
  | :seealso: ContextIndex.common_contexts()

4. Technologies for Language Understanding

1. Word sense disambiguation

2. Pronoun resolution

3. Generating language output

4. Machine translation

5. Spoken dialogue systems

6. Textual entailment

5. Summary

Although this is my first contact with Python and NLTK, I can already tell how handy and convenient they are, and I will be studying them in depth next.
