NLTK built-in functions

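1. Concordance

The examples assume the sample texts (text1, text2, ...) have been loaded with from nltk.book import *.

(1) The concordance function shows every occurrence of a given word, together with its surrounding context.

>>> text1.concordance('monstrous')

(2) The similar function finds other words that appear in the same kinds of context as the given word, such as the ___ pictures and the ___ size.

>>> text1.similar("monstrous")

(3) The common_contexts function finds the contexts shared by two or more words.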

>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
>>>
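To make the idea behind concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name simple_concordance, the width parameter, and the sample sentence are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return (left context, word, right context) for each occurrence of target."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample sentence, not taken from any NLTK corpus
tokens = "the monstrous whale and the most monstrous size of it".split()
for left, word, right in simple_concordance(tokens, "monstrous", width=2):
    print(f"{left:>15} [{word}] {right}")
```

Each match is shown with two tokens of context on either side, which is the same keyword-in-context layout that concordance prints.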

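2. Lexical dispersion plot

A word's position in a text can be measured as the number of words that precede it, counting from the beginning. A dispersion plot displays this positional information: each row covers the whole text, and each mark is one occurrence of the word.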

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>
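The data behind such a plot is just a list of offsets per word. A minimal sketch, assuming a hypothetical helper word_offsets and a made-up token list (plotting itself is left out):

```python
def word_offsets(tokens, targets):
    """Map each target word to the token offsets at which it occurs."""
    wanted = set(targets)
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets


# Hypothetical sample text
tokens = "we the citizens value freedom and the duties of citizens".split()
print(word_offsets(tokens, ["citizens", "freedom", "duties"]))
# {'citizens': [2, 9], 'freedom': [4], 'duties': [7]}
```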

3. Counting tokens

len() gives the number of tokens (words and punctuation symbols) in a text.

>>> len(text3)
44764

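4. From text to sorted vocabulary

set(text3) collapses duplicate tokens into the vocabulary, and sorted() puts it in order.

>>> sorted(set(text3))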

5. Lexical richness

>>> from __future__ import division  # only needed on Python 2; Python 3 already uses true division
>>> len(text3) / len(set(text3))
16.050197203298673
>>>

6. Count and percentage of a word in a text

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
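The two calculations above are convenient to wrap in small functions. A minimal sketch for Python 3, using a made-up token list so that it runs without any corpus; the function names are illustrative choices:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total


# Hypothetical sample text: 8 tokens, 3 distinct
tokens = "a rose is a rose is a rose".split()
print(lexical_diversity(tokens))                   # 8 / 3
print(percentage(tokens.count("a"), len(tokens)))  # 37.5
```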

7. Indexing lists

(1) The number that gives an element's position is called its index.

>>> text1[50]
'grammars'
>>>

(2) Find the index at which a word first occurs.

>>> text1.index('grammars')
50
>>>

8. Slicing extracts a run of words (a fragment of the text).

>>> text1[100:120]
['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-', 'fish', 'is', 'to', 'be', 'called', 'in', 'our', 'tongue', 'leaving', 'out']
>>>
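Indexing and slicing work the same way on any Python list, so they can be tried without loading a corpus. A small sketch with a made-up token list; note that a slice m:n starts at index m and stops just before index n:

```python
# Hypothetical token list standing in for an NLTK text
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

print(tokens[2])                # 'Ishmael'
print(tokens.index("Ishmael"))  # 2, the first position where the word occurs
print(tokens[1:4])              # ['me', 'Ishmael', '.'], i.e. indexes 1, 2, 3
print(tokens[-2:])              # ['years', 'ago'], the last two tokens
```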

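9. Functions defined for NLTK's frequency distribution class

Example                        Description
fdist = FreqDist(samples)      Create a frequency distribution containing the given samples
fdist.inc(sample)              Increment the count for this sample (NLTK 2 only; in NLTK 3 write fdist[sample] += 1)
fdist['monstrous']             Number of times the given sample occurred
fdist.freq('monstrous')        Frequency of the given sample
fdist.N()                      Total number of samples
fdist.keys()                   The samples sorted in order of decreasing frequency (NLTK 2; in NLTK 3 use fdist.most_common())
for sample in fdist:           Iterate over the samples in order of decreasing frequency (NLTK 2 ordering)
fdist.max()                    Sample with the greatest count
fdist.tabulate()               Tabulate the frequency distribution
fdist.plot()                   Plot the frequency distribution
fdist.plot(cumulative=True)    Plot the cumulative frequency distribution
fdist1 < fdist2                Test whether samples in fdist1 occur less frequently than in fdist2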
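In NLTK 3, FreqDist is a subclass of collections.Counter, so the counting operations in the table can be tried with the standard library alone. A minimal sketch under that assumption, with a made-up sample sentence; the comments name the FreqDist call each line corresponds to:

```python
from collections import Counter

# Hypothetical samples
samples = "the cat sat on the mat and the dog sat too".split()
fdist = Counter(samples)

fdist["bird"] += 1                 # add one observation of a sample
total = sum(fdist.values())        # total number of samples, like fdist.N()
print(fdist["the"])                # 3, count of a given sample
print(fdist["the"] / total)        # 0.25, like fdist.freq('the')
print(fdist.most_common(2))        # [('the', 3), ('sat', 2)]
print(max(fdist, key=fdist.get))   # 'the', like fdist.max()
```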

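text1.concordance("monstrous") # search for a word and show each occurrence in context
text1.similar("monstrous") # find words that appear in similar contexts
text2.common_contexts(["monstrous", "very"]) # contexts shared by two or more words
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # text4 joins its documents in chronological order, so this plots where each word occurs and can be used to study how language use changes over time
text3.generate() # generate random text from statistics on text3's word sequences (perhaps the principle behind machine-written SCI papers?)

len(text3) / len(set(text3)) # average number of uses per word, also called lexical richness
100 * text3.count("smote") / len(text3) # percentage of the text taken up by a specific word
Tokens: all words
Types: unique words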

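FreqDist(text1).most_common(50) # the 50 most frequent words in text1 with their counts (the NLTK 2 form was FreqDist(text1).keys()[:50]); FreqDist([]) counts how often each element of a list occurs
FreqDist(text1).hapaxes() # words that occur only once
list(bigrams(['more', 'is', 'said', 'than', 'done'])) # build bigrams: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]; list() is needed in NLTK 3, where bigrams returns a generator
text4.collocations() # print the collocations (frequently co-occurring bigrams) found in the text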
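The bigram construction above amounts to pairing each token with its successor. A minimal sketch in plain Python (the name simple_bigrams is an illustrative choice, not an NLTK function):

```python
def simple_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


print(simple_bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```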

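nltk.Text(gutenberg.words('austen-emma.txt')) # wrap the words in a Text object; only then can concordance and similar methods be used
gutenberg.raw(fileid) # the raw text content of the file
gutenberg.words(fileid) # the file's words as a list (take len() for the word count)
gutenberg.sents(fileid) # the file's sentences, each a list of words (take len() for the sentence count)
wordlists = PlaintextCorpusReader(corpus_root, '.*') # load your own corpus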

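Example                                 Description
cfdist = ConditionalFreqDist(pairs)     Create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     The list of conditions (sorted alphabetically in NLTK 2)
cfdist[condition]                       The frequency distribution for this condition
cfdist[condition][sample]               Frequency of the given sample for this condition
cfdist.tabulate()                       Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    Tabulation limited to the specified samples and conditions
cfdist.plot()                           Plot the conditional frequency distribution
cfdist.plot(samples, conditions)        Plot limited to the specified samples and conditions
cfdist1 < cfdist2                       Test whether samples in cfdist1 occur less frequently than in cfdist2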
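Conceptually, a conditional frequency distribution is one frequency distribution per condition. The following sketch models that with the standard library; it is not NLTK's implementation, and the helper name conditional_freq and the sample pairs are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Hypothetical (condition, sample) pairs, e.g. (genre, word)
pairs = [("news", "the"), ("news", "said"), ("news", "the"),
         ("romance", "love"), ("romance", "the")]
cfd = conditional_freq(pairs)

print(sorted(cfd))                  # ['news', 'romance'], the conditions
print(cfd["news"]["the"])           # 2, count of 'the' under 'news'
print(cfd["news"].most_common(1))   # [('the', 2)]
```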

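An application of conditional frequency distributions:

import nltk

def generate_model(cfdist, word, num=15):
    """Print num words, each time moving on to the most likely successor."""
    for i in range(num):
        print(word)
        # the word that most often follows the current one
        word = cfdist[word].max()

# requires the genesis corpus: nltk.download('genesis')
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
# condition = first word of each bigram, sample = the word that follows it
cfd = nltk.ConditionalFreqDist(bigrams)

print(cfd['living'])  # distribution of the words that follow 'living'

generate_model(cfd, 'living')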

nltk.corpus.stopwords.words('english') # stop words
nltk.corpus.names # corpus of personal names

wordnet.synsets('car') # synonym sets (synsets) for 'car'
wordnet.lemmas('car') # all lemmas involving the word 'car'

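Downloading, reading, and processing web text
from urllib.request import urlopen # Python 3; on Python 2 this was: from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf8') # read() returns bytes on Python 3, so decode to str

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read().decode('utf8')
raw = nltk.clean_html(html) # strips HTML markup, although navigation text and similar content remains (NLTK 2 only; in NLTK 3 this function raises NotImplementedError and points to BeautifulSoup's get_text())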
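Because nltk.clean_html is not available in NLTK 3, markup has to be stripped some other way. The sketch below uses the standard library's html.parser as a stand-in; it is not the algorithm the old function used, and the class name TextExtractor, the helper strip_html, and the sample HTML are all assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page fragment
html = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Blondes</h1><p>to <b>die out</b> in 200 years</p></body></html>")
print(strip_html(html))  # Blondes to die out in 200 years
```

As with the original function, navigation menus and similar page furniture are ordinary text nodes, so they survive this kind of stripping and need separate handling.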

import feedparser
blog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
blog['feed']['title']
post = blog.entries[2]

tokens = nltk.word_tokenize(raw) # tokenize
text = nltk.Text(tokens) # wrap in a Text object; only then can text.collocations() and similar methods be used

# Decoding
import codecs
f = codecs.open(path, encoding='latin2')
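On Python 3 the built-in open() accepts the same encoding argument, so codecs is optional there. A round-trip sketch, assuming a temporary file and a made-up Polish phrase, that writes Latin-2 bytes and decodes them on reading:

```python
import os
import tempfile

# Hypothetical sample text containing a character outside ASCII
text = "Pruska Biblioteka Państwowa"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "polish-lat2.txt")
    # Write the raw Latin-2 (ISO 8859-2) bytes
    with open(path, "wb") as f:
        f.write(text.encode("latin2"))
    # Read them back, letting open() decode to str
    with open(path, encoding="latin2") as f:
        decoded = f.read()

print(decoded == text)  # True
```

Reading the same file with the wrong encoding would either raise an error or silently produce the wrong characters, which is why the encoding is stated explicitly.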

# Regular expressions
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') ==> ['ing']
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') ==> ['processing']

re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') ==> [('processe', 's')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') ==> [('process', 'es')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language') ==> [('language', '')]
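The last pattern, with a non-greedy stem and an optional suffix, can be wrapped into a crude stemmer. A minimal sketch; the names SUFFIX_PATTERN and stem are illustrative, and this is far simpler than a real stemmer such as Porter's:

```python
import re

# Non-greedy stem, then at most one known suffix
SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'


def stem(word):
    """Strip one known suffix, if present, and return what is left."""
    stem_part, _suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem_part


print([stem(w) for w in ["processing", "processes", "language", "quickly"]])
# ['process', 'process', 'language', 'quick']
```

The non-greedy `.*?` makes the stem as short as possible, so the longest matching suffix is removed; the trailing `?` lets words without any listed suffix pass through unchanged.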

# Finding hypernyms and hyponyms
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

This yields:
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
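The pattern "X and other Ys" suggests that X is a kind of Y. The same search can be approximated over a plain token list; in this sketch the helper name and_other_pairs and the sample sentence are assumptions, and str.isalpha() is only a rough substitute for the `\w*` in the token pattern:

```python
def and_other_pairs(tokens):
    """Find 'X and other Ys' and return (X, Ys) pairs."""
    pairs = []
    for i in range(len(tokens) - 3):
        first, conj, other, last = tokens[i:i + 4]
        if (conj == "and" and other == "other"
                and first.isalpha() and last.isalpha() and last.endswith("s")):
            pairs.append((first, last))
    return pairs


# Hypothetical sample sentence
tokens = "he studied water and other liquids near tombs and other landmarks".split()
print(and_other_pairs(tokens))
# [('water', 'liquids'), ('tombs', 'landmarks')]
```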

# Stemming
tokens = nltk.word_tokenize(raw)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

# Lemmatization
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

# Tokenization
nltk.regexp_tokenize(text, pattern) # split text into tokens that match a regular expression
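Regular-expression tokenization can be illustrated with re.findall alone. This is a sketch of the idea rather than NLTK's implementation; the pattern, the function name simple_regexp_tokenize, and the sample sentence are assumptions chosen for illustration.

```python
import re

# Alternatives are tried left to right:
#   currency amounts and numbers, words with optional internal hyphens,
#   then any single character that is neither a word character nor whitespace
TOKEN_PATTERN = r"\$?\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]"


def simple_regexp_tokenize(text, pattern=TOKEN_PATTERN):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


print(simple_regexp_tokenize("The poster-print costs $12.40."))
# ['The', 'poster-print', 'costs', '$12.40', '.']
```

The groups are written as non-capturing `(?:...)` because re.findall returns only the captured groups when a pattern contains capturing ones.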

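# Procedural vs. declarative style in Python
# Find the longest words in a text

maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen] # an idiom worth learning and using often

lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories="news")] # a list, so len() works on Python 3, where map() returns an iterator
avg = sum(lengths) / len(lengths)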
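To see the contrast between the two styles, here is the longest-word task written both ways. A sketch with hypothetical function names and a made-up token list; both versions return the same result:

```python
def longest_words_procedural(tokens):
    """Track the running maximum step by step."""
    longest = []
    maxlen = 0
    for word in tokens:
        if len(word) > maxlen:
            maxlen = len(word)
            longest = [word]
        elif len(word) == maxlen:
            longest.append(word)
    return longest


def longest_words_declarative(tokens):
    """State what is wanted: the words whose length equals the maximum."""
    maxlen = max(len(word) for word in tokens)
    return [word for word in tokens if len(word) == maxlen]


# Hypothetical sample text
tokens = "the quick brown fox jumped over the lazy sleepy dog".split()
print(longest_words_procedural(tokens))   # ['jumped', 'sleepy']
print(longest_words_declarative(tokens))  # ['jumped', 'sleepy']
```

The declarative version makes two passes over the data but is shorter and easier to check, which is usually the better trade for text processing at this scale.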

set() # a set is indexed (hashed) behind the scenes, so prefer a set whenever you test membership
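A membership test on a list scans the elements one by one, while a set looks the item up by hash, which takes roughly constant time on average. A small sketch with made-up data showing the usual pattern of converting once and then testing many times:

```python
# Hypothetical vocabulary and text
vocabulary_list = ["whale", "ship", "sea", "captain"]
vocabulary_set = set(vocabulary_list)   # convert once, up front

tokens = "the captain saw the whale from the ship".split()

# Each 'in' test against the set is a hash lookup rather than a scan
unknown = [t for t in tokens if t not in vocabulary_set]
print(unknown)  # ['the', 'saw', 'the', 'from', 'the']
```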

matplotlib # plotting library
NetworkX # network (graph) visualization
