【389】Implement N-grams using NLTK
Ref: n-grams in python, four, five, six grams?
Ref: "Elegant n-gram generation in Python"
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [i for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [i for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [i for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')] 3 grams:
[('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')] 4 grams:
[('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]
Another method to output:
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [' '.join(list(i)) for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [' '.join(list(i)) for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [' '.join(list(i)) for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .'] 3 grams:
["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .'] 4 grams:
["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']
获取一段文字中的大写字母开头的词组和单词
import nltk
from nltk.corpus import stopwords
a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."
tokens = nltk.word_tokenize(a)
caps = []
for i in range(1, 4):
for eles in nltk.ngrams(tokens, i):
length = len(list(eles))
for j in range(length):
if eles[j][0].islower() or not eles[j][0].isalpha():
break
elif j == length - 1:
caps.append(' '.join(list(eles))) caps = list(set(caps))
caps = [c for c in caps if c.lower() not in stopwords.words('english')]
print(caps) outputs:
['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']
【389】Implement N-grams using NLTK的更多相关文章
- 【leetcode】Implement strStr() (easy)
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【leetcode】Implement strStr()
Implement strStr() Implement strStr(). Returns the index of the first occurrence of needle in haysta ...
- 【Leetcode】【Easy】Implement strStr()
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【LeetCode225】 Implement Stack using Queues★
1.题目 2.思路 3.java代码 import java.util.LinkedList; import java.util.Queue; public class MyStack { priva ...
- 【LeetCode232】 Implement Queue using Stacks★
1.题目描述 2.思路 思路简单,这里用一个图来举例说明: 3.java代码 public class MyQueue { Stack<Integer> stack1=new Stack& ...
- 【LeetCode】Implement strStr()(实现strStr())
这道题是LeetCode里的第28道题. 题目描述: 实现 strStr() 函数. 给定一个 haystack 字符串和一个 needle 字符串,在 haystack 字符串中找出 needle ...
- 28. Implement strStr()【easy】
28. Implement strStr()[easy] Implement strStr(). Returns the index of the first occurrence of needle ...
- 【LeetCode】哈希表 hash_table(共88题)
[1]Two Sum (2018年11月9日,k-sum专题,算法群衍生题) 给了一个数组 nums, 和一个 target 数字,要求返回一个下标的 pair, 使得这两个元素相加等于 target ...
- 【LeetCode】String to Integer (atoi) 解题报告
这道题在LeetCode OJ上难道属于Easy.可是通过率却比較低,究其原因是须要考虑的情况比較低,非常少有人一遍过吧. [题目] Implement atoi to convert a strin ...
随机推荐
- Android知识点textview加横线的属性
textView.getPaint().setFlags(Paint. UNDERLINE_TEXT_FLAG ); //下划线 textView.getPaint().setAntiAlias(tr ...
- sqlserver数据库设计完整性与约束
use StudentManageDB go --创建主键约束 if exists(select * from sysobjects where name='pk_StudentId') alter ...
- mysql 之审计 init-connect+binlog完成审计功能
mysql基于init-connect+binlog完成审计功能 目前社区版本的mysql的审计功能还是比较弱的,基于插件的审计目前存在于Mysql的企业版.Percona和MariaDB上,但是my ...
- Maven Project pom.xml属性解析
pom.xml文件: groupId 定义了项目属于哪个组,根据自己的情况命名,比如谷歌公司的angular项目,就取名为 com.google.angular artifactId 定义了当前Ma ...
- python读写json+字典保存
解决方案 json 模块提供了一种很简单的方式来编码和解码JSON数据. 其中两个主要的函数是 json.dumps()和 json.loads() , 要比其他序列化函数库如pickle的接口少得多 ...
- cocos源码分析--LayerColor的绘制过程
1开始,先创建一个LayerColor Scene *scene=Scene::create(); director->runWithScene(scene); //目标 auto layer ...
- 学习Python 新去处:Python 官方中文文档
Python 作为世界上最好用的语言,官方支持的文档一直没有中文.小伙伴们已经习惯了原汁原味的英文文档,但如果有官方中文文档,那么查阅或理解速度都会大大提升.本文将介绍隐藏在 Python 官网的中文 ...
- Android标题头滑动渐变,Titlebar滑动渐变,仿美团饿了么标题头渐变;
原理就是滑动中改变透明度: 核心代码: rv.addOnScrollListener(new RecyclerView.OnScrollListener() { @Override public vo ...
- vue从入门到女装??:从零开始搭建后台管理系统(二)用vue-docute生成线上文档
教程 vue从入门到女装??:从零开始搭建后台管理系统(一)安装框架 一个系统开发完成了总要有操作说明手册,接口文档之类的东西吧?这种要全部纯手写就很麻烦了,可以借助一些插件,比如: vue-docu ...
- 《C++数据结构-快速拾遗》 手写链表
注释:吕鑫老师C++对于找工作真的是很好的教程,基本什么方面都讲的很细致,但是对于大多数人只有快进快进再快进~~ 注释:基本链表信息自己百度,这里只是一个快速拾遗过程. 1.链表定义 typedef ...