【448】NLP, NER, PoS
目录:
- 停用词 —— stopwords
- 介词 —— prepositions —— part of speech
- Named Entity Recognition (NER) 3.1 Stanford NER
3.2 spaCy
3.3 NLTK - 句子中单词提取(Word extraction)
1. 停用词(stopwords)
ref: Removing stop words with NLTK in Python
ref: Remove Stop Words
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english')) output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
2. 介词(prepositions, part of speech)
ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]
ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
3. Named Entity Recognition (NER)
ref: Introduction to Named Entity Recognition
ref: Named Entity Recognition with NLTK and SpaCy
- Standford NER
- spaCy
- NLTK
3.1 Stanford NER
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.''' import nltk
from nltk.tag import StanfordNERTagger print('NTLK Version: %s' % nltk.__version__) stanford_ner_tagger = StanfordNERTagger(
r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
) results = stanford_ner_tagger.tag(article.split()) print('Original Sentence: %s' % (article))
for result in results:
tag_value = result[0]
tag_type = result[1]
if tag_type != 'O':
print('Type: %s, Value: %s' % (tag_type, tag_value)) output:
NTLK Version: 3.4
Original Sentence:
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.
Type: DATE, Value: Tuesday
Type: LOCATION, Value: Europe
Type: ORGANIZATION, Value: Asia-Pacific
Type: LOCATION, Value: Japan
Type: PERCENT, Value: 1.7
Type: PERCENT, Value: percent
Type: ORGANIZATION, Value: Nikkei
Type: PERCENT, Value: 3.1
Type: PERCENT, Value: percent
Type: LOCATION, Value: European
Type: LOCATION, Value: Union
Type: PERSON, Value: Theresa
Type: PERSON, Value: May
3.2 spaCy
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp(article)
for X in doc.ents:
print('Value: %s, Type: %s' % (X.text, X.label_)) output:
Value: Asian, Type: NORP
Value: Tuesday, Type: DATE
Value: Europe, Type: LOC
Value: MSCI’s, Type: ORG
Value: Asia-Pacific, Type: LOC
Value: Japan, Type: GPE
Value: 1.7 percent, Type: PERCENT
Value: 1-1/2, Type: CARDINAL
Value: Australian, Type: NORP
Value: 1.6 percent, Type: PERCENT
Value: Japan, Type: GPE
Value: 3.1 percent, Type: PERCENT
Value: Apple, Type: ORG
Value: 1.286, Type: MONEY
Value: three, Type: CARDINAL
Value: Nov.1, Type: NORP
Value: the
European Union, Type: ORG
Value: Brexit, Type: GPE
Value: British, Type: NORP
Value: Theresa May, Type: PERSON
Value: Monday, Type: DATE
标签含义:https://spacy.io/api/annotation#pos-tagging
| Type | Description |
|---|---|
PERSON |
People, including fictional. |
NORP |
Nationalities or religious or political groups. |
FAC |
Buildings, airports, highways, bridges, etc. |
ORG |
Companies, agencies, institutions, etc. |
GPE |
Countries, cities, states. |
LOC |
Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT |
Objects, vehicles, foods, etc. (Not services.) |
EVENT |
Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART |
Titles of books, songs, etc. |
LAW |
Named documents made into laws. |
LANGUAGE |
Any named language. |
DATE |
Absolute or relative dates or periods. |
TIME |
Times smaller than a day. |
PERCENT |
Percentage, including ”%“. |
MONEY |
Monetary values, including unit. |
QUANTITY |
Measurements, as of weight or distance. |
ORDINAL |
“first”, “second”, etc. |
CARDINAL |
Numerals that do not fall under another type. |
3.3 NLTK
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker') def fn_preprocess(art):
art = nltk.word_tokenize(art)
art = nltk.pos_tag(art)
return art
art_processed = fn_preprocess(article)
print(art_processed) output:
[('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]
4. 句子中单词提取(Word extraction)
ref: An introduction to Bag of Words and how to code it in Python for NLP
import re
def word_extraction(sentence):
ignore = ['a', "the", "is"]
words = re.sub("[^\w]", " ", sentence).split()
cleaned_text = [w.lower() for w in words if w not in ignore]
return cleaned_text a = "alex is. good guy."
print(word_extraction(a)) output:
['alex', 'good', 'guy']
【448】NLP, NER, PoS的更多相关文章
- 【数据处理】各门店POS销售导入
--抓取西部POS数据DELETE FROM POSLSBF INSERT INTO POSLSBFselect * from [192.168.1.100].[SCMIS].DBO.possrlbf ...
- 论文笔记【一】Chinese NER Using Lattice LSTM
论文:Chinese NER Using Lattice LSTM 论文链接:https://arxiv.org/abs/1805.02023 论文作者:Yue Zhang∗and Jie Yang∗ ...
- 【LDA】nlp
http://pythonhosted.org/lda/getting_started.html http://radimrehurek.com/gensim/
- 448. Find All Numbers Disappeared in an Array【easy】
448. Find All Numbers Disappeared in an Array[easy] Given an array of integers where 1 ≤ a[i] ≤ n (n ...
- 机器学习(Machine Learning)&深度学习(Deep Learning)资料【转】
转自:机器学习(Machine Learning)&深度学习(Deep Learning)资料 <Brief History of Machine Learning> 介绍:这是一 ...
- 【Nodejs】理想论坛帖子爬虫1.01
用Nodejs把Python实现过的理想论坛爬虫又实现了一遍,但是怎么判断所有回调函数都结束没有好办法,目前的spiderCount==spiderFinished判断法在多页情况下还是会提前中止. ...
- 【BZOJ-1146】网络管理Network DFS序 + 带修主席树
1146: [CTSC2008]网络管理Network Time Limit: 50 Sec Memory Limit: 162 MBSubmit: 3495 Solved: 1032[Submi ...
- 通用js函数集锦<来源于网络> 【二】
通用js函数集锦<来源于网络> [二] 1.数组方法集2.cookie方法集3.url方法集4.正则表达式方法集5.字符串方法集6.加密方法集7.日期方法集8.浏览器检测方法集9.json ...
- 【BZOJ3940】【BZOJ3942】[Usaco2015 Feb]Censoring AC自动机/KMP/hash+栈
[BZOJ3942][Usaco2015 Feb]Censoring Description Farmer John has purchased a subscription to Good Hoov ...
随机推荐
- JS使用Cookie
包:https://www.npmjs.com/package/js-cookie 一.安装 npm install js-cookie --save 二.引用 import Cookies from ...
- Docker基础用法篇
Docker基础用法篇 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.安装docker 1>.依赖的基础环境 64 bits CPU Linux Kerner 3.10+ ...
- php中函数的类型提示和文件读取功能
这个没有深入. <?php function addNumbers(int $a, int $b, bool $printSum): int { $sum = $a + $b; if ($pri ...
- vs code c/c++编程配置文件
之前的C语言课程老师只讲了C没有接触C++,但是觉得C++挺重要的,而且python和java再去转exe有点麻烦,所以还是学一下C++. 问过朋友推荐了几个IDE,最后他用的是visual stud ...
- myeclipse常用快捷(持续更新)
最近开始转用myeclipse,总结一下快捷方式:(我喜欢用的) [Ctrl+O] 显示类中方法和属性的大纲,能快速定位类的方法和属性,在查找Bug时非常有用. [Ctrl+M] 窗口最大 ...
- php过滤敏感词
<?php /** * 敏感词过滤工具类 * 使用方法 * echo FilterTools::filterContent("你妈的我操一色狼杂种二山食物"," ...
- python - django 控制台输出 sql 语句
只需要在 settings.py 文件中加入以下配置即可. LOGGING = { 'version': 1, 'disable_existing_loggers': False, 'handlers ...
- Spring之IOC(控制反转)与AOP(面向切面编程)
控制反转——Spring通过一种称作控制反转(IoC)的技术促进了松耦合.当应用了IoC,一个对象依赖的其它对象会通过被动的方式传递进来,而不是这个对象自己创建或者查找依赖对象.可以认为IoC与JND ...
- docker 下载安装镜像
docker安装成功后. 1.搜索镜像 # docker search java 可使用 docker search命令搜索存放在 Docker Hub(这是docker官方提供的存放所有docker ...
- 4个MySQL优化工具AWR,帮你准确定位数据库瓶颈!(转载)
对于正在运行的mysql,性能如何,参数设置的是否合理,账号设置的是否存在安全隐患,你是否了然于胸呢? 俗话说工欲善其事,必先利其器,定期对你的MYSQL数据库进行一个体检,是保证数据库安全运行的重要 ...