【448】NLP, NER, PoS
Contents:
- 1. Stop words (stopwords)
- 2. Prepositions (part of speech)
- 3. Named Entity Recognition (NER)
  - 3.1 Stanford NER
  - 3.2 spaCy
  - 3.3 NLTK
- 4. Word extraction from a sentence
1. Stop words (stopwords)
ref: Removing stop words with NLTK in Python
ref: Remove Stop Words
- import nltk
- # nltk.download('stopwords')
- from nltk.corpus import stopwords
- print(stopwords.words('english'))
- output:
- ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
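A list like the one above is typically used as a filter over tokens. The sketch below shows that step under a few assumptions: `remove_stopwords` is a hypothetical helper name, and `SAMPLE_STOPWORDS` is a small subset hand-copied from the output above so the snippet runs without NLTK data. With NLTK installed, one would presumably pass `set(stopwords.words('english'))` instead.

```python
# Hypothetical helper: drop stop words from a list of tokens.
# SAMPLE_STOPWORDS is a small subset copied from the NLTK output above,
# used only so this example runs without downloading NLTK data.
SAMPLE_STOPWORDS = {"i", "me", "my", "this", "is", "a", "an", "the", "of", "for", "to"}


def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "This is a sample sentence showing the removal of stop words".split()
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)
# ['sample', 'sentence', 'showing', 'removal', 'stop', 'words']
```

Using a set rather than the raw list keeps each membership check constant-time, which matters once the text is longer than a sentence.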
2. Prepositions (part of speech)
ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]
ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
- >>> import nltk
- >>> sentence = """At eight o'clock on Thursday morning
- ... Arthur didn't feel very good."""
- >>> tokens = nltk.word_tokenize(sentence)
- >>> tokens
- ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
- 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
- >>> tagged = nltk.pos_tag(tokens)
- >>> tagged[0:6]
- [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
- ('Thursday', 'NNP'), ('morning', 'NN')]
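Once tokens carry Penn Treebank tags, removing prepositions comes down to filtering on the tag. A minimal sketch, assuming the tagged pairs shown above as input; `drop_tags` and `DROP_TAGS` are hypothetical names. Note that `IN` covers both prepositions and subordinating conjunctions, and that the word "to" has its own tag, `TO`.

```python
# The first six (token, tag) pairs, copied from the tagged[0:6] output above.
tagged = [("At", "IN"), ("eight", "CD"), ("o'clock", "JJ"),
          ("on", "IN"), ("Thursday", "NNP"), ("morning", "NN")]

# Penn Treebank tag to drop: IN = preposition or subordinating conjunction.
DROP_TAGS = {"IN"}


def drop_tags(tagged_tokens, tags_to_drop):
    """Keep only the words whose POS tag is not in tags_to_drop."""
    return [word for word, tag in tagged_tokens if tag not in tags_to_drop]


print(drop_tags(tagged, DROP_TAGS))
# ['eight', "o'clock", 'Thursday', 'morning']
```

Adding further tags to `DROP_TAGS` (for example `CC` for coordinating conjunctions) extends the same filter to other word classes.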
3. Named Entity Recognition (NER)
ref: Introduction to Named Entity Recognition
ref: Named Entity Recognition with NLTK and SpaCy
- Stanford NER
- spaCy
- NLTK
3.1 Stanford NER
- article = '''
- Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
- sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
- riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
- week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
- electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
- sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
- European Union over Brexit, British Prime Minister Theresa May said on Monday.'''
- import nltk
- from nltk.tag import StanfordNERTagger
- print('NLTK Version: %s' % nltk.__version__)
- # Arguments: path to the classifier model, then path to the Stanford NER jar
- stanford_ner_tagger = StanfordNERTagger(
-     r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
-     r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
- )
- # Whitespace tokenization: punctuation stays attached to the neighbouring word
- results = stanford_ner_tagger.tag(article.split())
- print('Original Sentence: %s' % (article))
- for result in results:
-     tag_value = result[0]
-     tag_type = result[1]
-     # 'O' marks tokens that are outside any named entity
-     if tag_type != 'O':
-         print('Type: %s, Value: %s' % (tag_type, tag_value))
- output:
- NLTK Version: 3.4
- Original Sentence:
- Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
- sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
- riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
- week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
- electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
- sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
- European Union over Brexit, British Prime Minister Theresa May said on Monday.
- Type: DATE, Value: Tuesday
- Type: LOCATION, Value: Europe
- Type: ORGANIZATION, Value: Asia-Pacific
- Type: LOCATION, Value: Japan
- Type: PERCENT, Value: 1.7
- Type: PERCENT, Value: percent
- Type: ORGANIZATION, Value: Nikkei
- Type: PERCENT, Value: 3.1
- Type: PERCENT, Value: percent
- Type: LOCATION, Value: European
- Type: LOCATION, Value: Union
- Type: PERSON, Value: Theresa
- Type: PERSON, Value: May
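The tagger labels each token separately, so multi-word entities such as "Theresa May" or "European Union" appear as two rows in the output above. One possible post-processing step is to join adjacent tokens that share a tag. The sketch below assumes (token, tag) pairs in the shape returned by `tag()`; `merge_entities` is a hypothetical helper, and the sample pairs are copied from the output with a few `'O'` tokens added for context.

```python
from itertools import groupby

# (token, tag) pairs in the shape returned by StanfordNERTagger.tag().
# Entity rows are copied from the output above; the 'O' rows are surrounding
# words from the article that the tagger left unlabelled.
tagged = [("dropped", "O"), ("1.7", "PERCENT"), ("percent", "PERCENT"),
          ("the", "O"), ("European", "LOCATION"), ("Union", "LOCATION"),
          ("Minister", "O"), ("Theresa", "PERSON"), ("May", "PERSON"),
          ("said", "O")]


def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((tag, " ".join(word for word, _ in group)))
    return entities


print(merge_entities(tagged))
# [('PERCENT', '1.7 percent'), ('LOCATION', 'European Union'), ('PERSON', 'Theresa May')]
```

This is a heuristic: two distinct entities of the same type that happen to sit next to each other would be merged into one.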
3.2 spaCy
- import spacy
- from spacy import displacy
- from collections import Counter
- import en_core_web_sm
- nlp = en_core_web_sm.load()
- doc = nlp(article)
- for X in doc.ents:
-     print('Value: %s, Type: %s' % (X.text, X.label_))
- output:
- Value: Asian, Type: NORP
- Value: Tuesday, Type: DATE
- Value: Europe, Type: LOC
- Value: MSCI’s, Type: ORG
- Value: Asia-Pacific, Type: LOC
- Value: Japan, Type: GPE
- Value: 1.7 percent, Type: PERCENT
- Value: 1-1/2, Type: CARDINAL
- Value: Australian, Type: NORP
- Value: 1.6 percent, Type: PERCENT
- Value: Japan, Type: GPE
- Value: 3.1 percent, Type: PERCENT
- Value: Apple, Type: ORG
- Value: 1.286, Type: MONEY
- Value: three, Type: CARDINAL
- Value: Nov.1, Type: NORP
- Value: the
- European Union, Type: ORG
- Value: Brexit, Type: GPE
- Value: British, Type: NORP
- Value: Theresa May, Type: PERSON
- Value: Monday, Type: DATE
Label meanings: https://spacy.io/api/annotation#pos-tagging
Type | Description |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | “first”, “second”, etc. |
CARDINAL | Numerals that do not fall under another type. |
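The spaCy listing above imports `Counter` without using it. One plausible use is tallying how many entities fall under each label. The sketch below assumes a hand-copied subset of the (text, label) pairs from the output so that it runs without spaCy; with a real `Doc` object the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# (text, label) pairs copied from part of the spaCy output above.
ents = [("Asian", "NORP"), ("Tuesday", "DATE"), ("Europe", "LOC"),
        ("Japan", "GPE"), ("1.7 percent", "PERCENT"), ("Australian", "NORP"),
        ("1.6 percent", "PERCENT"), ("Japan", "GPE"), ("3.1 percent", "PERCENT"),
        ("Theresa May", "PERSON"), ("Monday", "DATE")]

# Count how often each entity label occurs.
label_counts = Counter(label for _, label in ents)

for label, count in label_counts.items():
    print('Type: %s, Count: %d' % (label, count))
```

In this subset `PERCENT` is the most frequent label, which fits a market report full of percentage moves.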
3.3 NLTK
- import nltk
- from nltk import word_tokenize, pos_tag, ne_chunk
- nltk.download('words')
- nltk.download('averaged_perceptron_tagger')
- nltk.download('punkt')
- nltk.download('maxent_ne_chunker')
- def fn_preprocess(art):
-     art = nltk.word_tokenize(art)
-     art = nltk.pos_tag(art)
-     return art
- art_processed = fn_preprocess(article)
- print(art_processed)
- output:
- [('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), 
('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]
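The listing above imports `ne_chunk` but stops at POS tagging; passing the tagged tokens to `nltk.ne_chunk` would be the NLTK route to an entity tree. As a rough stand-in that needs no NLTK models, the sketch below groups consecutive proper-noun tokens (`NNP`/`NNPS`) into candidate names. This is a simple heuristic, not NLTK's chunker, and `proper_noun_runs` is a hypothetical helper.

```python
# Tagged tokens copied from the end of the output above.
tagged = [("European", "NNP"), ("Union", "NNP"), ("over", "IN"),
          ("Brexit", "NNP"), (",", ","), ("British", "NNP"),
          ("Prime", "NNP"), ("Minister", "NNP"), ("Theresa", "NNP"),
          ("May", "NNP"), ("said", "VBD"), ("on", "IN"),
          ("Monday", "NNP"), (".", ".")]


def proper_noun_runs(tagged_tokens):
    """Collect maximal runs of consecutive NNP/NNPS tokens as phrases."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        phrases.append(" ".join(current))
    return phrases


print(proper_noun_runs(tagged))
# ['European Union', 'Brexit', 'British Prime Minister Theresa May', 'Monday']
```

The heuristic cannot tell a title from a name ("British Prime Minister Theresa May" comes out as one phrase) and assigns no entity type, which is what a trained chunker adds.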
4. Word extraction from a sentence
ref: An introduction to Bag of Words and how to code it in Python for NLP
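- import re
- def word_extraction(sentence):
-     ignore = ['a', "the", "is"]
-     # Replace every non-word character with a space, then split on whitespace
-     words = re.sub(r"[^\w]", " ", sentence).split()
-     # Lowercase before checking the ignore list so "The" and "Is" are dropped too
-     cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
-     return cleaned_text
- a = "alex is. good guy."
- print(word_extraction(a))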
- output:
- ['alex', 'good', 'guy']
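Word extraction is the first step of the bag-of-words model that the reference describes. A minimal sketch of the remaining steps (build a vocabulary, then count each word per sentence) follows. `tokenize` is written in the spirit of `word_extraction` above, and `bag_of_words` and the second sample sentence are assumptions made for illustration.

```python
import re


def tokenize(sentence, ignore=("a", "the", "is")):
    """Split on non-word characters, lowercase, and drop ignored words."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in ignore]


def bag_of_words(sentences):
    """Return (vocabulary, vectors); each vector counts the vocabulary words."""
    tokenized = [tokenize(s) for s in sentences]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    vectors = [[tokens.count(w) for w in vocabulary] for tokens in tokenized]
    return vocabulary, vectors


# The first sentence is from the example above; the second is made up.
sentences = ["alex is. good guy.", "The good guy is a good friend"]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)  # ['alex', 'friend', 'good', 'guy']
print(vectors)     # [[1, 0, 1, 1], [0, 1, 2, 1]]
```

Each vector has one position per vocabulary word, so sentences of different lengths map to vectors of the same size.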