Contents:

  1. Stop words (stopwords)
  2. Prepositions (part of speech)
  3. Named Entity Recognition (NER)
      3.1 Stanford NER
      3.2 spaCy
      3.3 NLTK
  4. Extracting words from a sentence (word extraction)

1. Stop words (stopwords)

ref: Removing stop words with NLTK in Python

ref: Remove Stop Words

    import nltk
    # nltk.download('stopwords')  # uncomment on first run to fetch the corpus
    from nltk.corpus import stopwords

    # Print NLTK's built-in English stop word list
    print(stopwords.words('english'))

Output:

    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
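
A common use of this list is to filter tokens before further processing. The sketch below is only an illustration: `remove_stopwords` and `SAMPLE_STOPWORDS` are hypothetical names, and the set is a hand-picked subset of the list above so that the snippet runs without downloading the corpus.

```python
# Assumption: a small hand-picked subset of the NLTK English list,
# used only so the example runs without the NLTK corpus installed.
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "of", "off"}


def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "This is a sample sentence showing off the stop words filtration".split()
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)
```

With NLTK installed, the full list can be passed instead, for example `remove_stopwords(tokens, set(stopwords.words('english')))`; converting to a set keeps each lookup cheap.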

2. Prepositions (part of speech)

ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:

    >>> import nltk
    >>> sentence = """At eight o'clock on Thursday morning
    ... Arthur didn't feel very good."""
    >>> tokens = nltk.word_tokenize(sentence)
    >>> tokens
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
    'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
    >>> tagged = nltk.pos_tag(tokens)
    >>> tagged[0:6]
    [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
    ('Thursday', 'NNP'), ('morning', 'NN')]
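
Once the tokens are tagged, removing prepositions comes down to filtering on the tag. The following is a minimal sketch, assuming the (token, tag) pairs shown above; `drop_tags` and `DROP_TAGS` are hypothetical names. Note that the Penn Treebank tag IN covers subordinating conjunctions as well as prepositions, so this filter is slightly broader than "prepositions only".

```python
# (token, tag) pairs copied from the tagger output shown above.
tagged = [
    ("At", "IN"), ("eight", "CD"), ("o'clock", "JJ"),
    ("on", "IN"), ("Thursday", "NNP"), ("morning", "NN"),
]

# Penn Treebank tags to drop: IN = preposition or subordinating
# conjunction, CC = coordinating conjunction, TO = "to".
DROP_TAGS = {"IN", "CC", "TO"}


def drop_tags(tagged_tokens, tags):
    """Keep only the tokens whose POS tag is not in tags."""
    return [word for word, tag in tagged_tokens if tag not in tags]


kept = drop_tags(tagged, DROP_TAGS)
print(kept)
```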

3. Named Entity Recognition (NER)

ref: Introduction to Named Entity Recognition

ref: Named Entity Recognition with NLTK and SpaCy

  • Stanford NER
  • spaCy
  • NLTK

3.1 Stanford NER

    article = '''
    Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
    sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
    riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
    week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
    electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
    sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
    European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

    import nltk
    from nltk.tag import StanfordNERTagger

    print('NLTK Version: %s' % nltk.__version__)

    # Paths to the Stanford NER model and jar; adjust them to your local install
    stanford_ner_tagger = StanfordNERTagger(
        r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
        r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
    )

    # Tag the whitespace-separated tokens; the result is a list of (token, tag) pairs
    results = stanford_ner_tagger.tag(article.split())

    print('Original Sentence: %s' % (article))
    for result in results:
        tag_value = result[0]
        tag_type = result[1]
        # 'O' marks tokens that are outside any named entity
        if tag_type != 'O':
            print('Type: %s, Value: %s' % (tag_type, tag_value))

Output:

    NLTK Version: 3.4
    Original Sentence:
    Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
    sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
    riskier assets. MSCIs broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
    week trough, with Australian shares sinking 1.6 percent. Japans Nikkei dived 3.1 percent led by losses in
    electric machinery makers and suppliers of Apples iphone parts. Sterling fell to $1.286 after three straight
    sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
    European Union over Brexit, British Prime Minister Theresa May said on Monday.
    Type: DATE, Value: Tuesday
    Type: LOCATION, Value: Europe
    Type: ORGANIZATION, Value: Asia-Pacific
    Type: LOCATION, Value: Japan
    Type: PERCENT, Value: 1.7
    Type: PERCENT, Value: percent
    Type: ORGANIZATION, Value: Nikkei
    Type: PERCENT, Value: 3.1
    Type: PERCENT, Value: percent
    Type: LOCATION, Value: European
    Type: LOCATION, Value: Union
    Type: PERSON, Value: Theresa
    Type: PERSON, Value: May
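
The tagger labels each token separately, so multi-word entities such as "Theresa May" come out as two rows. One possible way to join them is sketched below; `merge_entities` is a hypothetical helper, and the input is a shortened, hand-written sample in the same (token, tag) shape as the results above.

```python
from itertools import groupby

# Hand-written sample in the (token, tag) shape of the tagger results;
# the tags follow the output printed above ('O' = not an entity).
results = [
    ("European", "LOCATION"), ("Union", "LOCATION"), ("over", "O"),
    ("Brexit,", "O"), ("British", "O"), ("Prime", "O"), ("Minister", "O"),
    ("Theresa", "PERSON"), ("May", "PERSON"), ("said", "O"),
]


def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


entities = merge_entities(results)
print(entities)
```

This is a heuristic: two distinct entities of the same type that happen to be adjacent would be merged into one.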

3.2 spaCy

    import spacy
    from spacy import displacy
    from collections import Counter
    import en_core_web_sm

    # Load the small English model (installed separately from spaCy itself)
    nlp = en_core_web_sm.load()
    # `article` is the text defined in section 3.1
    doc = nlp(article)
    for X in doc.ents:
        print('Value: %s, Type: %s' % (X.text, X.label_))

Output:

    Value: Asian, Type: NORP
    Value: Tuesday, Type: DATE
    Value: Europe, Type: LOC
    Value: MSCIs, Type: ORG
    Value: Asia-Pacific, Type: LOC
    Value: Japan, Type: GPE
    Value: 1.7 percent, Type: PERCENT
    Value: 1-1/2, Type: CARDINAL
    Value: Australian, Type: NORP
    Value: 1.6 percent, Type: PERCENT
    Value: Japan, Type: GPE
    Value: 3.1 percent, Type: PERCENT
    Value: Apple, Type: ORG
    Value: 1.286, Type: MONEY
    Value: three, Type: CARDINAL
    Value: Nov.1, Type: NORP
    Value: the
    European Union, Type: ORG
    Value: Brexit, Type: GPE
    Value: British, Type: NORP
    Value: Theresa May, Type: PERSON
    Value: Monday, Type: DATE

Label meanings: https://spacy.io/api/annotation#pos-tagging

Type         Description
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws.
LANGUAGE     Any named language.
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including "%".
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      "first", "second", etc.
CARDINAL     Numerals that do not fall under another type.
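
To get a quick overview of which entity types dominate a text, the labels can be tallied with `collections.Counter`. This is a small sketch only: the `ents` list is a hand-copied subset of the spaCy output above rather than a live `doc.ents`, and `label_counts` is an illustrative name.

```python
from collections import Counter

# (text, label) pairs copied from part of the spaCy output above.
ents = [
    ("Asian", "NORP"), ("Tuesday", "DATE"), ("Europe", "LOC"),
    ("Japan", "GPE"), ("1.7 percent", "PERCENT"), ("Australian", "NORP"),
    ("1.6 percent", "PERCENT"), ("Theresa May", "PERSON"), ("Monday", "DATE"),
]

# Count how many entities carry each label.
label_counts = Counter(label for _, label in ents)
print(label_counts.most_common(3))
```

With a real spaCy document the same count would be `Counter(ent.label_ for ent in doc.ents)`.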

3.3 NLTK

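    import nltk
    from nltk import word_tokenize, pos_tag, ne_chunk

    # One-time downloads of the tokenizer, tagger, and chunker data
    nltk.download('words')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('punkt')
    nltk.download('maxent_ne_chunker')

    def fn_preprocess(art):
        # Tokenize, then attach a Penn Treebank POS tag to each token
        art = nltk.word_tokenize(art)
        art = nltk.pos_tag(art)
        return art

    # `article` is the text defined in section 3.1
    art_processed = fn_preprocess(article)
    print(art_processed)
    # ne_chunk(art_processed) would then group the tagged tokens into named-entity chunks

Output:
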
    [('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]

  

4. Extracting words from a sentence (word extraction)

ref: An introduction to Bag of Words and how to code it in Python for NLP

    import re

    def word_extraction(sentence):
        ignore = ['a', "the", "is"]
        # Replace every non-word character with a space, then split on whitespace
        words = re.sub(r"[^\w]", " ", sentence).split()
        # Lowercase before filtering so capitalized ignore words are dropped too
        cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
        return cleaned_text

    a = "alex is. good guy."
    print(word_extraction(a))

Output:

    ['alex', 'good', 'guy']
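
Word extraction is the first step of a bag-of-words representation, the topic of the article referenced above. The sketch below shows one way the remaining steps could look: build a vocabulary from all sentences, then count each vocabulary word per sentence. `tokenize` and `bag_of_words` are hypothetical names, and the sample sentences are made up for illustration.

```python
import re


def tokenize(sentence, ignore=("a", "the", "is")):
    """Lowercase, split on non-word characters, and drop ignored words."""
    words = re.sub(r"[^\w]", " ", sentence).lower().split()
    return [w for w in words if w not in ignore]


def bag_of_words(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    # Each vector holds, for every vocabulary word, its count in that sentence.
    vectors = [[words.count(term) for term in vocab] for words in tokenized]
    return vocab, vectors


vocab, vectors = bag_of_words(["Alex is a good guy.", "The good guy is Alex, good."])
print(vocab)    # ['alex', 'good', 'guy']
print(vectors)  # [[1, 1, 1], [1, 2, 1]]
```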
