https://en.wikipedia.org/wiki/Part-of-speech_tagging

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

Principle

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch.

Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also infer from "sailor" and "hatch" that "dogs" is (1) used in a nautical context and (2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. Linguists distinguish parts of speech to various fine degrees, reflecting a chosen "tagging system".

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. Examples are NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech, and found that about as many words were ambiguous there as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as Ncmsan for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.
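As a rough illustration of how such a positional mnemonic can be unpacked, here is a minimal sketch. The lookup tables are hypothetical and cover only enough values to decode the Ncmsan example; a real morphosyntactic specification defines many more categories and values.

```python
# Hypothetical lookup tables: one (feature name, value codes) pair per position
# after the leading category letter. Only a few codes are listed.
NOUN_FIELDS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "a": "accusative"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(tag):
    """Expand a positional noun descriptor such as 'Ncmsan' into named features."""
    if not tag or tag[0] != "N":
        raise ValueError("this sketch only handles noun descriptors")
    features = {"Category": "Noun"}
    for code, (name, values) in zip(tag[1:], NOUN_FIELDS):
        # Unknown codes are kept verbatim rather than guessed.
        features[name] = values.get(code, code)
    return features


decoded = decode_noun_msd("Ncmsan")
print(decoded)
```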

History

The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kučera and W. Nelson Francis, in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word frequency and of part of speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, as of 2005 it had been superseded by larger corpora such as the 100-million-word British National Corpus.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

Use of Hidden Markov Models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about the following words.
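The counting step can be sketched in a few lines. The tag names and the ten toy tag sequences below are made up so that the table reproduces the 40% / 40% / 20% figures from the example; a real tagger would count over a tagged corpus and would also weigh how likely each word is under each tag.

```python
from collections import Counter, defaultdict


def transition_probs(tag_sequences):
    """Estimate P(next tag | previous tag) from sequences of tags."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tag: c / sum(row.values()) for tag, c in row.items()}
        for prev, row in counts.items()
    }


# Toy data (an assumption): what follows an article in ten observed cases.
toy = [["ART", "NOUN"]] * 4 + [["ART", "ADJ"]] * 4 + [["ART", "NUM"]] * 2
probs = transition_probs(toy)

# "can" may be a noun, a verb, or a modal; after an article only NOUN was seen.
candidates = ["NOUN", "VERB", "MODAL"]
best = max(candidates, key=lambda tag: probs["ART"].get(tag, 0.0))
print(probs["ART"], best)
```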

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
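The enumerate-and-multiply strategy can be sketched as follows. This is not the CLAWS implementation; the probabilities, tag names, and the three-word sentence are invented purely for illustration.

```python
from itertools import product

# Made-up model: P(tag | previous tag) and P(word | tag).
TRANS = {
    "<s>": {"ART": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "ART": {"NOUN": 0.9, "ADJ": 0.04, "VERB": 0.05, "MODAL": 0.01},
    "NOUN": {"VERB": 0.6, "NOUN": 0.3, "MODAL": 0.1},
    "VERB": {"NOUN": 0.5, "VERB": 0.1, "ART": 0.4},
    "MODAL": {"VERB": 0.9, "NOUN": 0.1},
}
EMIT = {
    "ART": {"the": 0.7},
    "NOUN": {"can": 0.01, "rusts": 0.001},
    "VERB": {"can": 0.005, "rusts": 0.01},
    "MODAL": {"can": 0.3},
}
CANDIDATES = {"the": ["ART"], "can": ["NOUN", "VERB", "MODAL"], "rusts": ["NOUN", "VERB"]}


def score(words, tags):
    """Multiply transition and emission probabilities along one tag sequence."""
    p, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        p *= TRANS.get(prev, {}).get(tag, 0.0) * EMIT.get(tag, {}).get(word, 0.0)
        prev = tag
    return p


words = ["the", "can", "rusts"]
combos = list(product(*(CANDIDATES[w] for w in words)))  # every tag combination
best = max(combos, key=lambda tags: score(words, tags))
print(len(combos), best)
```

The number of combinations is the product of the per-word ambiguities (1 x 3 x 2 here), which is why this approach becomes expensive on long runs of ambiguous words.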

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing (1997) [2], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous.
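That baseline is easy to sketch. The tiny training list below is invented, and "NP" is used as the proper-noun tag for unknown words, following the Brown Corpus convention mentioned earlier.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_words):
    """Map each known word to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, unknown_tag="NP"):
    """Tag known words with their most common tag, unknown words as proper nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


# Toy training data (an assumption, not taken from any corpus).
training = [("the", "AT"), ("dogs", "NNS"), ("dogs", "NNS"), ("dogs", "VBZ"), ("bark", "VB")]
model = train_baseline(training)
tagged = tag_baseline(["The", "dogs", "Zanzibar"], model)
print(tagged)
```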

CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech (DeRose 1990, p. 82)).

HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.^[1]

Dynamic programming methods

In 1987, Steven DeRose^[2] and Ken Church^[3] independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
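To show why dynamic programming is so much cheaper than enumeration, here is a minimal Viterbi-style sketch over a table of tag pairs. It is a generic textbook formulation, not DeRose's or Church's actual algorithm, and the probabilities are invented.

```python
# Made-up model: P(tag | previous tag) and P(word | tag).
TRANS = {
    "<s>": {"ART": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "ART": {"NOUN": 0.9, "VERB": 0.05, "MODAL": 0.01},
    "NOUN": {"VERB": 0.6, "NOUN": 0.3, "MODAL": 0.1},
    "VERB": {"NOUN": 0.5, "VERB": 0.1, "ART": 0.4},
    "MODAL": {"VERB": 0.9, "NOUN": 0.1},
}
EMIT = {
    "ART": {"the": 0.7},
    "NOUN": {"can": 0.01, "rusts": 0.001},
    "VERB": {"can": 0.005, "rusts": 0.01},
    "MODAL": {"can": 0.3},
}
TAGS = ["ART", "NOUN", "VERB", "MODAL"]


def viterbi(words):
    """Return (probability, tag path) of the most likely tag sequence."""
    # best[tag] = (probability, path) of the best sequence ending in tag.
    best = {
        t: (TRANS["<s>"].get(t, 0.0) * EMIT[t].get(words[0], 0.0), [t])
        for t in TAGS
    }
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            e = EMIT[t].get(word, 0.0)
            # Keep only the best predecessor for each tag instead of all paths.
            prob, path = max(
                ((p * TRANS[prev].get(t, 0.0) * e, prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi(["the", "can", "rusts"])
print(prob, path)
```

Each word costs on the order of the square of the tag-set size, so the total work grows linearly with sentence length rather than exponentially.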

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov Models are now the standard method for part-of-speech assignment.

Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
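The observation about "the", "a", and "eat" can be checked on a toy scale by comparing the contexts in which each word occurs. The sketch below covers only this first step, building context vectors and measuring their similarity; the five sentences are invented, and actual tag induction would go on to cluster words iteratively over a large corpus.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Count the left and right neighbours of every word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy untagged corpus (an assumption).
sentences = [
    ["we", "eat", "the", "apple"],
    ["they", "eat", "a", "pear"],
    ["the", "dog", "barks"],
    ["a", "dog", "sleeps"],
    ["we", "see", "the", "pear"],
]
vectors = context_vectors(sentences)
sim_the_a = cosine(vectors["the"], vectors["a"])
sim_the_eat = cosine(vectors["the"], vectors["eat"])
print(sim_the_a, sim_the_eat)
```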

Both supervised and unsupervised taggers can be further subdivided into rule-based, stochastic, and neural approaches.

Other taggers and methods

Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill Tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity. Unlike the Brill tagger where the rules are ordered sequentially, the POS and morphological tagging toolkit RDRPOSTagger stores rules in the form of a Ripple Down Rules tree.
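To make the idea of applying learned rule patterns concrete, here is a minimal sketch of the application step only. The single rule (change NN to VB when the previous tag is TO) is a commonly cited illustration written by hand here; an actual Brill tagger learns an ordered list of such rules from a tagged corpus and supports many more rule templates.

```python
def apply_rules(tagged, rules):
    """Apply ordered (from_tag, to_tag, previous_tag) rules to an initial tagging."""
    tags = [tag for _, tag in tagged]
    for from_tag, to_tag, prev_tag in rules:
        # Each rule is applied over the whole sentence before the next rule runs.
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Initial tagging, e.g. from a most-frequent-tag baseline (toy data).
initial = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]
rules = [("NN", "VB", "TO")]  # hand-written illustrative rule
corrected = apply_rules(initial, rules)
print(corrected)
```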

Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, Maximum entropy classifier, Perceptron, and Nearest-neighbor have all been tried, and most can achieve accuracy above 95%.

A direct comparison of several methods is reported (with references) at [3]. This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable.

However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported there are the best that can be achieved with a given approach; nor even the best that have been achieved with a given approach.

A more recent development is the structure regularization method for part-of-speech tagging, which achieves 97.36% accuracy on the standard benchmark dataset.^[4]

Issues

While there is broad agreement about basic categories, a number of edge cases make it difficult to settle on a single "correct" set of tags, even in a single language such as English. For example, it is hard to say whether "fire" is an adjective or a noun in

 the big green fire truck

A second important example is the use/mention distinction, as in the following example, where "blue" could be replaced by a word from any POS (the Brown Corpus tag set appends the suffix "-NC" in such cases):

 the word "blue" has 4 letters.

Words in a language other than that of the "main" text are commonly tagged as "foreign", usually in addition to a tag for the role the foreign word is actually playing in context.

There are also many cases where POS categories and "words" do not map one to one, for example:

 David's

 gonna

 don't

 vice versa

 first-cut

 cannot

 pre- and post-secondary

 look (a word) up

In the last example, "look" and "up" arguably function as a single verbal unit, despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems.

Part-of-speech tagging (词性标注)

http://www.hankcs.com/nlp/part-of-speech-tagging.html

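Part-of-speech tagging (POS tagging), also called word-class tagging or simply tagging, is the procedure of assigning the correct part of speech to every word in a segmentation result, that is, deciding whether each word is a noun, a verb, an adjective, or some other category. For Chinese this is comparatively easy, because Chinese words rarely change part of speech: most words have a single part of speech, or their most frequent part of speech is far more frequent than the runner-up. Reportedly, simply choosing each word's most frequent part of speech is enough to build a Chinese tagger with 80% accuracy.

An HMM achieves higher accuracy. This article introduces the POS tagging module in HanLP.

Open-source project

The code for this article has been integrated into HanLP and open-sourced: http://www.hankcs.com/nlp/hanlp.html

Training

HanLP uses a first-order hidden Markov model, in which the hidden states are parts of speech and the observed states are words.

Corpus

The training corpus is the 2014 People's Daily (人民日报) segmented corpus: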

人民网/nz 1月1日/t 讯/ng 据/p 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v ，/w 美国/nsf 华尔街/nsf 股市/n 在/p 2013年/t 的/ude1 最后/f 一天/mq 继续/v 上涨/vn ，/w 和/cc [全球/n 股市/n]/nz 一样/uyy ，/w 都/d 以/p [最高/a 纪录/n]/nz 或/c 接近/v [最高/a 纪录/n]/nz 结束/v 本/rz 年/qt 的/ude1 交易/vn 。/w
《/w [纽约/nsf 时报/n]/nz 》/w 报道/v 说/v ，/w 标普/nz 500/m 指数/n 今年/t 上升/vi 29.6%/m ，/w 为/p 1997年/t 以来/f 的/ude1 最大/gm 涨幅/n ；/w [道琼斯/ntc 工业/n 平均/a 指数/n]/nz 上升/vi 26.5%/m ，/w 为/p 1996年/t 以来/f 的/ude1 最大/gm 涨幅/n ；/w [纳斯/nrf 达/v 克/q]/nz 上涨/vi 38.3%/m 。/w
就/d 12月31日/t 来说/uls ，/w 由于/p 就业/vn 前景/n 看好/v 和/cc [经济/n 增长/v]/nz 明年/t 可能/v 加速/vn ，/w 消费者/n 信心/n 上升/vi 。/w 工商/n 协进会/nis （/w ConferenceBoard/x ）/w 报告/n ，/w 12月/t 消费者/n 信心/n 上升/vi 到/v 78.1/m ，/w 明显/a 高于/v 11月/t 的/ude1 72/m 。/w
另据/nz 《/w [华尔街/nsf 日报/n]/nz 》/w 报道/v ，/w 2013年/t 是/vshi 1995年/t 以来/f [美国/nsf 股市/n]/nz 表现/v 最好/d 的/ude1 一年/mq 。/w 这/rzv 一年/mq 里/f ，/w 投资/v [美国/nsf 股市/n]/nz 的/ude1 明智/a 做法/n 是/vshi 追/v 着/uzhe “/w 傻钱/nz ”/w 跑/v 。/w 所谓/v 的/ude1 “/w 傻钱/nz ”/w 策略/n ，/w 其实/d 就是/v 买入/vn 并/cc 持有/v 美国/nsf 股票/n 这样/rzv 的/ude1 普通/a 组合/vn 。/w 这个/rz 策略/n 要/v 比/p [对冲/vn 基金/n]/nz 和/cc 其它/rz 专业/n 投资者/nnd 使用/v 的/ude1 更为/d 复杂/a 的/ude1 投资/vn 方法/n 效果/n 好/a 得/ude3 多/a 。/w （/w 老/a 任/v ）/w
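A corpus line in this format can be turned into (word, tag) pairs with a short parser. The sketch below assumes that tokens are separated by whitespace and that bracketed compounds such as [纽约/nsf 时报/n]/nz should be flattened to their inner words; whether HanLP itself keeps the compound tag instead is not stated here.

```python
import re


def parse_line(line):
    """Split a 'word/tag word/tag ...' corpus line into (word, tag) pairs."""
    # Drop compound brackets: the opening "[" and the closing "]/tag".
    flat = re.sub(r"\]/\w+", "", line).replace("[", "")
    pairs = []
    for token in flat.split():
        # rpartition splits on the last "/", so a word containing "/" survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "据/p 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v"
pairs = parse_line(sample)
print(pairs)
```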

Word and POS frequency dictionary

Counting how often each word occurs with each part of speech yields the core dictionary:

爱 v 3622 vn 598
爱因斯坦 nrf 20
爱国 a 178
爱国主义 n 68
飙升 v 200 vn 8
顺风 vi 27 vn 2
顺风吹火 i 1
顺风球 n 1
顺风耳 n 4
顺风车 nz 126
购 v 217 vg 151 vn 106
购书 v 7 vn 5
购买 v 3875 vn 637
购买人 n 7
购买力 n 42
购买户 n 1
购买欲 n 1
购买群 n 1
购买者 n 93
购买证 n 1
购入 v 115 vn 18
……
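Entries in this layout are straightforward to load. The sketch below assumes each line is a word followed by whitespace-separated tag and count pairs, as in the excerpt above, and then lists which words are ambiguous and which tag is most frequent for each.

```python
def parse_entry(line):
    """Split 'word tag1 count1 tag2 count2 ...' into (word, {tag: count})."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    return word, {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}


# A few lines copied from the dictionary excerpt.
lines = [
    "爱 v 3622 vn 598",
    "爱国 a 178",
    "飙升 v 200 vn 8",
    "顺风车 nz 126",
    "购 v 217 vg 151 vn 106",
]
dictionary = dict(parse_entry(line) for line in lines)

# Words that carry more than one tag, and the most frequent tag of each word.
ambiguous = {word: tags for word, tags in dictionary.items() if len(tags) > 1}
most_frequent = {word: max(tags, key=tags.get) for word, tags in dictionary.items()}
print(sorted(ambiguous), most_frequent)
```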

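As the dictionary shows, Chinese words do tend to have a single part of speech, and the ambiguity that exists is concentrated on verb (v) versus nominal verb (vn). In addition, the 2014 People's Daily segmented corpus I obtained does not seem to have been rigorously proofread by hand: many words have only one part of speech, and there are quite a few errors. Perhaps when I get the chance (with the financial means or the academic background), I can train on a higher-quality corpus. Fortunately, HanLP also maintains a general-purpose corpus processing package, so I will leave that as groundwork for later.

Transition matrix

Counting the transition frequency of each tag yields the transition matrix.

In fact, the complete transition matrix is very large; please download it to view: 词性标注转移矩阵.xls

Tagging

From the transition matrix above and the word frequencies in the core dictionary, the initial, transition, and emission probabilities of the HMM can be computed, and the tagging problem solved. For the Viterbi algorithm and its implementation, see 《通用维特比算法的Java实现》 ("A Generic Java Implementation of the Viterbi Algorithm").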
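The conversion from counts to probabilities can be sketched as follows. This is plain maximum-likelihood estimation without smoothing, which may differ from what HanLP actually does. The dictionary counts are copied from the excerpt above, while the transition counts (including the "<s>" start row used for initial probabilities) are invented for illustration.

```python
from collections import Counter


def emission_probs(dictionary):
    """P(word | tag) = count(word, tag) / total count of tag in the dictionary."""
    tag_totals = Counter()
    for tag_counts in dictionary.values():
        tag_totals.update(tag_counts)
    return {
        tag: {w: tc[tag] / total for w, tc in dictionary.items() if tag in tc}
        for tag, total in tag_totals.items()
    }


def normalize_rows(counts):
    """Turn each row of a transition count matrix into probabilities."""
    return {
        prev: {tag: c / sum(row.values()) for tag, c in row.items()}
        for prev, row in counts.items()
    }


# Counts taken from the dictionary excerpt (only three entries, so the
# resulting probabilities are illustrative, not HanLP's real values).
dictionary = {
    "爱": {"v": 3622, "vn": 598},
    "购买": {"v": 3875, "vn": 637},
    "购买力": {"n": 42},
}
# Hypothetical transition counts; the "<s>" row gives the initial probabilities.
transition_counts = {"<s>": {"rr": 3, "n": 1}, "rr": {"ude1": 2, "v": 2}}

emit = emission_probs(dictionary)
trans = normalize_rows(transition_counts)
print(emit["vn"], trans["<s>"])
```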

Testing

Take "我的爱就是爱自然语言处理" ("My love is loving natural language processing") as an example:

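// Sentence to segment and tag
String text = "我的爱就是爱自然语言处理";
Segment segment = new Segment();
// Segment with HMM POS tagging disabled ("未标注" = untagged)
System.out.println("未标注：" + segment.seg(text));
// Enable HMM POS tagging and segment again ("标注后" = tagged)
segment.enableSpeechTag(true);
System.out.println("标注后：" + segment.seg(text));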

Output (未标注 = untagged, 标注后 = tagged):

未标注：[我/rr, 的/ude1, 爱/v, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

标注后：[我/rr, 的/ude1, 爱/vn, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

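The two occurrences of "爱" now receive different parts of speech: the first is a nominal verb (vn), the second a verb (v).

Another example: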

未标注：[教授/nnt, 正在/d, 教授/nnt, 自然语言/gm, 处理/vn, 课程/n]

标注后：[教授/nnt, 正在/d, 教授/v, 自然语言/gm, 处理/vn, 课程/n]

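HanLP's POS tagging is showing its first results.

HanLP POS tag set

The HMM POS tagging model used by HanLP was trained on the 2014 People's Daily segmented corpus, to which a small number of words found only in the 1998 People's Daily corpus were later added. HanLP's tag set is therefore compatible with 《ICTPOS3.0汉语词性标记集》 (the ICTPOS3.0 Chinese POS tag set) and with 《现代汉语语料库加工规范——词语切分与词性标注》 (Modern Chinese Corpus Processing Specification: Word Segmentation and POS Tagging).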

It is unclear whether it is best to treat words such as "be", "have", and "do" as categories in their own right (as in the Brown Corpus), or as simply verbs (as in the LOB Corpus and the Penn Treebank). "be" has more forms than other English verbs, and occurs in quite different grammatical contexts, complicating the issue.

The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use, and include versions for multiple languages.

POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit may be virtually impossible. At the other extreme, S. Petrov, D. Das, and R. McDonald ("A Universal Part-of-Speech Tagset" http://arxiv.org/abs/1104.2086) have proposed a "universal" tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition, etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier on smaller tag sets.

A different issue is that some cases are in fact ambiguous. Beatrice Santorini gives examples in "Part-of-speech Tagging Guidelines for the Penn Treebank Project," (3rd rev, June 1990 [4]), including the following (p. 32) case in which entertaining can be either an adjective or a verb, and there is no syntactic way to decide:

 The Duchess was entertaining last night.


