自然语言18_Named-entity recognition
python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share
命名实体(Named Entity)类别识别
除了在预测用户意图方面的用途,查询日志还可以用来识别命名实体。命名实体识别是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、时 间、日期、货币及其他专有名词等。它是自然语言处理实用化的重要内容,在信息提取、句法分析、机器翻译等应用领域中具有重要的基础性作用。命名实体识别一 方面要识别实体边界,另一方面要识别实体类别(人名、地名、机构名或其他)。就汉语系统来讲,确定实体边界主要和分词相关,发现命名实体的基本方法,一般 首先找一些与定义相关的特征词,例如:什么是XX,XX是什么,这是XX。找到具有这样模式的查询串后,即可以在查询日志中通过频率统计等方法,找到命名 实体。这里重点讨论第二方面的内容,即类别识别。
之所以会用查询日志来进行命名实体的类别识别,是因为命名实体的类别并非是一个封闭集,而是一个不断变化着的集合。一个命名实体,随着时间的变化, 往往会具有不同的属性。以大家熟悉的"哈利·波特"为例,它开始是一部小说,然后又推出了同名的电影,后来还出了游戏,而这一过程是随着时间变化的,也就 是说在不同时间段,这些类别在用户查询需求中受关注程度是不一样的。
类别分析首先统计和命名实体相关的查询,如图6-9所示[Jiang 2010]。
|
| 图6-9 命名实体的查询次数 |
对于这些查询,如果要把它对应到相应的3种类型:书、游戏和电影上去,可以采用WS-LDA(Weakly Supervised Latent Dirichlet Allocation)模型[Guo 2009b],通过训练的方式得到对于某个含实体的查询词属于某一类型的概率模型,然后在查询端采用训练出的模型判定某个含实体查询潜在的类别,如图 6-10所示。
|
| 图6-10 实体分类LDA模型 |
以电影、图片、小说、摘要、评论等与实体相伴随的词和事先标注的样本集作为训练对象,得到以下的概率分布,如图6-11所示。
|
| 图6-11 实体分类训练集 |
当实际查询某个实体时,根据含该实体的查询中伴随次出现的频率,作为w,即可用上述模型进行预测,以得到某查询属于电影、书或是游戏的概率。
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
- Jim bought 300 shares of Acme Corp. in 2006.
And producing an annotated block of text that highlights the names of entities:
- [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored 93.39% of F-measure while human annotators scored 97.60% and 96.95%.[1][2]
Contents
Problem definition
In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stands for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as terms for certain biological species and substances.[3]
Full named-entity recognition is often broken down, conceptually and possibly also in implementations,[4] as two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location and other[5]). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking.
Temporal expressions and some numerical expressions (i.e., money, percentages, etc.) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.[6]
Certain hierarchies of named entity types have been proposed in the literature. BBN categories, proposed in 2002, is used for Question Answering and consists of 29 types and 64 subtypes.[7] Sekine's extended hierarchy, proposed in 2002, is made of 200 subtypes.[8] More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.[9]
Formal evaluation
To evaluate the quality of a NER system's output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when their last name follows is scored as ½ accuracy).
In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:[5]
- Precision is the number of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. I.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
- Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions.
- F1 score is the harmonic mean of these two.
It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute to either precision or recall.
Evaluation models based on a token-by-token matching have been proposed.[10] Such models are able to handle also partially overlapping matches, yet fully rewarding only exact matches. They allow a finer grained evaluation and comparison of extraction systems, taking into account also the degree of mismatch in non-exact predictions.
Approaches
NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semisupervised approaches have been suggested to avoid part of the annotation effort.[11][12]
Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.[13]
Problem domains
Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.[14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.
Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech conversations. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products. There has been also considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.[15]
Current challenges and research
Despite the high F1 numbers reported on the MUC-7 dataset, the problem of Named Entity Recognition is far from being solved. The main efforts are directed to reducing the annotation labor by employing semi-supervised learning,[11][16] robust performance across domains[17][18] and scaling up to fine-grained entity types.[8][19] In recent years, many projects have turned to crowdsourcing, which is a promising solution to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.[21]
A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia[22][23][24] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:
<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> is a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>
Software
- GATE supports NER across many languages and domains out of the box, usable via graphical interface and also Java API
- OpenNLP includes rule-based and statistical named-entity recognition
- Stanford University also has the Stanford Named Entity Recognizer
https://study.163.com/provider/400000000398149/index.htm?share=2&shareId=400000000398149(博主视频教学主页)

自然语言18_Named-entity recognition的更多相关文章
- 自然语言18.1_Named Entity Recognition with NLTK
QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...
- Java自然语言处理NLP工具包
1. Java自然语言处理 LingPipe LingPipe是一个自然语言处理的Java开源工具包.LingPipe目前已有很丰富的功能,包括主题分类(Top Classification).命名实 ...
- 自然语言处理领域重要论文&资源全索引
自然语言处理(NLP)是人工智能研究中极具挑战的一个分支.随着深度学习等技术的引入,NLP领域正在以前所未有的速度向前发展.但对于初学者来说,这一领域目前有哪些研究和资源是必读的?最近,Kyubyon ...
- 自然语言处理系列-4条件随机场(CRF)及其tensorflow实现
前些天与一位NLP大牛交流,请教其如何提升技术水平,其跟我讲务必要重视“NLP的最基本知识”的掌握.掌握好最基本的模型理论,不管是对日常工作和后续论文的发表都有重要的意义.小Dream听了不禁心里一颤 ...
- 【NLP汉语自然语言处理与实践】分词_笔记
一.两种分词标准: 1. 粗粒度. 将词作为最小基本单位.比如:浙江大学. 主要用于自然语言处理的各种应用. 2. 细粒度. 不仅对词汇继续切分,也对词汇内部的语素进行切分.比如:浙江/大学. 主要用 ...
- 达观数据CTO纪达麒:小标注数据量下自然语言处理实战经验
自然语言处理在文本信息抽取.自动审校.智能问答.情感分析等场景下都有非常多的实际应用需求,在人工智能领域里有极为广泛的应用场景.然而在实际工程应用中,最经常面临的挑战是我们往往很难有大量高质量的标注语 ...
- 曼孚科技:AI自然语言处理(NLP)领域常用的16个术语
自然语言处理(NLP)是人工智能领域一个十分重要的研究方向.NLP研究的是实现人与计算机之间用自然语言进行有效沟通的各种理论与方法. 本文整理了NLP领域常用的16个术语,希望可以帮助大家更好地理解 ...
- 自然语言处理(NLP)——简介
自然语言处理(NLP Natural Language Processing)是一种专业分析人类语言的人工智能.就是在机器语⾔和⼈类语言之间沟通的桥梁,以实现人机交流的目的. 在人工智能出现之前,机器 ...
- 自然语言17_Chinking with NLTK
https://www.pythonprogramming.net/chinking-nltk-tutorial/?completed=/chunking-nltk-tutorial/ 代码 # -* ...
随机推荐
- 网上找到的一个jquery版网页换肤特效
这个跟我之前在锋利的JQuery那本书里看到的那个一模一样. <!DOCTYPE html> <html> <head> <meta name="& ...
- window配置nginx+php+mysql
博客来源: http://www.cnblogs.com/wuzhenbo/p/3493518.html 启用nginx D:\nginx>nginx.exe 重启 D:\nginx> ...
- android定时器
Handler+Timer+TimerTask 三.采用Handler与timer及TimerTask结合的方法. 1.定义定时器.定时器任务及Handler句柄 private final Time ...
- BIM软件小技巧:Revit2014所有快捷键汇总表格
命令 快捷键 路径 修改 MD 创建>选择; 插入>选择; 注释>选择; 视图>选择; 管理>选择; 修改>选择; 建筑>选择; 结构>选择; 系统 ...
- Entity Framework在WCF中序列化的问题
问题描述 如果你在WCF中用Entity Framework来获取数据并返回实体对象,那么对下面的错误一定不陌生. 接收对 http://localhost:5115/ReService.svc 的 ...
- XAML: x:DeferLoadStrategy, x:Null
x:DeferLoadStrategy="Lazy" - 用于指定一个 UIElement 为一个延迟加载元素 x:Null - null 示例1.x:DeferLoadStrat ...
- java内存管理机制
JAVA 内存管理总结 1. java是如何管理内存的 Java的内存管理就是对象的分配和释放问题.(两部分) 分配 :内存的分配是由程序完成的,程序员需要通过关键字new 为每个对象申请内存空间 ( ...
- Java实现本地 fileCopy
前言: Java中流是重要的内容,基础的文件读写与拷贝知识点是很多面试的考点.故通过本文进行简单测试总结. 2.图展示[文本IO/二进制IO](这是参考自网上的一张总结图,相当经典,方便对比记忆) 3 ...
- iOS9 支持http
在Info.plist中添加NSAppTransportSecurity类型Dictionary. 在NSAppTransportSecurity下添加NSAllowsArbitraryLoads类型 ...
- Hive 按某列的部分排序 以及 删列操作
脑袋果然还是智商不足. 涉及到的小需求: 某个表test 有一列 tc: a字符串+b字符串+c字符串 拼接组成 把test表,按b字符串排序 输出 遇到的问题: select 里面必须包含 orde ...
