1. Environment setup

CoreNLP 3.5 and later require Java 8 or newer. Chinese processing is memory-hungry: expect roughly 3 GB of memory usage.
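Before loading the models it can help to confirm what the JVM has actually been given. The following is a minimal sketch, not part of the original setup: the class name `HeapCheck` and the 3 GB threshold (taken from the rough figure above) are assumptions for illustration.

```java
class HeapCheck {
    // Assumed threshold, based on the "roughly 3 GB" figure above; adjust as needed.
    static final long REQUIRED_BYTES = 3L * 1024 * 1024 * 1024;

    /** Returns true when the given max heap size reaches the assumed threshold. */
    static boolean hasEnoughHeap(long maxHeapBytes) {
        return maxHeapBytes >= REQUIRED_BYTES;
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will try to use (controlled by -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Max heap (MB): " + maxHeap / (1024 * 1024));
        if (!hasEnoughHeap(maxHeap)) {
            System.out.println("Heap may be too small for the Chinese models; try a larger -Xmx.");
        }
    }
}
```

If the check fails, starting the JVM with a larger heap (for example `-Xmx4g`) is the usual remedy.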

I pull in the dependencies with Maven and use CoreNLP version 3.9.1.

Add the following dependencies to the pom file:

        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.9.1</version>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.9.1</version>
            <classifier>models</classifier>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.9.1</version>
            <classifier>models-chinese</classifier>
        </dependency>

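These three artifacts are the CoreNLP algorithm package, the English models package and the Chinese models package. Together they total 1.43 GB. Maven's default repository is hosted overseas and these artifacts are very large, so it is worth trying a domestic mirror that carries all three. I used my company's own Maven repository.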

2. Calling CoreNLP from code

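Note that because I need Chinese named entity recognition, I have to use the Chinese word segmenter and the Chinese dictionaries. It helps to first look at the contents of the imported Chinese models jar.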

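The Chinese models jar contains a StanfordCoreNLP-chinese.properties file that sets the parameters for Chinese processing: mainly the pipeline steps to run and the locations of the corresponding model files. In practice we may not need every step, or we may want to use different models, so we can write a custom configuration file and load that instead. In my project I simply read the bundled properties file as-is.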

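Note: I only need the ner feature here and would like to drop the other annotators. However, Stanford CoreNLP has a limitation: ner cannot run unless the following annotators run before it:

tokenize, ssplit, pos, lemma

These prerequisites add considerable processing time.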
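One way to keep only what ner needs is to load the bundled settings and override the `annotators` key before building the pipeline. The sketch below shows just the `java.util.Properties` part so it runs without the CoreNLP jars; the class and method names (`NerOnlyProps`, `nerOnly`) are hypothetical, and the base values in `main` are a stand-in for the real file.

```java
import java.util.Properties;

class NerOnlyProps {
    /** Copies the base settings and narrows the annotator list to ner plus its prerequisites. */
    static Properties nerOnly(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        // Drop parse and coref; keep the chain that ner depends on.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for values loaded from StanfordCoreNLP-chinese.properties.
        Properties base = new Properties();
        base.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, coref");
        base.setProperty("tokenize.language", "zh");

        Properties trimmed = nerOnly(base);
        System.out.println(trimmed.getProperty("annotators"));
    }
}
```

The resulting `Properties` object can then be passed to `new StanfordCoreNLP(props)` in place of the unmodified one, which should avoid loading the parser and coref models.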

Let's first walk through this properties file:

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?！？]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner: sets the language and the model (CRF) used by ner. SUTime currently supports only English, not Chinese, so it is set to false.
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
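The `ssplit.boundaryTokenRegex` setting decides which tokens end a sentence. As a rough illustration, the sketch below applies the same pattern with `java.util.regex`, assuming the splitter tests each token as a whole against the pattern; the class name `BoundaryRegexDemo` is hypothetical.

```java
import java.util.regex.Pattern;

class BoundaryRegexDemo {
    // Same pattern as ssplit.boundaryTokenRegex in the properties file.
    static final Pattern BOUNDARY = Pattern.compile("[.。]|[!?！？]+");

    /** True if the whole token matches the boundary pattern (assumed whole-token matching). */
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        // A full stop and a run of !/? count as boundaries; commas and ordinary words do not.
        for (String token : new String[] {"。", "！？", ",", "，", "吧"}) {
            System.out.println(token + " -> " + isBoundary(token));
        }
    }
}
```

Under that assumption, text that is punctuated mostly with commas is treated as one long sentence, because commas never match the pattern.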

So we load this properties file directly in code. Reference code:

package com.baidu.corenlp;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

/**
 * Created by sonofelice on 2018/3/27.
 */
public class TestNLP {

    public void test() throws Exception {
        // Build a StanfordCoreNLP object; the properties select the NLP features,
        // e.g. lemma is lemmatization and ner is named entity recognition
        Properties props = new Properties();
        props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "袁隆平是中国科学院的院士,他于2009年10月到中国山东省东营市东营区永乐机场附近承包了一千亩盐碱地,"
                + "开始种植棉花, 年产量达到一万吨, 哈哈, 反正棣琦说的是假的,逗你玩儿,明天下午2点来我家吃饭吧。"
                + "棣琦是山东大学毕业的,目前在百度做java开发,位置是东北旺东路102号院,手机号14366778890";
        long startTime = System.currentTimeMillis();
        // Create an Annotation that holds only the raw text
        Annotation document = new Annotation(text);
        // Run all configured annotators on the text
        pipeline.annotate(document);
        // Read back the results, sentence by sentence
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // The token text (for Chinese, a word produced by the segmenter)
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tag
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity tag and its normalized value
                String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + " | analysis : {  original : " + ner + "," + " normalized : "
                        + ne + "}");
                // Lemma (a no-op for Chinese)
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(lema);
            }
            // Constituency parse tree of the sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree of the sentence:");
            tree.pennPrint();
            // Dependency graph of the sentence
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependency graph of the sentence:");
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }
        long endTime = System.currentTimeMillis();
        // time is in milliseconds, which is what "seconds * 1000" in the message means
        long time = endTime - startTime;
        System.out.println("The analysis lasts " + time + " seconds * 1000");
        // Coreference chains: each chain holds the set of mentions that refer to the same entity.
        // Sentence numbers and token offsets both start from 1.
        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey() + " ");
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want last token of mention
                // not one following.
                System.out.println(
                        "  " + m + ", i.e., 0-based character offsets [" + tokens.get(m.startIndex - 1).beginPosition()
                                + ", " + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}

When I actually ran it, I kept only the ner-related analysis and commented out the other features. The output was as follows:

::16.000 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos

::19.387 [main] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].

::19.388 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma

::19.389 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner

::21.938 [main] INFO  e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].

::22.099 [main] WARN  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亚 STATE_OR_PROVINCE    MISC,GPE,LOCATION    .  Taking type to be MISC

::22.100 [main] WARN  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亚 州 STATE_OR_PROVINCE    MISC,GPE,LOCATION    .  Taking type to be MISC

::22.100 [main] INFO  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Read  unique entries out of  from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab,  TokensRegex patterns.

::22.532 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse

::35.855 [main] INFO  e.s.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].

::35.859 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref

::43.139 [main] INFO  e.s.n.pipeline.CorefMentionAnnotator - Using mention detector type: rule

::43.148 [main] INFO  e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from  file:

::43.148 [main] INFO  e.s.nlp.wordseg.ChineseDictionary -   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz

::43.329 [main] INFO  e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: .

::43.379 [main] INFO  edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].

::43.380 [main] INFO  e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].

袁隆平 | analysis : {  original : PERSON, normalized : null}

是 | analysis : {  original : O, normalized : null}

中国 | analysis : {  original : ORGANIZATION, normalized : null}

科学院 | analysis : {  original : ORGANIZATION, normalized : null}

的 | analysis : {  original : O, normalized : null}

院士 | analysis : {  original : TITLE, normalized : null}

, | analysis : {  original : O, normalized : null}

他 | analysis : {  original : O, normalized : null}

于 | analysis : {  original : O, normalized : null}

2009年 | analysis : {  original : DATE, normalized : --XX}

10月 | analysis : {  original : DATE, normalized : --XX}

到 | analysis : {  original : O, normalized : null}

中国 | analysis : {  original : COUNTRY, normalized : null}

山东省 | analysis : {  original : STATE_OR_PROVINCE, normalized : null}

东营市 | analysis : {  original : CITY, normalized : null}

东营区 | analysis : {  original : FACILITY, normalized : null}

永乐 | analysis : {  original : FACILITY, normalized : null}

机场 | analysis : {  original : FACILITY, normalized : null}

附近 | analysis : {  original : O, normalized : null}

承包 | analysis : {  original : O, normalized : null}

了 | analysis : {  original : O, normalized : null}

一千 | analysis : {  original : NUMBER, normalized : }

亩 | analysis : {  original : O, normalized : null}

盐 | analysis : {  original : O, normalized : null}

碱地 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

开始 | analysis : {  original : O, normalized : null}

种植 | analysis : {  original : O, normalized : null}

棉花 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

年产量 | analysis : {  original : O, normalized : null}

达到 | analysis : {  original : O, normalized : null}

一万 | analysis : {  original : NUMBER, normalized : }

吨 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

哈哈 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

反正 | analysis : {  original : O, normalized : null}

棣琦 | analysis : {  original : PERSON, normalized : null}

说 | analysis : {  original : O, normalized : null}

的 | analysis : {  original : O, normalized : null}

是 | analysis : {  original : O, normalized : null}

假 | analysis : {  original : O, normalized : null}

的 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

逗 | analysis : {  original : O, normalized : null}

你 | analysis : {  original : O, normalized : null}

玩儿 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

明天 | analysis : {  original : DATE, normalized : XXXX-XX-XX}

下午 | analysis : {  original : TIME, normalized : null}

2点 | analysis : {  original : TIME, normalized : null}

来 | analysis : {  original : O, normalized : null}

我 | analysis : {  original : O, normalized : null}

家 | analysis : {  original : O, normalized : null}

吃饭 | analysis : {  original : O, normalized : null}

吧 | analysis : {  original : O, normalized : null}

。 | analysis : {  original : O, normalized : null}

棣琦 | analysis : {  original : PERSON, normalized : null}

是 | analysis : {  original : O, normalized : null}

山东 | analysis : {  original : ORGANIZATION, normalized : null}

大学 | analysis : {  original : ORGANIZATION, normalized : null}

毕业 | analysis : {  original : O, normalized : null}

的 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

目前 | analysis : {  original : DATE, normalized : null}

在 | analysis : {  original : O, normalized : null}

百度 | analysis : {  original : ORGANIZATION, normalized : null}

做 | analysis : {  original : O, normalized : null}

java | analysis : {  original : O, normalized : null}

开发 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

位置 | analysis : {  original : O, normalized : null}

是 | analysis : {  original : O, normalized : null}

东北 | analysis : {  original : LOCATION, normalized : null}

旺 | analysis : {  original : O, normalized : null}

东路 | analysis : {  original : O, normalized : null}

 | analysis : {  original : NUMBER, normalized : }

号院 | analysis : {  original : O, normalized : null}

, | analysis : {  original : O, normalized : null}

手机号 | analysis : {  original : O, normalized : null}

 | analysis : {  original : NUMBER, normalized : }

 | analysis : {  original : NUMBER, normalized : }

The analysis lasts 819 seconds * 1000

Process finished with exit code

As the log shows, starting the whole project takes quite a while. The analysis itself is also fairly slow, at 819 ms.
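Most of the startup cost in the log comes from loading models (3.4 sec for the POS tagger, 2.5 sec for the NER classifier, 13.3 sec for the parser), so one common way to keep it out of the per-request path is to build the pipeline once and reuse it. The sketch below shows that idea with a small lazy holder; `Lazy` is a hypothetical helper, not a CoreNLP class, and a plain string stands in for the pipeline so the example runs without the CoreNLP jars.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class Lazy<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean initialized;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Runs the factory on first use only; later calls return the cached instance. */
    synchronized T get() {
        if (!initialized) {
            value = factory.get();
            initialized = true;
        }
        return value;
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        // In a real project the supplier would be () -> new StanfordCoreNLP(props).
        Lazy<String> pipeline = new Lazy<>(() -> {
            builds.incrementAndGet();
            return "expensive pipeline";
        });
        pipeline.get();
        pipeline.get();
        System.out.println("Times built: " + builds.get());
    }
}
```

This does not make the first call any faster; it only ensures the model-loading cost is paid once per process.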

The results are also not accurate enough, and they differ somewhat from what I got from the online demo on the official website.
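The output above is tagged token by token, so a name such as 中国科学院 appears as two separate ORGANIZATION tokens. A simple post-processing step is to join adjacent tokens that share the same tag. The following is a minimal sketch of that idea; `EntityMerger` is a hypothetical helper that works on plain word and tag lists rather than on CoreNLP objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EntityMerger {
    /** Joins runs of adjacent tokens that share the same tag, skipping tokens tagged "O". */
    static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: flush the entity collected so far, if any.
                if (!currentTag.equals("O")) {
                    entities.add(current + "/" + currentTag);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                current.append(words.get(i));
            }
        }
        // Flush an entity that runs to the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(current + "/" + currentTag);
        }
        return entities;
    }

    public static void main(String[] args) {
        // Words and tags copied from the first few lines of the output above.
        List<String> words = Arrays.asList("袁隆平", "是", "中国", "科学院", "的", "院士");
        List<String> tags = Arrays.asList("PERSON", "O", "ORGANIZATION", "ORGANIZATION", "O", "TITLE");
        System.out.println(merge(words, tags));
    }
}
```

Note that this naive merge also glues together two distinct entities that happen to be adjacent and share a tag, such as the run of FACILITY tokens 东营区, 永乐 and 机场 in the output.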
