Integrating Chinese Analyzers into Elasticsearch

IK is a lightweight, dictionary-based Chinese word segmentation toolkit that can be integrated through Elasticsearch's plugin mechanism.

I. Integration steps

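1. Create an ik directory under the plugins directory of the Elasticsearch installation.

2. Download the IK plugin release that matches your Elasticsearch version from GitHub: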

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12

3. Extract the plugin archive into that directory and restart Elasticsearch. The startup log should show that the IK plugin has been loaded:

[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] [4EvvJl1] loaded plugin [analysis-ik]
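To confirm the plugin from a script rather than by eye, the startup log can be scanned for "loaded plugin" lines. The following is a minimal sketch that assumes the log line format shown above; the helper name loaded_plugins is made up for this example.

```python
import re

# Matches the "loaded plugin [name]" message written by PluginsService.
LOADED_PLUGIN = re.compile(r"loaded plugin \[([^\]]+)\]")


def loaded_plugins(log_text):
    """Return the plugin names reported as loaded in the given log text."""
    return LOADED_PLUGIN.findall(log_text)


# The log line from the step above.
line = (
    "[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] "
    "[4EvvJl1] loaded plugin [analysis-ik]"
)
print(loaded_plugins(line))
```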

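II. Trying the IK analyzers

IK provides two analyzers, ik_smart and ik_max_word.

The ik_max_word analyzer splits text as exhaustively as possible, producing fine-grained tokens. The sample sentence below means roughly "On this business trip we stayed at the Home Inn express hotel in Yantuan", where 闫团 (Yantuan) is a place name.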

POST _analyze

{

  "analyzer": "ik_max_word",

  "text":"这次出差我们住的是闫团如家快捷酒店"

}

{

  "tokens" : [

    {

      "token" : "这次",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "CN_WORD",

      "position" : 0

    },

    {

      "token" : "出差",

      "start_offset" : 2,

      "end_offset" : 4,

      "type" : "CN_WORD",

      "position" : 1

    },

    {

      "token" : "我们",

      "start_offset" : 4,

      "end_offset" : 6,

      "type" : "CN_WORD",

      "position" : 2

    },

    {

      "token" : "住",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "CN_CHAR",

      "position" : 3

    },

    {

      "token" : "的",

      "start_offset" : 7,

      "end_offset" : 8,

      "type" : "CN_CHAR",

      "position" : 4

    },

    {

      "token" : "是",

      "start_offset" : 8,

      "end_offset" : 9,

      "type" : "CN_CHAR",

      "position" : 5

    },

    {

      "token" : "闫",

      "start_offset" : 9,

      "end_offset" : 10,

      "type" : "CN_CHAR",

      "position" : 6

    },

    {

      "token" : "团",

      "start_offset" : 10,

      "end_offset" : 11,

      "type" : "CN_CHAR",

      "position" : 7

    },

    {

      "token" : "如家",

      "start_offset" : 11,

      "end_offset" : 13,

      "type" : "CN_WORD",

      "position" : 8

    },

    {

      "token" : "快捷酒店",

      "start_offset" : 13,

      "end_offset" : 17,

      "type" : "CN_WORD",

      "position" : 9

    }

  ]

}

ik_smart is comparatively coarse-grained. For this particular sentence it returns the same tokens as ik_max_word:

POST _analyze

{

  "analyzer": "ik_smart",

  "text":"这次出差我们住的是闫团如家快捷酒店"

}

{

  "tokens" : [

    {

      "token" : "这次",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "CN_WORD",

      "position" : 0

    },

    {

      "token" : "出差",

      "start_offset" : 2,

      "end_offset" : 4,

      "type" : "CN_WORD",

      "position" : 1

    },

    {

      "token" : "我们",

      "start_offset" : 4,

      "end_offset" : 6,

      "type" : "CN_WORD",

      "position" : 2

    },

    {

      "token" : "住",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "CN_CHAR",

      "position" : 3

    },

    {

      "token" : "的",

      "start_offset" : 7,

      "end_offset" : 8,

      "type" : "CN_CHAR",

      "position" : 4

    },

    {

      "token" : "是",

      "start_offset" : 8,

      "end_offset" : 9,

      "type" : "CN_CHAR",

      "position" : 5

    },

    {

      "token" : "闫",

      "start_offset" : 9,

      "end_offset" : 10,

      "type" : "CN_CHAR",

      "position" : 6

    },

    {

      "token" : "团",

      "start_offset" : 10,

      "end_offset" : 11,

      "type" : "CN_CHAR",

      "position" : 7

    },

    {

      "token" : "如家",

      "start_offset" : 11,

      "end_offset" : 13,

      "type" : "CN_WORD",

      "position" : 8

    },

    {

      "token" : "快捷酒店",

      "start_offset" : 13,

      "end_offset" : 17,

      "type" : "CN_WORD",

      "position" : 9

    }

  ]

}
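The two console requests above can also be issued from a script. This is a minimal sketch using only the Python standard library. It assumes a local node without authentication at http://localhost:9200, and the helper names are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, no authentication


def build_analyze_request(analyzer, text, base_url=BASE_URL):
    """Build the POST _analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract only the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, base_url=BASE_URL):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.load(resp))
```

Calling analyze("ik_max_word", ...) and analyze("ik_smart", ...) with the same sentence makes it easy to compare the two analyzers side by side.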

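III. Extending the IK dictionary

Because 闫团 is a fairly small place, it is not in IK's dictionary, so it is split into two single characters. We can add it to the dictionary.

Create a my.dic file in the config directory under the IK installation directory and put 闫团 in it. Then edit IKAnalyzer.cfg.xml to register the new dictionary file: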

<properties>

	<comment>IK Analyzer extension configuration</comment>

	<!-- Configure your own extension dictionaries here -->

	<entry key="ext_dict">my.dic</entry>

	<!-- Configure your own extension stopword dictionaries here -->

	<entry key="ext_stopwords"></entry>

	<!-- Configure a remote extension dictionary here -->

	<!-- <entry key="remote_ext_dict">words_location</entry> -->

	<!-- Configure a remote extension stopword dictionary here -->

	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->

</properties>
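The dictionary step can be scripted as well. The sketch below appends words to a dictionary file as UTF-8 text with one word per line, and reads back which files the ext_dict entry references. The helper names and the trimmed-down sample config are assumptions made for this example; they are not part of the IK plugin.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def add_words(dic_path, words):
    """Append words to a dictionary file (UTF-8, one word per line), skipping duplicates."""
    path = Path(dic_path)
    existing = []
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        existing = [line.strip() for line in lines if line.strip()]
    merged = existing + [w for w in words if w not in existing]
    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged


def ext_dicts(cfg_xml):
    """Return the dictionary files listed in the ext_dict entry of the config text."""
    root = ET.fromstring(cfg_xml)
    for entry in root.findall("entry"):
        if entry.get("key") == "ext_dict":
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []


# Trimmed-down version of the configuration shown above.
SAMPLE_CFG = """<properties>
  <entry key="ext_dict">my.dic</entry>
  <entry key="ext_stopwords"></entry>
</properties>"""

print(ext_dicts(SAMPLE_CFG))
```

For the case described here, add_words would be called with the path to my.dic in the IK config directory and the list ["闫团"].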

Restart Elasticsearch and run the request again; the place name is now returned as a single token:

POST _analyze

{

  "analyzer": "ik_smart",

  "text":"这次出差我们住的是闫团如家快捷酒店"

}

{

  "tokens" : [

    {

      "token" : "这次",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "CN_WORD",

      "position" : 0

    },

    {

      "token" : "出差",

      "start_offset" : 2,

      "end_offset" : 4,

      "type" : "CN_WORD",

      "position" : 1

    },

    {

      "token" : "我们",

      "start_offset" : 4,

      "end_offset" : 6,

      "type" : "CN_WORD",

      "position" : 2

    },

    {

      "token" : "住",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "CN_CHAR",

      "position" : 3

    },

    {

      "token" : "的",

      "start_offset" : 7,

      "end_offset" : 8,

      "type" : "CN_CHAR",

      "position" : 4

    },

    {

      "token" : "是",

      "start_offset" : 8,

      "end_offset" : 9,

      "type" : "CN_CHAR",

      "position" : 5

    },

    {

      "token" : "闫团",

      "start_offset" : 9,

      "end_offset" : 11,

      "type" : "CN_WORD",

      "position" : 6

    },

    {

      "token" : "如家",

      "start_offset" : 11,

      "end_offset" : 13,

      "type" : "CN_WORD",

      "position" : 7

    },

    {

      "token" : "快捷酒店",

      "start_offset" : 13,

      "end_offset" : 17,

      "type" : "CN_WORD",

      "position" : 8

    }

  ]

}
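A quick way to see the effect of the dictionary change is to compare the token lists before and after it. The sketch below uses the token strings from the two ik_smart responses in this article; the helper name token_diff is made up for this example.

```python
def token_diff(before, after):
    """Return (added, removed) token strings between two analysis results."""
    before_set, after_set = set(before), set(after)
    added = [t for t in after if t not in before_set]
    removed = [t for t in before if t not in after_set]
    return added, removed


# Token strings copied from the ik_smart responses before and after adding my.dic.
before = ["这次", "出差", "我们", "住", "的", "是", "闫", "团", "如家", "快捷酒店"]
after = ["这次", "出差", "我们", "住", "的", "是", "闫团", "如家", "快捷酒店"]

added, removed = token_diff(before, after)
print("added:", added)
print("removed:", removed)
```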

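IV. Trying the HanLP analyzer and a custom dictionary

HanLP is a Java toolkit made up of a collection of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named entity recognition, syntactic parsing, and text classification. It offers a rich API and is widely used with search platforms such as Lucene, Solr, and Elasticsearch. For segmentation it supports algorithms including shortest-path, N-shortest-path, and CRF segmentation.

Download the HanLP plugin package from: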

https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip

Install the HanLP plugin package:

bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip

-> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip

-> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip

[=================================================] 100%

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

@     WARNING: plugin requires additional permissions     @

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

* java.io.FilePermission plugins/analysis-hanlp/data/-#plus read,write,delete

* java.io.FilePermission plugins/analysis-hanlp/hanlp.cache#plus read,write,delete

* java.lang.RuntimePermission getClassLoader

* java.lang.RuntimePermission setContextClassLoader

* java.net.SocketPermission * connect,resolve

* java.util.PropertyPermission * read,write

See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html

for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y

-> Installed analysis-hanlp

Analyze the text with the hanlp_standard analyzer:

POST _analyze

{

  "analyzer": "hanlp_standard",

  "text":"这次出差我们住的是闫团如家快捷酒店"

}

{

  "tokens" : [

    {

      "token" : "这次",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "r",

      "position" : 0

    },

    {

      "token" : "出差",

      "start_offset" : 2,

      "end_offset" : 4,

      "type" : "vi",

      "position" : 1

    },

    {

      "token" : "我们",

      "start_offset" : 4,

      "end_offset" : 6,

      "type" : "rr",

      "position" : 2

    },

    {

      "token" : "住",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "vi",

      "position" : 3

    },

    {

      "token" : "的",

      "start_offset" : 7,

      "end_offset" : 8,

      "type" : "ude1",

      "position" : 4

    },

    {

      "token" : "是",

      "start_offset" : 8,

      "end_offset" : 9,

      "type" : "vshi",

      "position" : 5

    },

    {

      "token" : "闫团",

      "start_offset" : 9,

      "end_offset" : 11,

      "type" : "nr",

      "position" : 6

    },

    {

      "token" : "如家",

      "start_offset" : 11,

      "end_offset" : 13,

      "type" : "r",

      "position" : 7

    },

    {

      "token" : "快捷酒店",

      "start_offset" : 13,

      "end_offset" : 17,

      "type" : "ntch",

      "position" : 8

    }

  ]

}
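Unlike IK, which labels tokens as CN_WORD or CN_CHAR, the type field in this response carries what appear to be HanLP part-of-speech tags (nr, for instance, is to our understanding the tag for person names). Grouping tokens by type makes that easier to inspect. This is a minimal sketch: the response is abbreviated to the token and type fields of the output above, and the helper name is made up for this example.

```python
def tokens_by_type(response):
    """Group token strings by the value of their type field."""
    groups = {}
    for token in response["tokens"]:
        groups.setdefault(token["type"], []).append(token["token"])
    return groups


# Abbreviated copy of the hanlp_standard response above (token and type only).
response = {
    "tokens": [
        {"token": "这次", "type": "r"},
        {"token": "出差", "type": "vi"},
        {"token": "我们", "type": "rr"},
        {"token": "住", "type": "vi"},
        {"token": "的", "type": "ude1"},
        {"token": "是", "type": "vshi"},
        {"token": "闫团", "type": "nr"},
        {"token": "如家", "type": "r"},
        {"token": "快捷酒店", "type": "ntch"},
    ]
}

for tag, words in tokens_by_type(response).items():
    print(tag, words)
```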

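As we can see, HanLP treats 闫团 as a single token out of the box.

The following test, on a sentence meaning "Yantuan is a small place", shows that HanLP does not treat 小地方 ("small place") as a single token: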

POST _analyze

{

  "analyzer": "hanlp_standard",

  "text":"闫团是一个小地方"

}

{

  "tokens" : [

    {

      "token" : "闫团",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "nr",

      "position" : 0

    },

    {

      "token" : "是",

      "start_offset" : 2,

      "end_offset" : 3,

      "type" : "vshi",

      "position" : 1

    },

    {

      "token" : "一个",

      "start_offset" : 3,

      "end_offset" : 5,

      "type" : "mq",

      "position" : 2

    },

    {

      "token" : "小",

      "start_offset" : 5,

      "end_offset" : 6,

      "type" : "a",

      "position" : 3

    },

    {

      "token" : "地方",

      "start_offset" : 6,

      "end_offset" : 8,

      "type" : "n",

      "position" : 4

    }

  ]

}

To customize segmentation, create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add 小地方 to it.

Then copy hanlp.properties from the plugin package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and append the new dictionary to CustomDictionaryPath:

CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
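The CustomDictionaryPath value is dense, so it helps to expand it into one entry per dictionary. The sketch below follows our reading of the HanLP convention, which should be treated as an assumption: entries are separated by semicolons, an entry that starts with a space lives in the same directory as the previous full path, and an optional word after the file name is the default part-of-speech tag for that dictionary. The function name is made up for this example.

```python
import posixpath


def parse_custom_dictionary_path(value):
    """Expand a CustomDictionaryPath value into (path, default_tag) pairs."""
    entries = []
    last_dir = ""
    for raw in value.split(";"):
        if not raw.strip():
            continue  # skip the empty piece after a trailing semicolon
        same_dir = raw.startswith(" ")  # leading space: reuse the previous directory
        parts = raw.split()
        name = parts[0]
        tag = parts[1] if len(parts) > 1 else None
        if same_dir:
            path = posixpath.join(last_dir, name)
        else:
            path = name
            last_dir = posixpath.dirname(name)
        entries.append((path, tag))
    return entries


# The value from hanlp.properties above.
VALUE = (
    "data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt;"
    " ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt;"
    " ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;"
    "data/dictionary/custom/my.dic;"
)

for path, tag in parse_custom_dictionary_path(VALUE):
    print(path, tag)
```

Because my.dic is written with its full relative path and no leading space, it does not depend on the directory of the entry before it.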

Restart Elasticsearch and run the test again (this request uses the hanlp analyzer):

POST _analyze

{

  "analyzer": "hanlp",

  "text":"闫团是一个小地方"

}

{

  "tokens" : [

    {

      "token" : "闫团",

      "start_offset" : 0,

      "end_offset" : 2,

      "type" : "nr",

      "position" : 0

    },

    {

      "token" : "是",

      "start_offset" : 2,

      "end_offset" : 3,

      "type" : "vshi",

      "position" : 1

    },

    {

      "token" : "一个",

      "start_offset" : 3,

      "end_offset" : 5,

      "type" : "mq",

      "position" : 2

    },

    {

      "token" : "小地方",

      "start_offset" : 5,

      "end_offset" : 8,

      "type" : "n",

      "position" : 3

    }

  ]

}
