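1. Where pinyin tokenization is used

Pinyin tokenization is common in everyday life, and you may well use it every day. Open Taobao and type the pinyin "zhonghua": the suggestions under the search box list products whose names contain the matching Chinese word "中华".

Pinyin tokenization suggests the matching Chinese text for the pinyin a user types, which improves the search experience and speeds up searching. The rest of this post shows how to configure and use pinyin + IK analysis in Elasticsearch 5.1.1.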

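2. Downloading and installing the IK analyzer

The IK analyzer needs little introduction. In short, it is currently a very widely used Chinese analyzer with good segmentation quality. In ES development, nine times out of ten the Chinese analyzer in use is IK.

Download: https://github.com/medcl/elasticsearch-analysis-ik
Stop Elasticsearch before configuring the plugin, and restart it once configuration is done.
The IK version must match the ES version in use; the README explains this. I am using ES 5.1.1 with IK 5.1.1. (You may wonder why the IK version jumps from 1.x straight to 5.x. The reason is the unification of version numbers by Elastic: previously ES was at 2.x, Logstash at 2.x, Kibana at 4.x, and IK at 1.x, which was confusing. From 5.0 onward the version numbers are aligned, so if you run ES 5.1.1 you simply use 5.1.1 of the other software as well.)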
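The version rule above can be checked mechanically before installing. The following is a minimal sketch, assuming the release archive follows the naming pattern seen later in this post (elasticsearch-analysis-ik-5.1.1.zip); the function names are illustrative and not part of any plugin tooling.

```python
import re


def plugin_version(zip_name: str) -> str:
    """Extract the version from a name like elasticsearch-analysis-ik-5.1.1.zip."""
    match = re.search(r"-(\d+\.\d+\.\d+)\.zip$", zip_name)
    if match is None:
        raise ValueError(f"no version found in {zip_name!r}")
    return match.group(1)


def versions_match(es_version: str, zip_name: str) -> bool:
    """Return True when the plugin archive targets exactly this ES version."""
    return plugin_version(zip_name) == es_version


print(versions_match("5.1.1", "elasticsearch-analysis-ik-5.1.1.zip"))  # True
print(versions_match("5.1.1", "elasticsearch-analysis-ik-5.0.0.zip"))  # False
```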

After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you do not have it) by running:

    mvn package

When the build succeeds it creates a target folder. In elasticsearch-analysis-ik-master/target/releases you will find elasticsearch-analysis-ik-5.1.1.zip, which is the install file we need. Unzipping elasticsearch-analysis-ik-5.1.1.zip gives the following contents:

commons-codec-1.9.jar

commons-logging-1.2.jar

config

elasticsearch-analysis-ik-5.1.1.jar

httpclient-4.5.2.jar

httpcore-4.4.4.jar

plugin-descriptor.properties

Then create a folder named ik under elasticsearch-5.1.1/plugins and copy the unzipped contents of elasticsearch-analysis-ik-5.1.1.zip into elasticsearch-5.1.1/plugins/ik.
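The unzip-and-copy step can also be scripted. This is a rough sketch using only the Python standard library; the helper name install_plugin and the paths in the usage note are assumptions for illustration, so adjust them to your own layout.

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path: str, es_home: str, plugin_name: str) -> Path:
    """Unzip a plugin release archive into <es_home>/plugins/<plugin_name>."""
    target = Path(es_home) / "plugins" / plugin_name
    # Create plugins/<plugin_name> if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For the layout used in this post, the call would be install_plugin("elasticsearch-analysis-ik-master/target/releases/elasticsearch-analysis-ik-5.1.1.zip", "elasticsearch-5.1.1", "ik"), run while Elasticsearch is stopped.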

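3. Downloading and installing the pinyin analyzer

Download the pinyin analyzer from:
https://github.com/medcl/elasticsearch-analysis-pinyin

Installation is the same as for IK: download, build, and add to ES. The steps are not repeated here.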

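4. Analyzer tests

Once IK and pinyin are configured, restart ES. If ES reports errors during startup, something is wrong with the installation; if it starts without errors, the configuration succeeded.

4.1 IK analyzer test

Create an index: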

curl -XPUT "http://localhost:9200/index"

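Test the analysis (the text is "中华人民共和国国歌", which is what the offsets in the result below correspond to):

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国国歌"

Result: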

{
    "tokens": [{
        "token": "中华人民共和国",
        "start_offset": 0,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "中华人民",
        "start_offset": 0,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 1
    }, {
        "token": "中华",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 2
    }, {
        "token": "华人",
        "start_offset": 1,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 3
    }, {
        "token": "人民共和国",
        "start_offset": 2,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 4
    }, {
        "token": "人民",
        "start_offset": 2,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 5
    }, {
        "token": "共和国",
        "start_offset": 4,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 6
    }, {
        "token": "共和",
        "start_offset": 4,
        "end_offset": 6,
        "type": "CN_WORD",
        "position": 7
    }, {
        "token": "国",
        "start_offset": 6,
        "end_offset": 7,
        "type": "CN_CHAR",
        "position": 8
    }, {
        "token": "国歌",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 9
    }]
}

Using ik_smart on the same text:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中华人民共和国国歌"

Result:

{
    "tokens": [{
        "token": "中华人民共和国",
        "start_offset": 0,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "国歌",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 1
    }]
}
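As far as I know, the Analyze API in ES 5.x also accepts the analyzer and text in a JSON request body, which avoids putting Chinese characters in the URL. The sketch below builds such a request with the standard library and pulls the token strings out of a response; the helper names are illustrative, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(base_url: str, index: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build a POST <base_url>/<index>/_analyze request with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response_json: str) -> List[str]:
    """Return just the token strings from an _analyze response."""
    return [token["token"] for token in json.loads(response_json)["tokens"]]


request = build_analyze_request("http://localhost:9200", "index", "ik_smart", "中华人民共和国国歌")
print(request.get_method(), request.full_url)
```

Against a running node, urllib.request.urlopen(request).read().decode("utf-8") returns the response text, which can be passed to token_texts to compare ik_max_word and ik_smart side by side.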


4.2 Pinyin analyzer test

Test the pinyin analyzer:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=张学友"

Result:

{
    "tokens": [{
        "token": "zhang",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 0
    }, {
        "token": "xue",
        "start_offset": 1,
        "end_offset": 2,
        "type": "word",
        "position": 1
    }, {
        "token": "you",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 2
    }, {
        "token": "zxy",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 3
    }]
}
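The shape of this result (one token per syllable, plus one token made of the first letters) can be mimicked with a toy model. This is only an illustration of the output above, not how the plugin is implemented: the lookup table is hand-written and covers just the three characters in this example.

```python
from typing import List

# Toy lookup table: a hand-written assumption covering only this example.
TOY_PINYIN = {"张": "zhang", "学": "xue", "友": "you"}


def toy_pinyin_tokens(text: str) -> List[str]:
    """Return one pinyin token per character, then the first-letter abbreviation."""
    syllables = [TOY_PINYIN[ch] for ch in text]
    first_letters = "".join(syllable[0] for syllable in syllables)
    return syllables + [first_letters]


print(toy_pinyin_tokens("张学友"))  # ['zhang', 'xue', 'you', 'zxy']
```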

5. Configuring IK + pinyin analysis

5.1 Creating the index and analyzer settings

Create an index and set its analyzer-related properties:

curl -XPUT "http://localhost:9200/medcl/" -d'
{
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " "
                }
            }
        }
    }
}'

Create a type and set its mapping:

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'

5.2 Indexing test documents

Index two test documents.
Document 1:

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

Document 2:

curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中华人民共和国国歌"}'

5.3 Test (1): pinyin analysis

All four of the following commands match "刘德华":

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
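The four commands differ only in the term after name.pinyin:, so they can be generated in a loop. A small sketch, assuming the same host, index, and type as above; the helper name is illustrative. Note that urlencode percent-encodes the colon as %3A, which is equivalent to the raw colon in the curl commands.

```python
from urllib.parse import urlencode


def uri_search_url(base_url: str, index: str, doc_type: str, field: str, term: str) -> str:
    """Build a URI search URL of the form .../_search?q=<field>:<term>."""
    query = urlencode({"q": f"{field}:{term}"})
    return f"{base_url}/{index}/{doc_type}/_search?{query}"


for term in ["liu", "de", "hua", "ldh"]:
    print(uri_search_url("http://localhost:9200", "medcl", "folks", "name.pinyin", term))
```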

5.4 Test (2): IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
  "query": {
    "match": {
      "name.pinyin": "国歌"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}'

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 16.698704,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 16.698704,
        "_source" : {
          "name" : "中华人民共和国国歌"
        },
        "highlight" : {
          "name.pinyin" : [
            "<em>中华人民共和国</em><em>国歌</em>"
          ]
        }
      }
    ]
  }
}

This shows that the IK analyzer is taking effect.

5.5 Test (3): pinyin + IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
  "query": {
    "match": {
      "name.pinyin": "zhonghua"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}'

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 5.9814634,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 5.9814634,
        "_source" : {
          "name" : "中华人民共和国国歌"
        },
        "highlight" : {
          "name.pinyin" : [
            "<em>中华人民共和国</em>国歌"
          ]
        }
      },
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "andy",
        "_score" : 2.2534127,
        "_source" : {
          "name" : "刘德华"
        },
        "highlight" : {
          "name.pinyin" : [
            "<em>刘德华</em>"
          ]
        }
      }
    ]
  }
}


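Once the pinyin analyzer is in use, searches must target the field with the .pinyin suffix (name.pinyin); searching the original field (name) returns no results.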
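The search bodies in 5.4 and 5.5 have the same shape and differ only in the query text, so a small helper can build them. This is a sketch with an illustrative function name; passing "name" instead of "name.pinyin" produces the original-field query that, as noted above, comes back empty.

```python
import json


def match_with_highlight(field: str, text: str) -> dict:
    """Build a match query on one field and request highlighting on the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


# Query against the pinyin sub-field, as in section 5.5.
print(json.dumps(match_with_highlight("name.pinyin", "zhonghua"), ensure_ascii=False, indent=2))
```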
