Source: https://github.com/medcl/elasticsearch-analysis-pinyin

Pinyin Analysis for Elasticsearch

This Pinyin Analysis plugin converts between Chinese characters and Pinyin, and integrates NLP tools (https://github.com/NLPchina/nlp-lang).

--------------------------------------------------
| Pinyin Analysis Plugin        | Elasticsearch  |
--------------------------------------------------
| master                        | 5.x -> master  |
| 5.5.1                         | 5.5.1          |
| 5.3.3                         | 5.3.3          |
| 5.2.2                         | 5.2.2          |
| 5.1.2                         | 5.1.2          |
| 1.8.1                         | 2.4.1          |
| 1.7.5                         | 2.3.5          |
| 1.6.1                         | 2.2.1          |
| 1.5.0                         | 2.1.0          |
| 1.4.0                         | 2.0.x          |
| 1.3.0                         | 1.6.x          |
| 1.2.2                         | 1.0.x          |
--------------------------------------------------

The plugin includes an analyzer (pinyin), a tokenizer (pinyin), and a token filter (pinyin).

Optional Parameters

keep_first_letter: when enabled, emits the joined first letters, e.g. 刘德华 -> ldh. Default: true.
keep_separate_first_letter: when enabled, emits each first letter as a separate term, e.g. 刘德华 -> l,d,h. Default: false. NOTE: query results may be too fuzzy because single-letter terms are very frequent.
limit_first_letter_length: maximum length of the first-letter result. Default: 16.
keep_full_pinyin: when enabled, emits the full pinyin of each character, e.g. 刘德华 -> [liu,de,hua]. Default: true.
keep_joined_full_pinyin: when enabled, emits the joined full pinyin, e.g. 刘德华 -> [liudehua]. Default: false.
keep_none_chinese: keeps non-Chinese letters and numbers in the result. Default: true.
keep_none_chinese_together: keeps non-Chinese letters together, e.g. DJ音乐家 -> DJ,yin,yue,jia; when set to false, DJ音乐家 -> D,J,yin,yue,jia. Default: true. NOTE: keep_none_chinese must be enabled first.
keep_none_chinese_in_first_letter: keeps non-Chinese letters in the first-letter term, e.g. 刘德华AT2016 -> ldhat2016. Default: true.
keep_none_chinese_in_joined_full_pinyin: keeps non-Chinese letters in the joined full pinyin, e.g. 刘德华2016 -> liudehua2016. Default: false.
none_chinese_pinyin_tokenize: breaks non-Chinese letters into separate pinyin terms if they are pinyin, e.g. liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han. Default: true. NOTE: keep_none_chinese and keep_none_chinese_together must be enabled first.
keep_original: when enabled, also keeps the original input. Default: false.
lowercase: lowercases non-Chinese letters. Default: true.
trim_whitespace: default: true.
remove_duplicated_term: when enabled, duplicated terms are removed to save index space, e.g. de的 -> de. Default: false. NOTE: position-related queries may be affected.
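
To make the first-letter and full-pinyin options concrete, here is a small Python sketch that reproduces the documented examples for 刘德华. It is a toy model, not the plugin's code: the PINYIN lookup table and the illustrate function are hypothetical names introduced here, and the table covers only the three characters used in the examples.

```python
# Toy illustration only: a hand-written lookup table, not the plugin's dictionary.
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def illustrate(text, keep_first_letter=True, keep_separate_first_letter=False,
               keep_full_pinyin=True, keep_joined_full_pinyin=False,
               limit_first_letter_length=16):
    """Return, per enabled option, the terms shown in the documented examples."""
    syllables = [PINYIN[ch] for ch in text]  # KeyError outside the toy table
    terms = {}
    if keep_first_letter:
        joined = "".join(s[0] for s in syllables)
        terms["keep_first_letter"] = [joined[:limit_first_letter_length]]
    if keep_separate_first_letter:
        terms["keep_separate_first_letter"] = [s[0] for s in syllables]
    if keep_full_pinyin:
        terms["keep_full_pinyin"] = syllables
    if keep_joined_full_pinyin:
        terms["keep_joined_full_pinyin"] = ["".join(syllables)]
    return terms


if __name__ == "__main__":
    # Defaults mirror the documented defaults of the four options modelled here.
    print(illustrate("刘德华"))
    print(illustrate("刘德华", keep_separate_first_letter=True,
                     keep_joined_full_pinyin=True))
```

The sketch says nothing about the order in which the real plugin emits terms or about their positions and offsets; it only groups the example outputs by the option that produces them.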

1. Create an index with a custom pinyin analyzer

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'
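
If the settings body is generated from code rather than typed by hand, a helper along the following lines can catch misspelled option names before the request is sent. This is only a sketch: pinyin_index_settings and KNOWN_OPTIONS are hypothetical names, the option names are copied from the parameter list above, and nothing here talks to Elasticsearch.

```python
import json

# Option names copied from the "Optional Parameters" list in this document.
KNOWN_OPTIONS = {
    "keep_first_letter", "keep_separate_first_letter",
    "limit_first_letter_length", "keep_full_pinyin",
    "keep_joined_full_pinyin", "keep_none_chinese",
    "keep_none_chinese_together", "keep_none_chinese_in_first_letter",
    "keep_none_chinese_in_joined_full_pinyin", "none_chinese_pinyin_tokenize",
    "keep_original", "lowercase", "trim_whitespace", "remove_duplicated_term",
}


def pinyin_index_settings(analyzer_name, tokenizer_name, **options):
    """Build the index settings body for an analyzer backed by a pinyin tokenizer."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError("unknown pinyin option(s): " + ", ".join(sorted(unknown)))
    tokenizer = {"type": "pinyin"}
    tokenizer.update(options)
    return {
        "index": {
            "analysis": {
                "analyzer": {analyzer_name: {"tokenizer": tokenizer_name}},
                "tokenizer": {tokenizer_name: tokenizer},
            }
        }
    }


if __name__ == "__main__":
    body = pinyin_index_settings(
        "pinyin_analyzer", "my_pinyin",
        keep_separate_first_letter=False, keep_full_pinyin=True,
        keep_original=True, limit_first_letter_length=16,
        lowercase=True, remove_duplicated_term=True,
    )
    # The printed JSON can be passed to curl with -d.
    print(json.dumps(body, indent=4))
```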

2. Test the analyzer by analyzing a Chinese name, such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}
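
The text parameter in the URL above is simply 刘德华 encoded as UTF-8 and percent-escaped. A short Python sketch shows how such a URL can be built for any input; analyze_url is a hypothetical helper name, and the base URL and index name are the ones used in this walkthrough.

```python
from urllib.parse import urlencode


def analyze_url(base, index, text, analyzer):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    query = urlencode({"text": text, "analyzer": analyzer})
    return f"{base}/{index}/_analyze?{query}"


if __name__ == "__main__":
    print(analyze_url("http://localhost:9200", "medcl", "刘德华", "pinyin_analyzer"))
```

urlencode emits uppercase hex digits (%E5%88%98...), which is equivalent to the lowercase form in the URL above, and it encodes spaces as +.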

3. Create the mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": false,
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'

4. Index a document

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

5. Let's search

curl http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua

6. Using the pinyin token filter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'

Token test: 刘德华 张学友 郭富城 黎明 四大天王

curl -XGET 'http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer'

{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}
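
The offsets in this response follow from the whitespace tokenizer: each name is one token, and the filter then replaces it with its first letters. The Python sketch below models that pipeline to show where the numbers come from. It is not the plugin's code; FIRST_LETTER is a hand-written table covering only the characters in the test string, and first_letter_tokens is a hypothetical name.

```python
import re

# Hand-written first-letter table for the test string only (an assumption,
# standing in for the plugin's real pinyin dictionary).
FIRST_LETTER = {
    "刘": "l", "德": "d", "华": "h", "张": "z", "学": "x", "友": "y",
    "郭": "g", "富": "f", "城": "c", "黎": "l", "明": "m",
    "四": "s", "大": "d", "天": "t", "王": "w",
}


def first_letter_tokens(text):
    """Split on whitespace, then map each token to its joined first letters."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        # Characters outside the table are kept as-is, lowercased.
        term = "".join(FIRST_LETTER.get(ch, ch.lower()) for ch in match.group())
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for token in first_letter_tokens("刘德华 张学友 郭富城 黎明 四大天王"):
        print(token)
```

Each space between names accounts for the gap of one between an end_offset and the next start_offset.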

7. Use in phrase queries

option 1

  PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                  }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter" : false,
                      "keep_separate_first_letter" : false,
                      "keep_full_pinyin" : true,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘德华"
    }}
  }

option 2

  PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                  }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter" : false,
                      "keep_separate_first_letter" : true,
                      "keep_full_pinyin" : false,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  POST /medcl/folks/andy
  {"name":"刘德华"}

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘德h"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘dh"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "dh"
    }}
  }
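
One way to see why the three option 2 queries can match: with keep_separate_first_letter enabled, 刘德华 is indexed as the terms l, d, h at consecutive positions, and a phrase query matches when its own terms appear consecutively in that order. The Python sketch below models this. It rests on a loud assumption that is implied by the queries above but not verified against the plugin: each Latin letter in the query text is treated as its own one-letter term. FIRST_LETTER, separate_first_letters, and phrase_matches are hypothetical names.

```python
# Hand-written table for the example name only (an assumption, not plugin data).
FIRST_LETTER = {"刘": "l", "德": "d", "华": "h"}


def separate_first_letters(text):
    """Model of keep_separate_first_letter: one single-letter term per character.

    Assumption: Latin letters pass through as their own lowercase term.
    """
    return [FIRST_LETTER.get(ch, ch.lower()) for ch in text if not ch.isspace()]


def phrase_matches(indexed_terms, query_terms):
    """True if query_terms occur as a contiguous run inside indexed_terms."""
    n = len(query_terms)
    if n == 0:
        return False
    return any(indexed_terms[i:i + n] == query_terms
               for i in range(len(indexed_terms) - n + 1))


if __name__ == "__main__":
    indexed = separate_first_letters("刘德华")
    for query in ("刘德h", "刘dh", "dh", "lh"):
        print(query, phrase_matches(indexed, separate_first_letters(query)))
```

Under this model the query lh does not match, because l and h are not adjacent in the indexed terms.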

8. That's all, have fun.
