来源:https://github.com/medcl/elasticsearch-analysis-pinyin

Pinyin Analysis for Elasticsearch

This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin, integrates NLP tools (https://github.com/NLPchina/nlp-lang).

--------------------------------------------------
| Pinyin Analysis Plugin | Elasticsearch |
--------------------------------------------------
| master | 5.x -> master |
--------------------------------------------------
| 5.5.1 | 5.5.1 |
--------------------------------------------------
| 5.3.3 | 5.3.3 |
--------------------------------------------------
| 5.2.2 | 5.2.2 |
--------------------------------------------------
| 5.1.2 | 5.1.2 |
--------------------------------------------------
| 1.8.1 | 2.4.1 |
--------------------------------------------------
| 1.7.5 | 2.3.5 |
--------------------------------------------------
| 1.6.1 | 2.2.1 |
--------------------------------------------------
| 1.5.0 | 2.1.0 |
--------------------------------------------------
| 1.4.0 | 2.0.x |
--------------------------------------------------
| 1.3.0 | 1.6.x |
--------------------------------------------------
| 1.2.2 | 1.0.x |
--------------------------------------------------

The plugin includes analyzer: pinyin , tokenizer: pinyin and token-filter: pinyin.

** Optional Parameters **

  • keep_first_letter when this option enabled, eg: 刘德华>ldh, default: true
  • keep_separate_first_letter when this option enabled, will keep first letters separately, eg: 刘德华>l,d,h, default: false, NOTE: query result maybe too fuzziness due to term too frequency
  • limit_first_letter_length set max length of the first_letter result, default: 16
  • keep_full_pinyin when this option enabled, eg: 刘德华> [liu,de,hua], default: true
  • keep_joined_full_pinyin when this option enabled, eg: 刘德华> [liudehua], default: false
  • keep_none_chinese keep non chinese letter or number in result, default: true
  • keep_none_chinese_together keep non chinese letter together, default: true, eg: DJ音乐家 -> DJ,yin,yue,jia, when set to false, eg: DJ音乐家 -> D,J,yin,yue,jia, NOTE: keep_none_chinese should be enabled first
  • keep_none_chinese_in_first_letter keep non Chinese letters in first letter, eg: 刘德华AT2016->ldhat2016, default: true
  • keep_none_chinese_in_joined_full_pinyin keep non Chinese letters in joined full pinyin, eg: 刘德华2016->liudehua2016, default: false
  • none_chinese_pinyin_tokenize break non chinese letters into separate pinyin term if they are pinyin, default: true, eg: liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han, NOTE: keep_none_chinese and keep_none_chinese_together should be enabled first
  • keep_original when this option enabled, will keep original input as well, default: false
  • lowercase lowercase non Chinese letters, default: true
  • trim_whitespace default: true
  • remove_duplicated_term when this option enabled, duplicated term will be removed to save index, eg: de的>de, default: false, NOTE: position related query maybe influenced

1.Create a index with custom pinyin analyzer

curl -XPUT http://localhost:9200/medcl/ -d'
{
"index" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}'

2.Test Analyzer, analyzing a chinese name, such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
{
"tokens" : [
{
"token" : "liu",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "de",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "hua",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "刘德华",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 3
},
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 4
}
]
}

3.Create mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
"folks": {
"properties": {
"name": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"store": "no",
"term_vector": "with_offsets",
"analyzer": "pinyin_analyzer",
"boost": 10
}
}
}
}
}
}'

4.Indexing

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

5.Let's search

http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua

6.Using Pinyin-TokenFilter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
"index" : {
"analysis" : {
"analyzer" : {
"user_name_analyzer" : {
"tokenizer" : "whitespace",
"filter" : "pinyin_first_letter_and_full_pinyin_filter"
}
},
"filter" : {
"pinyin_first_letter_and_full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_none_chinese" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
}
}
}
}
}'

Token Test:刘德华 张学友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer
{
"tokens" : [
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "zxy",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "gfc",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "lm",
"start_offset" : 12,
"end_offset" : 14,
"type" : "word",
"position" : 3
},
{
"token" : "sdtw",
"start_offset" : 15,
"end_offset" : 19,
"type" : "word",
"position" : 4
}
]
}

7.Used in phrase query

  • option 1

      PUT /medcl/
    {
    "index" : {
    "analysis" : {
    "analyzer" : {
    "pinyin_analyzer" : {
    "tokenizer" : "my_pinyin"
    }
    },
    "tokenizer" : {
    "my_pinyin" : {
    "type" : "pinyin",
    "keep_first_letter":false,
    "keep_separate_first_letter" : false,
    "keep_full_pinyin" : true,
    "keep_original" : false,
    "limit_first_letter_length" : 16,
    "lowercase" : true
    }
    }
    }
    }
    }
    GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘德华"
    }}
    }
  • option 2

      PUT /medcl/
    {
    "index" : {
    "analysis" : {
    "analyzer" : {
    "pinyin_analyzer" : {
    "tokenizer" : "my_pinyin"
    }
    },
    "tokenizer" : {
    "my_pinyin" : {
    "type" : "pinyin",
    "keep_first_letter":false,
    "keep_separate_first_letter" : true,
    "keep_full_pinyin" : false,
    "keep_original" : false,
    "limit_first_letter_length" : 16,
    "lowercase" : true
    }
    }
    }
    }
    } POST /medcl/folks/andy
    {"name":"刘德华"} GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘德h"
    }}
    } GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘dh"
    }}
    } GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "dh"
    }}
    }

8.That's all, have fun.

elasticsearch-analysis-pinyin的更多相关文章

  1. Elasticsearch IK+pinyin

    如何在Elasticsearch中安装中文分词器(IK+pinyin)   如果直接使用Elasticsearch的朋友在处理中文内容的搜索时,肯定会遇到很尴尬的问题——中文词语被分成了一个一个的汉字 ...

  2. Elasticsearch:Pinyin 分词器

    Elastic的Medcl提供了一种搜索Pinyin搜索的方法.拼音搜索在很多的应用场景中都有被用到.比如在百度搜索中,我们使用拼音就可以出现汉字: 对于我们中国人来说,拼音搜索也是非常直接的.那么在 ...

  3. ElasticSearch安装拼音插件(pinyin)

    环境介绍 集群环境如下: Ubuntu14.04 ElasticSearch 2.3.1(3节点) JDK1.8.0_60 开发环境: Windows10 JDK 1.8.0_66 Maven 3.3 ...

  4. elasticsearch+logstash_jdbc 实现mysql数据实时同步至es

    jdk安装1.8版本,es.ls.ik.kibana版本一致我这里使用的6.6.2版本 安装es tar xf elasticsearch-6.6.2.tar.gz mv elasticsearch- ...

  5. Elasticsearch搜索资料汇总

    Elasticsearch 简介 Elasticsearch(ES)是一个基于Lucene 构建的开源分布式搜索分析引擎,可以近实时的索引.检索数据.具备高可靠.易使用.社区活跃等特点,在全文检索.日 ...

  6. Elasticsearch实现搜索推荐词

    本篇介绍的是基于Elasticsearch实现搜索推荐词,其中需要用到Elasticsearch的pinyin插件以及ik分词插件,代码的实现这里提供了java跟C#的版本方便大家参考. 1.实现的结 ...

  7. (转)How to Use Elasticsearch, Logstash, and Kibana to Manage MySQL Logs

    A comprehensive log management and analysis strategy is vital, enabling organizations to understand ...

  8. linux环境下配置solr5.3详细步骤

    本人上周五刚刚配置了一遍centos下配置solr5.3版本,综合借鉴并改进了一些教程,贴出如下 单位使用内网,本教程暂无截图,抱歉 另,本人是使用.net编程调用solr的使用的是solrnet,在 ...

  9. solr5Ik分词2

    <!--IK分词器--><fieldType name="text_ik" class="solr.TextField"><ana ...

随机推荐

  1. 自学jquery,下午实现前后台交互--成为牛逼的女程序员

    希望周末能够把搜索质量对比的页面做出来!!! 牛逼的薰衣草程序员,fighting

  2. OPENSSL 生成https 客户端证书

    下面说下拿服务器证书.(前提是服务器是https,客户端认证用的时候),服务端不给的时候,我们自己去拿(不给怼他!,哈哈,开个玩笑,都会给的) openssl s_client -connect 域名 ...

  3. 解题报告-1012. Numbers With Repeated Digits

    Given a positive integer N, return the number of positive integers less than or equal to N that have ...

  4. Qt5 How to translate App UI languages

    Adding new language file name in app.pro file. TRANSLATIONS += lg_ch.ts \ lg_en.ts \ lg_new.ts Runni ...

  5. 产品设计师 VS UX设计师:你更想成为哪一个?

    随着互联网的快速发展,越来越多的应届毕业生也成为设计师的一员.他们当中的许多人选择UX设计师作为第一份工作,也有一些人选择做一个产品设计师.你是否也想成为设计师呢?这两种设计师你更倾向于哪一个呢?在你 ...

  6. windows聚焦图片文件重命名bash脚本

    win10聚焦路径为: %localappdata%\Packages\Microsoft.Windows.ContentDeliveryManager_cw5n1h2txyewy\LocalStat ...

  7. 关于使用 ps脚本来处理图片的排层问题

    问题是这样,在三维软件 把模型切割,给切割的部件排上序号 如 :tianji_1-431_bujian13.png 中间一段数据就是表示的 1.431 该数据之后 会到ps总排层使用 在ps排层出三维 ...

  8. SpringBoot里的一些注解

    Spring不仅可以通过xml配置获取*.properties,还可以通过@Value注解的方式来获取,将properties配置文件中的属性值注入到java成员变量. 如果不想每次都写private ...

  9. Jmeter Cookie管理器 获取JSESSIONID

    1.打开jmeter.抓包添加Web请求后,添加Cookie管理器.直接添加就行.值要不要都一样 添加值:${COOKIE_JSESSIONID 域:${server} 2.点击载入到当前脚本 3.到 ...

  10. 原型模式及C++实现

    以下是我自己学习设计模式的感想. 原型模式 学过C++的都知道拷贝构造函数,复制一个对象分为浅拷贝和深拷贝. 浅拷贝:就是给对象中的每个成员变量进行复制,就是把A1类中的变量直接赋给A2类中变量,属于 ...