Requirements

A product named 雪花啤酒 must be found by any of these queries: 雪花, 啤酒, 雪花啤酒, xh, pj, xh啤酒, 雪花pj.

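Installing IK

Follow https://www.cnblogs.com/LQBlog/p/10443862.html; the step that modifies the source code can be skipped.

Installing the pinyin analyzer

As with IK, download the source from https://github.com/medcl/elasticsearch-analysis-pinyin, build the package, move it into the ES plugins directory, and rename the directory to pinyin.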

Testing

GET request: http://127.0.0.1:9200/_analyze

Body:

{
    "analyzer": "pinyin",
    "text": "雪花啤酒"
}

Response:

{
    "tokens": [
        {
            "token": "xue",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "xhpj",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "hua",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "pi",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "jiu",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        }
    ]
}

This shows that the plugin was installed successfully.
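The same check can be scripted. Below is a minimal Python sketch that uses only the standard library. The helper names (build_analyze_request, tokens_from_response, analyze) are hypothetical, and it assumes Elasticsearch is reachable at the address used above.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # address used in this article


def build_analyze_request(text, analyzer="pinyin", base_url=ES_URL):
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts both GET and POST
    )


def tokens_from_response(response_json):
    """Extract the token strings from a parsed _analyze response."""
    return [t["token"] for t in response_json["tokens"]]


def analyze(text, analyzer="pinyin"):
    """Send the request and return the tokens (needs a running cluster)."""
    with urllib.request.urlopen(build_analyze_request(text, analyzer)) as resp:
        return tokens_from_response(json.load(resp))
```

Calling analyze("雪花啤酒") against a node with the plugin installed should return the token strings shown in the response above.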

Testing combined Chinese and pinyin search

Custom mapping and custom analyzers

PUT request: http://127.0.0.1:9200/opcm3

Body:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": { // a custom analyzer named ik_pinyin_analyzer
                    "type": "custom", // marks this as a custom analyzer
                    "tokenizer": "ik_smart", // IK tokenizer: ik_smart is coarse-grained, ik_max_word is the finest-grained
                    "filter": ["my_pinyin"] // tokens from the tokenizer are handed to this filter for further processing
                },
                "onlyOne_analyzer": {
                    "tokenizer": "onlyOne_pinyin"
                }
            },
            "tokenizer": {
                "onlyOne_pinyin": {
                    "type": "pinyin",
                    "keep_separate_first_letter": "true",
                    "keep_full_pinyin": "false"
                }
            },
            "filter": {
                "my_pinyin": { // filter definition
                    "type": "pinyin",
                    "keep_joined_full_pinyin": true, // also emit the joined full pinyin of a phrase, e.g. 雪花 yields xuehua alongside xh
                    "keep_separate_first_letter": true, // also emit each first letter on its own, so 雪花 yields xue, hua, xuehua, xh, x, h
                    "none_chinese_pinyin_tokenize": true // xh is tokenized into x, h, xh
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "productName": {
                    "type": "text",
                    "analyzer": "ik_pinyin_analyzer", // index with the custom analyzer: IK tokenizes the Chinese, then the filter hands each token to pinyin
                    "fields": { // not used yet; kept as a reminder that a different analyzer can be chosen depending on the query
                        "keyword_once_pinyin": { // extra sub-field that is analyzed but not stored in _source; query productName.keyword_once_pinyin when the input is a single letter
                            "type": "text",
                            "analyzer": "onlyOne_analyzer"
                        }
                    }
                }
            }
        }
    }
}
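A misspelled analyzer, tokenizer, or filter name is easy to introduce in a body like this one. As a sanity check before sending it, the following Python sketch lists names that are referenced but never defined. It works on the body as a Python dict (comments removed). The function name undefined_references and the known tuple of plugin-provided names are assumptions made for illustration, and the mapping layout assumes a typed index such as doc, as used above.

```python
def undefined_references(index_body, known=("ik_smart", "ik_max_word", "pinyin")):
    """Return (kind, name) pairs that are referenced but not defined.

    `known` lists names provided by plugins rather than by this body.
    """
    analysis = index_body.get("settings", {}).get("analysis", {})
    analyzers = analysis.get("analyzer", {})
    tokenizers = analysis.get("tokenizer", {})
    filters = analysis.get("filter", {})
    missing = []

    # Check what each custom analyzer refers to.
    for spec in analyzers.values():
        tok = spec.get("tokenizer")
        if tok and tok not in tokenizers and tok not in known:
            missing.append(("tokenizer", tok))
        for name in spec.get("filter", []):
            if name not in filters and name not in known:
                missing.append(("filter", name))

    # Check the analyzers referenced by fields and their sub-fields.
    def walk(properties):
        for field in properties.values():
            for key in ("analyzer", "search_analyzer"):
                name = field.get(key)
                if name and name not in analyzers and name not in known:
                    missing.append(("analyzer", name))
            walk(field.get("fields", {}))
            walk(field.get("properties", {}))

    for type_mapping in index_body.get("mappings", {}).values():
        walk(type_mapping.get("properties", {}))
    return missing
```

An empty list means every referenced name is either defined in the body or listed in known.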

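My understanding of the filter

As I understand it, IK tokenizes the text first, and each resulting token is then handed to the pinyin filter. For 雪花啤酒, IK produces 雪花 and 啤酒; the pinyin filter then turns 雪花 into xue, hua, xh, x, h and 啤酒 into pi, jiu, pj, p, j.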
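That mental model can be written down as a toy Python sketch. It is not the plugin's implementation: the PINYIN table is a hand-written assumption covering only the four characters of the example, and pinyin_filter and analyze are hypothetical names.

```python
# Toy pinyin table for the example characters only (an assumption;
# the real plugin ships a full dictionary).
PINYIN = {"雪": "xue", "花": "hua", "啤": "pi", "酒": "jiu"}


def pinyin_filter(token):
    """Expand one tokenizer token into pinyin terms, as described above."""
    full = [PINYIN[ch] for ch in token]      # full pinyin per character
    initials = [p[0] for p in full]          # first letter of each character
    abbreviation = "".join(initials)         # joined first letters
    return full + [abbreviation] + initials


def analyze(ik_tokens):
    """Apply the filter to every token produced by the tokenizer."""
    return {token: pinyin_filter(token) for token in ik_tokens}


# The tokenizer output is hard-coded here; IK itself is not involved.
for token, terms in analyze(["雪花", "啤酒"]).items():
    print(token, terms)
```

The point of the sketch is the order of operations: the tokenizer decides the word boundaries, and the filter only ever sees one word at a time.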

Inserting test data

PUT request: http://127.0.0.1:9200/opcm3/doc/1

Body:

{
    "productName": "雪花纯生勇闯天涯9度100ml"
}

PUT request: http://127.0.0.1:9200/opcm3/doc/2

Body:

{
    "productName": "金威纯生勇闯天涯9度100ml"
}
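With more than a handful of test documents, the _bulk API saves a request per document. The Python sketch below only builds the newline-delimited body; bulk_body is a hypothetical helper name, and the _type entry assumes a typed index as in the mapping above (newer Elasticsearch versions no longer accept types).

```python
import json


def bulk_body(index, doc_type, docs):
    """Build an NDJSON body for POST /_bulk from {doc_id: source} pairs."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body("opcm3", "doc", {
    "1": {"productName": "雪花纯生勇闯天涯9度100ml"},
    "2": {"productName": "金威纯生勇闯天涯9度100ml"},
})
print(body)
```

The body is sent to http://127.0.0.1:9200/_bulk with the header Content-Type: application/x-ndjson.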

Viewing the analyzed terms

GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName

GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName.keyword_once_pinyin
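The _termvectors response nests the indexed terms under term_vectors, then the field name, then terms. A small Python sketch can pull out just the term list. The function name indexed_terms is hypothetical, and the sample response is trimmed and made up for illustration.

```python
def indexed_terms(termvectors_response, field):
    """Return the sorted list of terms indexed for one field."""
    field_vectors = termvectors_response.get("term_vectors", {}).get(field, {})
    return sorted(field_vectors.get("terms", {}))


# Trimmed, illustrative response; a real one carries more statistics.
sample = {
    "term_vectors": {
        "productName": {
            "terms": {
                "xue": {"term_freq": 1},
                "chun": {"term_freq": 1},
            }
        }
    }
}
print(indexed_terms(sample, "productName"))  # ['chun', 'xue']
```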

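Testing search

http://127.0.0.1:9200/opcm3/_search

{
    "query": {
        "match": {
            "productName": {
                "query": "雪花纯生"
            }
        }
    }
}

This returns both 雪花纯生 and 金威纯生. Depending on whether you want loose matching or adjacent-term matching, use match or match_phrase.

I need adjacent-term matching, so the query becomes:

{
    "query": {
        "match_phrase": {
            "productName": {
                "query": "雪花纯生"
            }
        }
    }
}

Now only 雪花纯生 is returned.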
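The difference between match and match_phrase can be illustrated with a simplified Python model. The token lists are an assumed, Chinese-only tokenization of the two test documents (the pinyin terms are left out), and both functions are illustrative stand-ins rather than Elasticsearch's actual matching and scoring logic.

```python
def match(doc_tokens, query_tokens):
    """match with the default OR operator: any query term is enough."""
    return any(term in doc_tokens for term in query_tokens)


def match_phrase(doc_tokens, query_tokens):
    """match_phrase with slop 0: the terms must appear adjacent and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Assumed tokenization of the leading part of each product name.
DOCS = {
    "1": ["雪花", "纯生", "勇闯天涯"],
    "2": ["金威", "纯生", "勇闯天涯"],
}
QUERY = ["雪花", "纯生"]

print([doc_id for doc_id, tokens in DOCS.items() if match(tokens, QUERY)])
print([doc_id for doc_id, tokens in DOCS.items() if match_phrase(tokens, QUERY)])
```

Document 2 satisfies match because it shares the term 纯生, but it fails match_phrase because 雪花 does not precede it.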

Searching for 雪花纯生 9度 products

{
    "query": {
        "match_phrase": {
            "productName": {
                "query": "雪花纯生9度"
            }
        }
    }
}

This query returns no data.

For the reason, see: https://www.cnblogs.com/LQBlog/p/10580247.html

Changing the query as follows makes it return the document:

{
    "query": {
        "match_phrase": {
            "productName": {
                "query": "雪花纯生9度",
                "slop": 5
            }
        }
    }
}
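The phrase queries in this section differ only in the query text and the optional slop, so they can be produced by one small helper. This is a Python sketch; phrase_query is a hypothetical name, and the only Elasticsearch feature it relies on is the match_phrase query with its slop parameter, as used above.

```python
import json


def phrase_query(field, text, slop=0):
    """Build a match_phrase search body; slop is added only when non-zero."""
    params = {"query": text}
    if slop:
        params["slop"] = slop
    return {"query": {"match_phrase": {field: params}}}


strict = phrase_query("productName", "雪花纯生9度")
relaxed = phrase_query("productName", "雪花纯生9度", slop=5)
print(json.dumps(relaxed, ensure_ascii=False, indent=4))
```

Keeping slop as an argument makes it easy to try smaller values and find the lowest one that still returns the expected document.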

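The pinyin analyzer also supports many other parameters; see the plugin README at https://github.com/medcl/elasticsearch-analysis-pinyin.

Troubleshooting and fixing the model above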

Adding test data

{
    "productName": "纯生"
}

{
    "productName": "纯爽"
}

Test

Search:

{
    "query": {
        "match_phrase": {
            "productName": {
                "query": "纯生",
                "slop": 5
            }
        }
    }
}

Response:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2.8277423,
        "hits": [
            {
                "_index": "opcm3",
                "_type": "doc",
                "_id": "1",
                "_score": 2.8277423,
                "_source": {
                    "productName": "纯爽"
                }
            },
            {
                "_index": "opcm3",
                "_type": "doc",
                "_id": "2",
                "_score": 1.4466299,
                "_source": {
                    "productName": "纯生"
                }
            }
        ]
    }
}

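Note that 纯爽 is returned as well.

Troubleshooting

1. Check the indexed terms of 纯生 and 纯爽

http://127.0.0.1:9200/opcm3/doc/2/_termvectors?fields=productName

纯生: [c, chun, s, sheng]

纯爽: [c, chun, s, shuang]

2. Check how the query is analyzed

http://127.0.0.1:9200/opcm3/_validate/query?explain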

{
    "query": {
        "match_phrase": {
            "productName": {
                "query": "纯生",
                "slop": 5
            }
        }
    }
}

Response:

{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "opcm3",
            "valid": true,
            "explanation": "productName:\"(c chun) (s sheng)\"~5"
        }
    ]
}

The explanation can be read as (c or chun) and (s or sheng).

So the terms c and s match 纯爽.

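Solution

Index with the finest granularity and search with the coarsest.

For example, the document 纯生 is indexed as [chun, sheng, chunsheng, cs, c, s],

while the query is analyzed as [chun, sheng, chunsheng].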
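To see why dropping the first-letter terms on the query side removes the false match, here is a small Python simulation. It is a toy model, not Elasticsearch's implementation: each position is a set of alternative terms, slop is ignored, the joined terms (chunsheng, cs) are assumed to sit at the first position, and phrase_matches and hits are hypothetical names.

```python
def phrase_matches(doc_positions, query_positions):
    """True if consecutive document positions each share a term with the query."""
    n = len(query_positions)
    for start in range(len(doc_positions) - n + 1):
        window = doc_positions[start:start + n]
        if all(q & d for q, d in zip(query_positions, window)):
            return True
    return False


# Index-time terms per position (finest granularity).
INDEXED = {
    "纯生": [{"chun", "c", "chunsheng", "cs"}, {"sheng", "s"}],
    "纯爽": [{"chun", "c", "chunshuang", "cs"}, {"shuang", "s"}],
}

# Query 纯生 analyzed with first letters: (c chun) (s sheng)
query_with_letters = [{"c", "chun"}, {"s", "sheng"}]
# Query 纯生 analyzed without first letters: chun sheng
query_full_pinyin = [{"chun"}, {"sheng"}]


def hits(query_positions):
    """Names of the indexed documents that match the phrase."""
    return [name for name, positions in INDEXED.items()
            if phrase_matches(positions, query_positions)]


print(hits(query_with_letters))  # both documents match
print(hits(query_full_pinyin))   # only 纯生 matches
```

With the first letters in the query, the shared terms c and s are enough to match 纯爽; once the query carries only full pinyin, sheng has nothing to match in that document.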

The following model satisfies these searches: 雪花, 雪花cs, 雪花chunsheng, xhcs, xh纯生, and 雪花纯生 all return the correct data.

{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["pinyin_max_word_filter"]
                },
                "ik_pingying_smark": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["pinyin_smark_word_filter"]
                }
            },
            "filter": {
                "pinyin_max_word_filter": {
                    "type": "pinyin",
                    "keep_full_pinyin": "true", // emit the full pinyin of each character, e.g. 雪花 yields xue, hua
                    "keep_separate_first_letter": "true", // emit each first letter on its own, e.g. 雪花 yields x, h
                    "keep_joined_full_pinyin": true // emit the joined full pinyin, e.g. 雪花 yields xuehua
                },
                "pinyin_smark_word_filter": {
                    "type": "pinyin",
                    "keep_separate_first_letter": "false", // do not emit separate first letters, e.g. no x, h for 雪花
                    "keep_first_letter": "false" // do not emit the first-letter abbreviation, e.g. no xh for 雪花
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "productName": {
                    "type": "text",
                    "analyzer": "ik_pinyin_analyzer", // analyzer used when indexing documents
                    "search_analyzer": "ik_pingying_smark" // analyzer used when searching
                }
            }
        }
    }
}
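One way to confirm that the two analyzers behave differently is to run the same text through both with the index-scoped _analyze endpoint. The Python sketch below only builds the two requests; analyzer_comparison_requests is a hypothetical helper name, and the index name opcm3 and address are the ones used earlier in this article.

```python
import json

BASE_URL = "http://127.0.0.1:9200"  # address used in this article


def analyzer_comparison_requests(text, index="opcm3"):
    """Return (url, body) pairs, one per analyzer, for POST <index>/_analyze."""
    url = f"{BASE_URL}/{index}/_analyze"
    analyzers = ["ik_pinyin_analyzer", "ik_pingying_smark"]
    return [
        (url, json.dumps({"analyzer": name, "text": text}, ensure_ascii=False))
        for name in analyzers
    ]


for url, body in analyzer_comparison_requests("纯生"):
    print("POST", url, body)
```

Comparing the two token lists for the same input shows directly which terms are indexed but never generated at search time.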

Solution 2
