elasticsearch 口水篇(8)分词 中文分词 ik插件
先来一个标准分词(standard),配置如下:
curl -XPUT localhost:9200/local -d '{
"settings" : {
"analysis" : {
"analyzer" : {
"stem" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "stop", "porter_stem"]
}
}
}
},
"mappings" : {
"article" : {
"dynamic" : true,
"properties" : {
"title" : {
"type" : "string",
"analyzer" : "stem"
}
}
}
}
}'
index:local
type:article
default analyzer:stem (filter:小写、停用词等)
field:title
测试:
# Sample Analysis
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Fight for your life}'
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Bruno fights Tyson tomorrow}' # Index Data
curl -XPUT localhost:9200/local/article/1 -d'{"title": "Fight for your life"}'
curl -XPUT localhost:9200/local/article/2 -d'{"title": "Fighting for your life"}'
curl -XPUT localhost:9200/local/article/3 -d'{"title": "My dad fought a dog"}'
curl -XPUT localhost:9200/local/article/4 -d'{"title": "Bruno fights Tyson tomorrow"}' # search on the title field, which is stemmed on index and search
curl -XGET localhost:9200/local/_search?q=title:fight # searching on _all will not do anystemming, unless also configured on the mapping to be stemmed...
curl -XGET localhost:9200/local/_search?q=fight
例如:
Fight for your life
分词如下:
{"tokens":[
{"token":"fight","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1},
{"token":"your","start_offset":11,"end_offset":15,"type":"<ALPHANUM>","position":3},
{"token":"life","start_offset":16,"end_offset":20,"type":"<ALPHANUM>","position":4}
]}
部署ik分词器:
1)将ik分词器插件(es)拷贝到./plugins/analyzerIK/中
2)在elasticsearch.yml中配置
index.analysis.analyzer.ik.type : "ik"
3)在config中添加./config/ik
IKAnalyzer.cfg.xml
main.dic
quantifier.dic
ext.dic
stopword.dic
delete之前创建的index,重新配置如下:
curl -XPUT localhost:9200/local -d '{
"settings" : {
"analysis" : {
"analyzer" : {
"ik" : {
"tokenizer" : "ik"
}
}
}
},
"mappings" : {
"article" : {
"dynamic" : true,
"properties" : {
"title" : {
"type" : "string",
"analyzer" : "ik"
}
}
}
}
}'
测试:
curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'
{
"text":"中华人民共和国国歌"
}
'
{
"tokens" : [ {
"token" : "text",
"start_offset" : 12,
"end_offset" : 16,
"type" : "ENGLISH",
"position" : 1
}, {
"token" : "中华人民共和国",
"start_offset" : 19,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "国歌",
"start_offset" : 26,
"end_offset" : 28,
"type" : "CN_WORD",
"position" : 3
} ]
}
---------------------------------------
如果我们想返回最细粒度的分词结果,需要在elasticsearch.yml中配置如下:
index:
analysis:
analyzer:
ik:
alias: [ik_analyzer]
type: org.elasticsearch.index.analysis.IkAnalyzerProvider
ik_smart:
type: ik
use_smart: true
ik_max_word:
type: ik
use_smart: false
测试:
curl 'http://localhost:9200/index/_analyze?analyzer=ik_max_word&pretty=true' -d'
{
"text":"中华人民共和国国歌"
}
'
{
"tokens" : [ {
"token" : "text",
"start_offset" : 12,
"end_offset" : 16,
"type" : "ENGLISH",
"position" : 1
}, {
"token" : "中华人民共和国",
"start_offset" : 19,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "中华人民",
"start_offset" : 19,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "中华",
"start_offset" : 19,
"end_offset" : 21,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "华人",
"start_offset" : 20,
"end_offset" : 22,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "人民共和国",
"start_offset" : 21,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "人民",
"start_offset" : 21,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "共和国",
"start_offset" : 23,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "共和",
"start_offset" : 23,
"end_offset" : 25,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "国",
"start_offset" : 25,
"end_offset" : 26,
"type" : "CN_CHAR",
"position" : 10
}, {
"token" : "国歌",
"start_offset" : 26,
"end_offset" : 28,
"type" : "CN_WORD",
"position" : 11
} ]
}
elasticsearch 口水篇(8)分词 中文分词 ik插件的更多相关文章
- elasticsearch 口水篇(1) 安装、插件
一)安装elasticsearch 1)下载elasticsearch-0.90.10,解压,运行\bin\elasticsearch.bat (windwos) 2)进入http://localho ...
- elasticsearch 口水篇(4)java客户端 - 原生esClient
上一篇(elasticsearch 口水篇(3)java客户端 - Jest)Jest是第三方客户端,基于REST Api进行调用(httpClient),本篇简单介绍下elasticsearch原生 ...
- ElasticSearch简介(三)——中文分词
很多时候,我们需要在ElasticSearch中启用中文分词,本文这里简单的介绍一下方法.首先安装中文分词插件.这里使用的是 ik,也可以考虑其他插件(比如 smartcn). $ ./bin/ela ...
- elasticsearch学习笔记-倒排索引以及中文分词
我们使用数据库的时候,如果查询条件太复杂,则会涉及到很多问题 1.无法维护,各种嵌套查询,各种复杂的查询,想要优化都无从下手 2.效率低下,一般语句复杂了之后,比如使用or,like %,,%查询之后 ...
- elasticsearch 口水篇(9)Facet
FACET 1)Terms Facet { "query" : { "match_all" : { } }, "facets" : { &q ...
- elasticsearch 口水篇(2)CRUD Sense
Sense 为了方便.直观的使用es的REST Api,我们可以使用sense.Sense是Chrome浏览器的一个插件,使用简单. 如图: Sense安装: https://chrome.googl ...
- elasticsearch 口水篇(7) Eclipse中部署ES源码、运行
ES源码可以直接从svn下载 https://github.com/elasticsearch/elasticsearch 下载后,用Maven导入(import——>Existing Mave ...
- elasticsearch 口水篇(6) Mapping 定义索引
前面我们感觉ES就想是一个nosql数据库,支持Free Schema. 接触过Lucene.solr的同学这时可能会思考一个问题——怎么定义document中的field?store.index.a ...
- elasticsearch 口水篇(3)java客户端 - Jest
elasticsearch有丰富的客户端,java客户端有Jest.其原文介绍如下: Jest is a Java HTTP Rest client for ElasticSearch.It is a ...
随机推荐
- [LeetCode&Python] Problem 905: Sort Array By Parity
Given an array A of non-negative integers, return an array consisting of all the even elements of A, ...
- Gym - 101002K:YATP (树分治+二分+斜率优化)
题意:给定带点权边权的树,定义路径的花费=路径边权和e+起点点权w[s]*终点点权w[t].N<2e5,e,w<1e6: 思路:首先,需要树分治. 然后得到方程dp[i]=min{ dis ...
- kmp--看毛片算法
string str; int next[N];// 核♥: next[k] 字符串前(k-1)个元素有next[k]个相等前后缀 // 初始化 next[0]=-1; next[1]=0; void ...
- day 020 常用模块02
主要内容: 什么是序列化 pickle shelve json configparser(模块) 一 序列化 我们在存储数据或者网络传输数据的时候,需要对我们的对象进行处理,把对象处理成方便存储和 传 ...
- es6的let与es5的var定义变量的区别
es6的let与es5的var定义变量的区别 自身新手第一次接触let关键字的时候,不知道let与var的区别,本能认为是一样,但非如此,比如下述的代码运行就会报错: let hello = 'hel ...
- 《DSP using MATLAB》Problem 5.9
代码: %% ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %% Output In ...
- hdu1686 Oulipo KMP/AC自动机
The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter 'e ...
- servlet路径映射中的完全路径匹配、目录匹配、扩展名匹配
在servlet路径映射中,关于url-pattern的配置有三种,分别是完全路径匹配.目录匹配.扩展名匹配 其优先级分别为:完全路径匹配>目录匹配>扩展名匹配: 一.三种路径印射的区别 ...
- linux(kali,centos)安装vm及其提示缺少c头文件解决方法
我电脑系统是kali最新版 首先去官网下一个vm安装包,给个直达网址 http://www.vmware.com/cn/products/workstation/workstation-evaluat ...
- Collection接口中方法的使用
Collection:集合的接口 1.Collection和ArrayList和List的关系 ArrayList implement(实现) List List ...