Elasticsearch Analyzers

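Every analyzer, whether built-in or custom, is made of three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.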

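Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in the Servlet world; picture a filter chain.)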
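To make the idea concrete, here is a minimal Python sketch of a character filter chain. It is not how Elasticsearch implements character filters; the names digit_char_filter and apply_char_filters are hypothetical and exist only for this illustration.

```python
# Hypothetical character filter: maps Arabic-Indic digits (U+0660..U+0669)
# to ASCII digits. The table and function names are illustrative only.
ARABIC_INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}


def digit_char_filter(text: str) -> str:
    """Return the text with Arabic-Indic digits replaced by 0-9."""
    return text.translate(ARABIC_INDIC_TO_ASCII)


def apply_char_filters(text, char_filters):
    """Apply zero or more character filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    return text


print(apply_char_filters("\u0661\u0662\u0663 apples", [digit_char_filter]))
```

With an empty filter list the text passes through unchanged, which mirrors the "zero or more" rule.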

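Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer breaks text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer is responsible for splitting text into individual tokens, where a token is essentially one word. A piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, and the start and end character offsets of the original word that the term represents. (PS: the output of tokenization is an array of terms.)

An analyzer must have exactly one tokenizer.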
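The bookkeeping a tokenizer does can be sketched in a few lines of Python. This is only an approximation of a whitespace tokenizer; the function name whitespace_tokenize and the dictionary layout are assumptions chosen to resemble the fields the analyze API reports.

```python
import re
from typing import Dict, List


def whitespace_tokenize(text: str) -> List[Dict]:
    """Split on whitespace, keeping the position and character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),  # exclusive end index
            "position": position,
        })
    return tokens


for tok in whitespace_tokenize("Quick brown fox!"):
    print(tok)
```

Unlike a plain split, each token keeps a pointer back into the original text, which is what later features such as highlighting rely on.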

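Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, the lowercase token filter converts all tokens to lowercase, the stop token filter removes common words (stop words) such as "the", and the synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.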
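The three filters mentioned above can be imitated with plain Python lists. This is a simplified sketch: the function names are hypothetical, the synonym table is made up, and positions and offsets are ignored for brevity.

```python
from typing import Dict, Iterable, List, Set


def lowercase_filter(tokens: Iterable[str]) -> List[str]:
    """Change tokens: convert every token to lowercase."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Remove tokens: drop every token found in the stop word set."""
    return [t for t in tokens if t not in stopwords]


def synonym_filter(tokens: Iterable[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Add tokens: emit each token followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out


tokens = ["The", "Quick", "fox"]
tokens = lowercase_filter(tokens)
tokens = stop_filter(tokens, {"the"})
tokens = synonym_filter(tokens, {"quick": ["fast"]})  # made-up synonym table
print(tokens)  # ['quick', 'fast', 'fox']
```

The order matters: the stop filter only matches "the" here because the lowercase filter ran first.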

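Summary and review

An analyzer is a package made of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it receives a character stream and outputs a character stream.

A tokenizer splits text: it receives a character stream and outputs a token stream (the text is broken into individual words, and these words are called tokens).

A token filter filters tokens: it receives a token stream and outputs a token stream.

So the whole job of an analyzer is to break text into individual words: text ----> characters ----> tokens.

This works much like a chain of interceptors.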
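The summary can be expressed as a small Python class. This is a conceptual sketch, not Elasticsearch code: the Analyzer class and its field names are hypothetical, and the sample components are deliberately trivial.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Analyzer:
    # Exactly one tokenizer is required; both filter lists may be empty.
    tokenizer: Callable[[str], List[str]]
    char_filters: List[Callable[[str], str]] = field(default_factory=list)
    token_filters: List[Callable[[List[str]], List[str]]] = field(default_factory=list)

    def analyze(self, text: str) -> List[str]:
        for char_filter in self.char_filters:      # text -> characters
            text = char_filter(text)
        tokens = self.tokenizer(text)              # characters -> tokens
        for token_filter in self.token_filters:    # tokens -> tokens
            tokens = token_filter(tokens)
        return tokens


analyzer = Analyzer(
    tokenizer=str.split,
    char_filters=[lambda s: s.replace("&", " and ")],
    token_filters=[lambda toks: [t.lower() for t in toks]],
)
print(analyzer.analyze("Salt&Pepper Shakers"))
```

Making the tokenizer a required field while the two lists default to empty encodes the "zero or more, exactly one, zero or more" rule directly in the constructor.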

1. Testing analyzers

The analyze API is a tool for viewing the result of the analysis process. (PS: it is similar to an execution plan for a query.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and the character offsets are recorded for each term.

2. Analyzer

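2.1. Configuring built-in analyzers

The built-in analyzers can be used directly without any configuration. Their default settings can be changed, though. For example, the standard analyzer can be configured to support a list of stop words: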

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard",
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english"
            }
          }
        }
      }
    }
  }
}
'

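In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer and the my_text.english field uses the std_english analyzer. The two requests below therefore produce different results: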

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text",
  "text": "The old brown cow"
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english",
  "text": "The old brown cow"
}
'

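The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second request uses the std_english analyzer, so the result is: [ old, brown, cow ]

2.2. Standard Analyzer (default)

The standard analyzer is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example: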

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

In the example above, the text produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
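For this particular sentence, the behavior can be roughly imitated in Python. This is only an approximation for illustration: the real tokenizer follows the Unicode Text Segmentation rules, which a single regular expression does not capture, and the function name rough_standard_analyze is hypothetical.

```python
import re


def rough_standard_analyze(text: str):
    """Very rough stand-in for the standard analyzer on simple English text.

    Keeps runs of word characters (allowing apostrophes inside a word),
    then lowercases them. This is NOT the Unicode Text Segmentation algorithm.
    """
    return [t.lower() for t in re.findall(r"\w+(?:'\w+)*", text)]


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(rough_standard_analyze(text))
```

Note how the hyphen splits Brown-Foxes into two terms while the apostrophe in dog's does not split the word.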

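2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length. A token that exceeds this length is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.2.2. Example configuration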

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The request above produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
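The effect of a maximum token length of 5 (which is why jumped becomes jumpe and d) can be sketched in Python. The word splitting is a rough regular expression rather than the real tokenizer, SAMPLE_STOPWORDS is a small illustrative subset rather than the actual _english_ list, and all function names are hypothetical.

```python
import re

# Illustrative subset only; the real _english_ list is longer.
SAMPLE_STOPWORDS = {"the", "a", "an", "is"}


def split_at_intervals(token, max_len):
    """Cut a token into consecutive chunks of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def analyze(text, max_token_length=5, stopwords=SAMPLE_STOPWORDS):
    words = re.findall(r"\w+(?:'\w+)*", text)  # rough word splitting
    tokens = []
    for word in words:
        tokens.extend(split_at_intervals(word, max_token_length))
    tokens = [t.lower() for t in tokens]             # lowercase filter
    return [t for t in tokens if t not in stopwords]  # stop filter


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(analyze(text))
```

Splitting happens at the tokenizer stage, before lowercasing and stop word removal, which is the order used in the sketch.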

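2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter (disabled by default)

If you need to customize it beyond the configuration parameters, you can recreate it as a custom analyzer and modify it: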

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and it lowercases all terms. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
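The "split on anything that is not a letter" rule is easy to mimic in Python. This is an illustrative sketch with a hypothetical function name, not the actual implementation.

```python
import re


def simple_analyze(text: str):
    """Keep runs of letters only, then lowercase them.

    [^\\W\\d_] matches one word character that is neither a digit nor an
    underscore, i.e. a letter.
    """
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(simple_analyze(text))
```

Because digits are not letters, the 2 disappears entirely, and the apostrophe splits dog's into dog and s.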

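2.3.1. Customization

The simple analyzer consists of the lowercase tokenizer alone. To customize it, recreate it as a custom analyzer and add token filters: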

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

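The stop analyzer is much like the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the _english_ stop words.

(PS: suppose the text is "this is a apple" and that "this" and "is" are the only stop words. The simple analyzer would output [ this, is, a, apple ], whereas the stop analyzer would output [ a, apple ]. That is the difference between the two: the stop analyzer does not emit stop words, so a stop word never becomes a term. Note that the default _english_ list also contains "a", so with the defaults only [ apple ] would remain.)

(PS: stop words are very common words, such as "the", that carry little meaning for search. They are removed from the token stream after tokenization; they are not delimiters.)

2.5.1. Example output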

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

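2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (for example, _english_) or an array containing a list of stop words. Defaults to _english_.
stopwords_path: the path to a file containing stop words. The path is relative to the Elasticsearch config directory.

2.5.3. Example configuration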

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With this configuration, the request produces the following output:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
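A Python sketch of the same idea: split on non-letters, lowercase, then drop the configured stop words. The function name stop_analyze is hypothetical and the code only imitates the behavior described above.

```python
import re


def stop_analyze(text, stopwords):
    """Split on non-letters, lowercase, and drop the given stop words."""
    terms = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in terms if t not in stopwords]


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyze(text, {"the", "over"}))
```

Passing an empty set as stopwords gives the same terms as the simple analyzer, which shows that the two differ only in the final filtering step.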

2.6. Pattern Analyzer

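The pattern analyzer uses a Java regular expression to split text into terms. The regular expression matches the separators between terms, not the terms themselves. The default is \W+ (one or more non-word characters).

2.6.1. Example output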

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because it splits on non-word characters by default, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

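2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, for example CASE_INSENSITIVE or COMMENTS. Multiple flags are separated by a pipe, as in "CASE_INSENSITIVE|COMMENTS".
lowercase: whether terms should be lowercased. Defaults to true.
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.6.3. Example configuration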

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores and lowercases the resulting terms.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example produces:

[ john, smith, foo, bar, com ]
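The same split can be tried out with Python's re module. Keep in mind that this is a sketch: Python and Java regular expressions differ in details (for example, how \W treats non-ASCII characters by default), and the function name pattern_analyze is hypothetical.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text on a separator pattern and drop empty pieces."""
    terms = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in terms] if lowercase else terms


print(pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_"))
print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The underscore has to be listed explicitly in the pattern because it counts as a word character, so \W alone would leave John_Smith as a single term.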

2.7. Language Analyzers

Language analyzers support text analysis for specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzer

As mentioned earlier, an analyzer is made of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
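To see what each stage of my_custom_analyzer contributes, here is a rough Python imitation. Every function is a simplified stand-in with a hypothetical name: the real html_strip, standard tokenizer, and asciifolding components handle many more cases than these few lines, and the sample sentence is made up.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def ascii_fold(token):
    """Rough stand-in for asciifolding: strip combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def my_custom_analyze(text):
    text = html_strip(text)                       # char_filter
    tokens = re.findall(r"\w+(?:'\w+)*", text)    # tokenizer (approximation)
    tokens = [t.lower() for t in tokens]          # filter: lowercase
    return [ascii_fold(t) for t in tokens]        # filter: asciifolding


print(my_custom_analyze("Is this <b>d\u00e9j\u00e0 vu</b>?"))
```

The order follows the configuration: the character filter runs on the raw text, then the tokenizer, then the token filters in the order they are listed.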

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

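4. Chinese Analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

The plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and neither needs any configuration.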

# Install
bin/elasticsearch-plugin install analysis-smartcn

# Uninstall
bin/elasticsearch-plugin remove analysis-smartcn

Here is a quick test.

With the smartcn analyzer, "今天天气真好" is analyzed into:

[ 今天, 天气, 真, 好 ]

With the standard analyzer, the result is one token per character:

[ 今, 天, 天, 气, 真, 好 ]

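4.2. IK Analyzer

Download the release that matches your Elasticsearch version; here I downloaded 6.5.3.

Then create an ik directory under the Elasticsearch plugins directory and unzip the downloaded file into it.

Finally, restart Elasticsearch.

Next, test it with the same sentence as before.

The output is as follows: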

{
    "tokens": [
        {
            "token": "今天天气",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天气",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

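This looks somewhat better than the smartcn result: 真好 is kept as one word, and overlapping candidate words such as 今天天气 and 今天 are all emitted.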
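The overlapping tokens suggest a dictionary-driven approach that emits every dictionary word found in the text. The sketch below illustrates that general idea only. It is not IK's actual algorithm, and HYPOTHETICAL_DICT is a made-up five-word dictionary chosen to reproduce the tokens shown above.

```python
def max_word_segment(text, dictionary):
    """Emit every dictionary word found in the text, longest first per start."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    for start in range(len(text)):
        longest = min(max_len, len(text) - start)
        for length in range(longest, 0, -1):
            word = text[start:start + length]
            if word in dictionary:
                tokens.append({
                    "token": word,
                    "start_offset": start,
                    "end_offset": start + length,
                })
    return tokens


# Made-up dictionary for illustration; a real dictionary is far larger.
HYPOTHETICAL_DICT = {"今天天气", "今天", "天天", "天气", "真好"}

for tok in max_word_segment("今天天气真好", HYPOTHETICAL_DICT):
    print(tok)
```

Emitting all candidates favors recall: a search for either 今天 or 天气 can match the document, at the cost of a larger index.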

5. References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik
