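Introduction to Elasticsearch (ES) analyzers

standard analyzer: splits text on word boundaries and lowercases each token. Punctuation is dropped, and stop words are kept by default.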

GET _analyze

{

  "analyzer": "standard",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "<ALPHANUM>",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "<ALPHANUM>",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}
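The behaviour above can be approximated offline. The following is a minimal sketch, not the real implementation: the standard analyzer uses Unicode text segmentation, while this stand-in only uses the regular expression `\w+` plus lowercasing. The function name `approx_standard` is hypothetical. For this ASCII sample sentence it yields the same token strings as the response above.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_standard(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer: word runs, lowercased."""
    # \w+ keeps letters and digits and drops punctuation such as "-" and ".".
    # Assumption: this is only a regex approximation of Unicode segmentation.
    return [match.group(0).lower() for match in re.finditer(r"\w+", text)]


print(approx_standard(TEXT))
```

Note that `\w` also matches the underscore, so the sketch will diverge from Elasticsearch on other inputs.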

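simple analyzer: splits text at every non-letter character and lowercases each token. Digits are not letters, so the "2" is dropped.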

GET _analyze

{

  "analyzer": "simple",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
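A small sketch of the same idea in Python, assuming that "letter" can be approximated by the character class `[^\W\d_]` (word characters minus digits and underscore). The name `approx_simple` is hypothetical and this is not the Lucene implementation.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_simple(text: str) -> list[str]:
    """Rough stand-in for the simple analyzer: letter runs, lowercased."""
    # [^\W\d_]+ matches runs of letters only, so "2", "-" and "." act as separators.
    return [match.group(0).lower() for match in re.finditer(r"[^\W\d_]+", text)]


print(approx_simple(TEXT))
```

Compared with the standard analyzer output, the only difference for this sentence is the missing "2" token.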

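whitespace analyzer: splits text on whitespace only and applies no further processing, so case, hyphens and the trailing period are preserved.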

GET _analyze

{

  "analyzer": "whitespace",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "Quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn-foxes",

      "start_offset" : 16,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening.",

      "start_offset" : 62,

      "end_offset" : 70,

      "type" : "word",

      "position" : 11

    }

  ]

}
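Because the whitespace analyzer does nothing beyond splitting, it is easy to mirror in Python, including the offsets. This is a sketch with a hypothetical function name, `whitespace_tokens`. The offsets are Python string indices, which match the response above for this ASCII sentence.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def whitespace_tokens(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start_offset, end_offset) for each whitespace-separated chunk."""
    # \S+ matches maximal runs of non-whitespace characters; nothing is lowercased or stripped.
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token in whitespace_tokens(TEXT):
    print(token)
```

The tokens "Quick", "brawn-foxes" and "evening." come out unchanged, which is why this analyzer suits fields where exact surface forms matter.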

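stop analyzer: tokenizes like the simple analyzer and then removes stop words such as "in" and "the". The positions of the removed words are left as gaps (note the jump from position 7 to 10).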

GET _analyze

{

  "analyzer": "stop",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
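The position gaps can be reproduced with a short sketch. The stop-word set below is a small illustrative subset chosen for this example, not the full default list that Elasticsearch ships, and `approx_stop` is a hypothetical name.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

# Assumption: illustrative subset only, not the complete default English stop-word list.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def approx_stop(text: str, stop_words: frozenset = STOP_WORDS) -> list[tuple[str, int]]:
    """Return (token, position) pairs; removed stop words still consume a position."""
    kept = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        word = match.group(0).lower()
        if word not in stop_words:
            kept.append((word, position))
    return kept


for token, position in approx_stop(TEXT):
    print(position, token)
```

Keeping the original positions matters for phrase queries, which rely on the distance between tokens.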

keyword analyzer: performs no tokenization and emits the entire input as a single token.

GET _analyze

{

  "analyzer": "keyword",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2 running Quick brawn-foxes leap over lazy dogs in the summer evening.",

      "start_offset" : 0,

      "end_offset" : 70,

      "type" : "word",

      "position" : 0

    }

  ]

}

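pattern analyzer: splits text with a regular expression. The default pattern is \W+ (one or more non-word characters), and tokens are lowercased by default.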

GET _analyze

{

  "analyzer": "pattern",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 12

    }

  ]

}
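The default behaviour maps closely onto `re.split`. The sketch below assumes the default pattern `\W+` with lowercasing enabled. `re.ASCII` is used so that `\W` is limited to ASCII word characters, which is an assumption about how closely Python's regex semantics match Java's here. The name `approx_pattern` is hypothetical.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_pattern(text: str, pattern: str = r"\W+") -> list[str]:
    """Split on the separator pattern, lowercase, and drop empty fragments."""
    parts = re.split(pattern, text.lower(), flags=re.ASCII)
    # The trailing "." produces an empty string at the end, which is filtered out.
    return [part for part in parts if part]


print(approx_pattern(TEXT))
```

Unlike the simple analyzer, digits count as word characters here, so "2" survives as a token.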

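english analyzer: a language analyzer for English. In addition to tokenizing and lowercasing, it removes English stop words ("in", "the") and stems each token ("running" becomes "run", "foxes" becomes "fox", "lazy" becomes "lazi", "evening" becomes "even").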

GET _analyze

{

  "analyzer": "english",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "run",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "fox",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazi",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dog",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "even",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}

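Chinese text with the standard analyzer: there is no word-level segmentation, so every Chinese character becomes its own token.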

POST _analyze

{

  "analyzer": "standard",

  "text": "他说的确实在理"

}

{

  "tokens" : [

    {

      "token" : "他",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<IDEOGRAPHIC>",

      "position" : 0

    },

    {

      "token" : "说",

      "start_offset" : 1,

      "end_offset" : 2,

      "type" : "<IDEOGRAPHIC>",

      "position" : 1

    },

    {

      "token" : "的",

      "start_offset" : 2,

      "end_offset" : 3,

      "type" : "<IDEOGRAPHIC>",

      "position" : 2

    },

    {

      "token" : "确",

      "start_offset" : 3,

      "end_offset" : 4,

      "type" : "<IDEOGRAPHIC>",

      "position" : 3

    },

    {

      "token" : "实",

      "start_offset" : 4,

      "end_offset" : 5,

      "type" : "<IDEOGRAPHIC>",

      "position" : 4

    },

    {

      "token" : "在",

      "start_offset" : 5,

      "end_offset" : 6,

      "type" : "<IDEOGRAPHIC>",

      "position" : 5

    },

    {

      "token" : "理",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "<IDEOGRAPHIC>",

      "position" : 6

    }

  ]

}
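The per-character result can be mimicked in a few lines. This sketch assumes the input consists only of CJK ideographs from the Basic Multilingual Plane, so that one Python character corresponds to one offset unit. The function name `per_character_tokens` is hypothetical.

```python
def per_character_tokens(text: str) -> list[dict]:
    """Emit one token per character, mirroring the response shape shown above."""
    # Assumption: every character is a single CJK ideograph (no spaces, punctuation or Latin text).
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]


for token in per_character_tokens("他说的确实在理"):
    print(token)
```

Since single characters carry little meaning on their own, word-aware Chinese analysis generally requires a dedicated analyzer plugin rather than the standard analyzer.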
