Writing analyzers

There are times when you would like to analyze text in a bespoke fashion, either by configuring how one of Elasticsearch’s built-in analyzers works, or by combining analysis components together to build a custom analyzer.

The analysis chain

An analyzer is built of three components:

  • 0 or more character filters
  • exactly 1 tokenizer
  • 0 or more token filters

Check out the Elasticsearch documentation on the Anatomy of an analyzer to understand more.
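The stages of the chain can be exercised directly with Elasticsearch's _analyze API by supplying ad-hoc components. A minimal sketch (the components chosen here are illustrative, not part of the examples that follow):

```json
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK brown fox</p>"
}
```

The character filter runs first and strips the HTML, the tokenizer splits the remaining text into tokens, and the token filters then transform each token, producing "the", "quick", "brown" and "fox".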

Specifying an analyzer on a field mapping

An analyzer can be specified on a text datatype field mapping when creating a new field on a type, usually when creating the type mapping at index creation time, but also when adding a new field using the Put Mapping API.

Although you can add new types to an index, or add new fields to a type, you can’t add new analyzers or make changes to existing fields. If you were to do so, the data that has already been indexed would be incorrect and your searches would no longer work as expected.

When you need to make changes to existing fields, you should look at reindexing your data with the Reindex API.
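A minimal sketch of such a reindex, assuming a new index my-index-v2 has already been created with the desired analysis settings:

```json
POST /_reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" }
}
```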

Here’s a simple example that specifies that the name field in Elasticsearch, which maps to the Name POCO property on the Project type, uses the whitespace analyzer at index time:

```csharp
var createIndexResponse = client.CreateIndex("my-index", c => c
    .Mappings(m => m
        .Map<Project>(mm => mm
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Name)
                    .Analyzer("whitespace")
                )
            )
        )
    )
);
```
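For reference, the JSON request body this produces is along the following lines:

```json
{
  "mappings": {
    "project": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}
```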

Configuring a built-in analyzer

Several built-in analyzers can be configured to alter their behaviour. For example, the standard analyzer can be configured to support a list of stop words with the stop token filter it contains.

Configuring a built-in analyzer requires creating an analyzer based on the built-in one:

```csharp
var createIndexResponse = client.CreateIndex("my-index", c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(aa => aa
                .Standard("standard_english", sa => sa
                    .StopWords("_english_")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Project>(mm => mm
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Name)
                    .Analyzer("standard_english")
                )
            )
        )
    )
);
```

Here, _english_ refers to the pre-defined list of English stop words within Elasticsearch, and the mapping then uses the standard_english analyzer configured in the index settings. The request above produces the following JSON:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_english": {
          "type": "standard",
          "stopwords": ["_english_"]
        }
      }
    }
  },
  "mappings": {
    "project": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "standard_english"
        }
      }
    }
  }
}
```

Creating a custom analyzer

A custom analyzer can be composed when none of the built-in analyzers fit your needs. A custom analyzer is built from the components that you saw in the analysis chain, plus a position increment gap, which determines the size of the gap that Elasticsearch should insert between array elements when a field can hold multiple values, e.g. a List<string> POCO property.
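As a sketch of the position increment gap, a multi-valued field can be mapped with an explicit gap; the tags field below is hypothetical, and 100 is Elasticsearch's default value:

```json
{
  "mappings": {
    "project": {
      "properties": {
        "tags": {
          "type": "text",
          "position_increment_gap": 100
        }
      }
    }
  }
}
```

A larger gap makes it less likely that a phrase query will match across the boundary between two array elements.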

For this example, imagine we are indexing programming questions, where the question content is HTML and contains source code:

```csharp
public class Question
{
    public int Id { get; set; }
    public DateTimeOffset CreationDate { get; set; }
    public int Score { get; set; }
    public string Body { get; set; }
}
```

Based on our domain knowledge of programming languages, we would like to be able to search questions that contain "C#", but using the standard analyzer, "C#" will be analyzed and produce the token "c". This won’t work for our use case as there will be no way to distinguish questions about "C#" from questions about another popular programming language, "C".
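This behaviour can be confirmed with the _analyze API:

```json
POST /_analyze
{
  "analyzer": "standard",
  "text": "C#"
}
```

which returns a single token, "c", since the standard tokenizer discards the # and the lowercase token filter lowercases what remains.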

We can solve our issue with a custom analyzer:

```csharp
var createIndexResponse = client.CreateIndex("questions", c => c
    .Settings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("programming_language", mca => mca
                    .Mappings(new[]
                    {
                        "c# => csharp",
                        "C# => Csharp"
                    })
                )
            )
            .Analyzers(an => an
                .Custom("question", ca => ca
                    .CharFilters("html_strip", "programming_language")
                    .Tokenizer("standard")
                    .Filters("standard", "lowercase", "stop")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Question>(mm => mm
            .AutoMap()
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Body)
                    .Analyzer("question")
                )
            )
        )
    )
);
```

Our custom question analyzer will apply the following analysis to a question body:

  1. strip HTML tags
  2. map C# to "Csharp" and c# to "csharp" (so the # is not stripped by the tokenizer)
  3. tokenize using the standard tokenizer
  4. filter tokens with the standard token filter
  5. lowercase tokens
  6. remove stop word tokens
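Once the questions index exists, this chain can be verified against it with the _analyze API; the question text here is illustrative:

```json
POST /questions/_analyze
{
  "analyzer": "question",
  "text": "<p>How do I read a file in C#?</p>"
}
```

Among the resulting tokens we would expect "csharp" (via the character filter mappings) and no "a" or "in" (removed as stop words).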

A full text query will also apply the same analysis to the query input against the question body at search time. When someone searches with the input "C#", the input will be analyzed and produce the token "csharp", matching a question body that contains "C#" (as well as "csharp" and case variants), because the search time analysis applied is the same as the index time analysis.

Index and search time analysis

With the previous example, we probably don’t want to apply the same analysis to the query input of a full text query against a question body; we know for our problem domain that a query input is not going to contain HTML tags, so we would like to apply different analysis at search time.

An analyzer can be specified when creating the field mapping to use at search time, in addition to the analyzer to use at index time:

```csharp
var createIndexResponse = client.CreateIndex("questions", c => c
    .Settings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("programming_language", mca => mca
                    .Mappings(new[]
                    {
                        "c# => csharp",
                        "C# => Csharp"
                    })
                )
            )
            .Analyzers(an => an
                .Custom("index_question", ca => ca
                    .CharFilters("html_strip", "programming_language")
                    .Tokenizer("standard")
                    .Filters("standard", "lowercase", "stop")
                )
                .Custom("search_question", ca => ca
                    .CharFilters("programming_language")
                    .Tokenizer("standard")
                    .Filters("standard", "lowercase", "stop")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Question>(mm => mm
            .AutoMap()
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Body)
                    .Analyzer("index_question")
                    .SearchAnalyzer("search_question")
                )
            )
        )
    )
);
```

Here, the index_question analyzer used at index time strips HTML tags, while the search_question analyzer used at search time does not.

With this in place, the text of a question body will be analyzed with the index_question analyzer at index time and the input to a full text query on the question body field will be analyzed with the search_question analyzer that does not use the html_strip character filter.

A search analyzer can also be specified per query, i.e. a different analyzer can be used for a particular request from the one specified in the mapping. This can be useful when iterating on and improving your search strategy.
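A sketch of overriding the analyzer for a single match query; the query text is illustrative:

```json
GET /questions/_search
{
  "query": {
    "match": {
      "body": {
        "query": "reading files in C#",
        "analyzer": "search_question"
      }
    }
  }
}
```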

Take a look at the analyzer documentation for more details around where analyzers can be specified and the precedence for a given request.
