Theory Behind Relevance Scoring

Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.

Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.

Boolean Model

The Boolean model simply applies the ANDOR, and NOT conditions expressed in the query to find all the documents that match. A query for

full AND text AND search AND (elasticsearch OR lucene)

will include only documents that contain all of the terms fulltext, and search, and eitherelasticsearch or lucene.

This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.

Term Frequency/Inverse Document Frequency (TF/IDF)

Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.

The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.

Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:

tf(t in d) = √frequency 

The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.

If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:

PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}

Setting index_options to docs will disable term frequencies and term positions. A field with this mapping will not count how many times a term appears, and will not be usable for phrase or proximity queries. Exact-value not_analyzed string fields use this setting by default.

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1)) 

The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.

 

ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.的更多相关文章

  1. ES搜索排序,文档相关度评分介绍——Vector Space Model

    Vector Space Model The vector space model provides a way of comparing a multiterm query against a do ...

  2. ES搜索排序,文档相关度评分介绍——Field-length norm

    Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...

  3. ES 文档与索引介绍

    在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...

  4. es搜索排序不正确

    沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...

  5. ES-PHP向ES批量添加文档报No alive nodes found in your cluster

    ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...

  6. atitit.vod search doc.doc 点播系统搜索功能设计文档

    atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...

  7. es之对文档进行更新操作

    5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...

  8. MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作

    映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...

  9. rbac介绍、自动生成接口文档、jwt介绍与快速签发认证、jwt定制返回格式

    今日内容概要 RBAC 自动生成接口文档 jwt介绍与快速使用 jwt定制返回格式 jwt源码分析 内容详细 1.RBAC(重要) # RBAC 是基于角色的访问控制(Role-Based Acces ...

随机推荐

  1. [翻译]MySQL 文档: Control Flow Functions(控制流函数)

    本文翻译自13.4 Control Flow Functions Table 13.6 Flow Control Operators 名称 描述 CASE Case 运算符 IF() if/else ...

  2. 重读金典------高质量C编程指南(林锐)-------第六章 函数设计

    函数设计最重要的无外乎两个方面,一个是函数的接口设计一个是内部实现的一些规则. 在C语言中,函数的参数和返回值的传递方式分为两种: 值传递与指针传递.而C++中,多了一个引用传递. 引用传递有些像指针 ...

  3. Html5上传插件封装

    前段时间将flash的上传控件替换成使用纯js实现的,在此记录 1.创建标签 <div class="camera-area" style="display:inl ...

  4. (九)jQuery中的动画(载)

    原文链接:http://blog.csdn.net/zfy865628361/article/details/50358367 首先,用jQuery做动画效果要求在标准模式下,否则可能会引起动画抖动. ...

  5. glob (programming) and spool (/var/spool)

    http://en.wikipedia.org/wiki/Glob_(programming) In computer programming, in particular in a Unix-lik ...

  6. ubuntu16.04 Cmake学习二

    本节主要总结编译程序的时候使用了第三方库的情况,以调用开源opencv-2.4.9为例子,具体安装详见http://www.cnblogs.com/xsfmg/p/5900420.html. 工程文件 ...

  7. 允许局域网内其他主机访问本地MySql数据库

    mysql的root账户,我在连接时通常用的是localhost或127.0.0.1,公司的测试服务器上的mysql也是localhost所以我想访问无法访问,测试暂停. 解决方法如下: 1,修改表, ...

  8. 【Python】IDLE启动错误

    启动IDLE时报Subprocess Startup Error错误 错误信息 IDLE's subprocess didn't make connection.Either IDLE cant't ...

  9. CocoaPods Podfile详解与使用

    1.为什么需要CocoaPods 在进行iOS开发的时候,总免不了使用第三方的开源库,比如SBJson.AFNetworking.Reachability等等.使用这些库的时候通常需要: 下载开源库的 ...

  10. ios中实现对UItextField,UITextView等输入框的字数限制

    本文转载至 http://blog.sina.com.cn/s/blog_9bf272cf01013lsd.html 2011-10-05 16:48 533人阅读 评论(0) 收藏 举报 1.    ...