ES Search Ranking and Document Relevance Scoring: TF/IDF (term frequency, inverse document frequency, and field-length norm are calculated and stored at index time)
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.

Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
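As a rough illustration of the Boolean model, the query above can be evaluated with set operations over an inverted index. This is only a sketch: the `index` contents and the `postings` helper are hypothetical, and real Lucene postings lists are far more elaborate.

```python
# Toy inverted index: term -> set of IDs of documents containing that term.
# The data is made up purely for illustration.
index = {
    "full": {1, 2, 3},
    "text": {1, 2, 3, 4},
    "search": {1, 2, 3, 5},
    "elasticsearch": {1, 6},
    "lucene": {2, 6},
}


def postings(term):
    """Return the set of document IDs for a term (empty if unseen)."""
    return index.get(term, set())


# full AND text AND search -> set intersection
required = postings("full") & postings("text") & postings("search")

# elasticsearch OR lucene -> set union
either = postings("elasticsearch") | postings("lucene")

# Combine both clauses with AND
matches = required & either
print(sorted(matches))
```

Documents 1 and 2 match; document 6 is excluded because it lacks the required terms, and document 3 because it contains neither elasticsearch nor lucene.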
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
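The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.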
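The square root in the formula dampens the effect of repetition: five mentions weigh more than one, but not five times more. A minimal sketch (the function name `term_frequency` is our own, not part of any Elasticsearch API):

```python
import math


def term_frequency(frequency):
    """tf(t in d) = sqrt(frequency), as in the formula above."""
    return math.sqrt(frequency)


tf_one = term_frequency(1)   # term appears once in the field
tf_five = term_frequency(5)  # term appears five times in the field

print(tf_one, tf_five)
```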
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "index_options": "docs"
        }
      }
    }
  }
}
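Setting index_options to docs disables term frequencies and term positions for the field, so the field no longer counts how many times a term appears.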
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
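The inverse document frequency (idf) of term t is 1 plus the logarithm of the number of documents in the index divided by the number of documents that contain the term plus 1.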
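To see how the idf formula separates common terms from rare ones, here is a small sketch. The function name and the document counts are hypothetical, and we assume the natural logarithm for log:

```python
import math


def inverse_document_frequency(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)); natural log assumed."""
    return 1 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000  # hypothetical collection size

# A term found in almost every document gets a low weight.
idf_common = inverse_document_frequency(num_docs, 999)

# A term found in only a few documents gets a high weight.
idf_rare = inverse_document_frequency(num_docs, 9)

print(idf_common, idf_rare)
```

With these made-up counts the common term's idf is 1.0, while the rare term's is several times higher.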