Vector Space Model

The vector space model provides a way of comparing a multiterm query against a document. The output is a single score that represents how well the document matches the query. In order to do this, the model represents both the document and the query as vectors.

A vector is really just a one-dimensional array containing numbers, for example:

[1,2,5,22,3,8]

In the vector space model, each number in the vector is the weight of a term, as calculated with term frequency/inverse document frequency.

While TF/IDF is the default way of calculating term weights for the vector space model, it is not the only way. Other models like Okapi-BM25 exist and are available in Elasticsearch. TF/IDF is the default because it is a simple, efficient algorithm that produces high-quality search results and has stood the test of time.

Imagine that we have a query for “happy hippopotamus.” A common word like happy will have a low weight, while an uncommon term like hippopotamus will have a high weight. Let’s assume that happyhas a weight of 2 and hippopotamus has a weight of 5. We can plot this simple two-dimensional vector—[2,5]—as a line on a graph starting at point (0,0) and ending at point (2,5), as shown inFigure 27, “A two-dimensional query vector for “happy hippopotamus” represented”.

Figure 27. A two-dimensional query vector for “happy hippopotamus” represented

Now, imagine we have three documents:

  1. I am happy in summer.
  2. After Christmas I’m a hippopotamus.
  3. The happy hippopotamus helped Harry.

We can create a similar vector for each document, consisting of the weight of each query term—happy and hippopotamus—that appears in the document, and plot these vectors on the same graph, as shown in Figure 28, “Query and document vectors for “happy hippopotamus””:

  • Document 1: (happy,____________)[2,0]
  • Document 2: ( ___ ,hippopotamus)[0,5]
  • Document 3: (happy,hippopotamus)[2,5]

Figure 28. Query and document vectors for “happy hippopotamus”

The nice thing about vectors is that they can be compared. By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document. The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning that it is reasonably relevant, and document 3 is a perfect match.

In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra—the branch of mathematics that deals with vectors—provides tools to compare the angle between multidimensional vectors, which means that we can apply the same principles explained above to queries that consist of many terms.

You can read more about how to compare two vectors by using cosine similarity.

Now that we have talked about the theoretical basis of scoring, we can move on to see how scoring is implemented in Lucene.

ES搜索排序,文档相关度评分介绍——Vector Space Model的更多相关文章

  1. ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.

    Theory Behind Relevance Scoring Lucene (and thus Elasticsearch) uses the Boolean model to find match ...

  2. ES搜索排序,文档相关度评分介绍——Field-length norm

    Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...

  3. ES 文档与索引介绍

    在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...

  4. ES-PHP向ES批量添加文档报No alive nodes found in your cluster

    ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...

  5. atitit.vod search doc.doc 点播系统搜索功能设计文档

    atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...

  6. 认识DOM 文档对象模型DOM(Document Object Model)定义访问和处理HTML文档的标准方法。元素、属性和文本的树结构(节点树)。

    认识DOM 文档对象模型DOM(Document Object Model)定义访问和处理HTML文档的标准方法.DOM 将HTML文档呈现为带有元素.属性和文本的树结构(节点树). 先来看看下面代码 ...

  7. es之对文档进行更新操作

    5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...

  8. es搜索排序不正确

    沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...

  9. MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作

    映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...

随机推荐

  1. RF --系统关键字开发

    需求: 接收一个目录路径,自动遍历目录下以及子目录下的所有批处理(.bat) 文件并执行. 首先在..\Python27\Lib\site-packages 目录下创建 CustomLibrary 目 ...

  2. Centos6.0使用第三方YUM源(EPEL,RPMForge,RPMFusion)

    yum是centos下很方便的rpm包管理工具,配置第三方软件库使你的软件库更加丰富.以下简单的讲下配置的步骤. 首先,需要安装yum-priorities插件: yum install yum-pr ...

  3. 数据库sql的join多表

    摘录文章 SQL join 用于根据两个或多个表中的列之间的关系,从这些表中查询数据.注意,join后的数据记录数不一定就是左或右表的简单连接,图表只代表集合关系,在数量上并不准确,如这个条件后结果, ...

  4. 机器学习实战之SVM

    一引言: 支持向量机这部分确实很多,想要真正的去理解它,不仅仅知道理论,还要进行相关的代码编写和测试,二者想和结合,才能更好的帮助我们理解SVM这一非常优秀的分类算法 支持向量机是一种二类分类算法,假 ...

  5. Google论文BigTable拜读

    这周少打点dota2,争取把这篇论文读懂并呈现出来,和大家一起分享. 先把论文搞懂,然后再看下和论文搭界的知识,比如hbase,Chubby和Paxos算法. Bigtable: A Distribu ...

  6. PHP中输出文件,怎么区别什么时候该用readfile() , fread(), file_get_contents(), fgets()

    我在服务器端(Apache环境)上放了一个安卓apk安装包的下载链接,使用readfile()读取apk文件输出下载后,手机安装apk显示解析包错误.但apk本身没问题,下载后文件的大小也是完整的.服 ...

  7. bugzilla 系列1安装

    安装好mysql yum install gcc perl* mod_perl-devel -y wget https://ftp.mozilla.org/pub/mozilla.org/webtoo ...

  8. 简单理解ThreadLocal原理和适用场景

    https://blog.csdn.net/qq_36632687/article/details/79551828?utm_source=blogkpcl2 参考文章: 正确理解ThreadLoca ...

  9. 自己动手写CPU之第七阶段(2)——简单算术操作指令实现过程

    将陆续上传本人写的新书<自己动手写CPU>.今天是第25篇.我尽量每周四篇 亚马逊的预售地址例如以下,欢迎大家围观呵! http://www.amazon.cn/dp/b00mqkrlg8 ...

  10. java ResultSet获得总行数

    在Java中,获得ResultSet的总行数的方法有以下几种. 第一种:利用ResultSet的getRow方法来获得ResultSet的总行数 Statement stmt = con.create ...