Lucene in action 笔记 term vector——针对特定field建立的词频向量空间，不存！不会！影响搜索，其作用是告诉我们搜索结果是“如何”匹配的，用以提供高亮、计算相似度，在VSM模型中评分计算

摘自：http://makble.com/what-is-term-vector-in-lucene

given a document, find all its terms and the positions information of these terms. Index tell us which document matched , term vector tells us how and where its matched. A classic example is search result highlighting. The term vector contains all the necessary information let us do this. See how to high light a blog post with Lucene 6.0.0 and Gradle build How to do Lucene search highlight example

Another interesting thing we can do with term vector is find similar documents of a particular document, for example the "related posts" feature in a blog entry which is a list of links point to other documents similar to current blog entry. With term vector information we can actually calculate how much two documents similar with each other with a simple formula.

The term vector also play an important role when scoring matching documents in vector space model.

The term vectors is like a micro version of inverted index against only one document. This index will answer such query: for a search term how many times it occurs in this document and where it show up? Or simply: frequencies and positions.

The term vector is generated in the analyzing process. When analyzer generate tokens, it also provide position and offset information . You can specify whether to store these information in term vectors:

TermVector.YES: Only store number of occurrences.

TermVector.WITH_POSITIONS: Store number of occurrence and positions of terms, but no offset.

TermVector.WITH_OFFSETS: Store number of occurrence and offsets of terms, but no positions.

TermVector.WITH_POSITIONS_OFFSETS:number of occurrence and positions , offsets of terms.

TermVector.NO:Don't store any term vector information.

If those information is not stored, you can also compute it on the fly when searching.

摘自：http://blog.csdn.net/fxjtoday/article/details/5142661

Leveraging term vectors
所谓term vector, 就是对于documents的某一field,如title,body这种文本类型的, 建立词频的多维向量空间.每一个词就是一维, 这维的值就是这个词在这个field中的频率.

如果你要使用term vectors, 就要在indexing的时候对该field打开term vectors的选项:

Field options for term vectors
TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
TermVector.NO – do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.

这样在index完后, 给定这个document id和field名称, 我们就可以从IndexReader读出这个term vector(前提是你在indexing时创建了terms vector):
TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
你可以遍历这个TermFreqVector去取出每个词和词频, 如果你在index时选择存下offsets和positions信息的话, 你在这边也可以取到.

有了这个term vector我们可以做一些有趣的应用:
1) Books like this
比较两本书是否相似,把书抽象成一个document文件, 具有author, subject fields. 那么就通过这两个field来比较两本书的相似度.
author这个field是multiple fields, 就是说可以有多个author, 那么第一步就是比author是否相同,
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery(); // #3
for (int i = 0; i < authors.length; i++) { // #3
    String author = authors[i]; // #3
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD); // #3
}
authorQuery.setBoost(2.0f);
最后还可以把这个查询的boost值设高, 表示这个条件很重要, 权重较高, 如果作者相同, 那么就很相似了.
第二步就用到term vector了, 这里用的很简单, 单纯的看subject field的term vector中的term是否相同,
TermFreqVector vector = // #4
reader.getTermFreqVector(id, "subject"); // #4
BooleanQuery subjectQuery = new BooleanQuery(); // #4
for (int j = 0; j < vector.size(); j++) { // #4
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD); // #4
}

2) What category?
这个比上个例子高级一点, 怎么分类了,还是对于document的subject, 我们有了term vector.
所以对于两个document, 我们可以比较这两个文章的term vector在向量空间中的夹角, 夹角越小说明这个两个document越相似.
那么既然是分类就有个训练的过程, 我们必须建立每个类的term vector作为个标准, 来给其它document比较.
这里用map来实现这个term vector, (term, frequency), 用n个这样的map来表示n维. 我们就要为每个category来生成一个term vector, category和term vector也可以用一个map来连接.创建这个category的term vector, 这样做:
遍历这个类中的每个document, 取document的term vector, 把它加到category的term vector上.
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
   }
}
首先从document的term vector中取出term和frequency的list, 然后从category的term vector中取每一个term, 把document的term frequency加上去.OK了

有了这个每个类的category, 我们就要开始计算document和这个类的向量夹角了
cos = A*B/|A||B|
A*B就是点积, 就是两个向量每一维相乘, 然后全加起来.
这里为了简便计算, 假设document中term frequency只有两种情况, 0或1.就表示出现或不出现

3) MoreLikeThis

对于找到比较相似的文档，lucene还提供了个比较高效的接口，MoreLikeThis接口

http://lucene.apache.org/Java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

对于上面的方法我们可以比较每两篇文档的余弦值，然后对余弦值进行排序，找出最相似的文档，但这个方法的最大问题在于计算量太大，当文档数目很大时，几乎是无法接受的，当然有专门的方法去优化余弦法，可以使计算量大大减少，但这个方法精确，但门槛较高。

这个接口的原理很简单，对于一篇文档中，我们只需要提取出interestingTerm（即tf×idf高的词），然后用lucene去搜索包含相同词的文档，作为相似文档，这个方法的优点就是高效，但缺点就是不准确，这个接口提供很多参数，你可以配置来选择interestingTerm。

Lucene in action 笔记 term vector——针对特定field建立的词频向量空间，不存！不会！影响搜索，其作用是告诉我们搜索结果是“如何”匹配的，用以提供高亮、计算相似度，在VSM模型中评分计算的更多相关文章

django 模型中的计算字段
models.py class Person(models.Model): family_name= models.CharField(max_length=20, verbose_name='姓') ...
MongoDB全文搜索——目前尚不支持针对特定field的搜索
> db.articles.createIndex( { subject: "text" } ) { "createdCollectionAutomatically ...
Elasticsearch系列---Term Vector工具探查数据
概要本篇主要介绍一个Term Vector的概念和基本使用方法. term vector是什么? 每次有document数据插入时,elasticsearch除了对document进行正排.倒排索引 ...
超计算（Hyper computation）模型
超计算(Hyper computation)模型作者:Xyan Xcllet链接:https://www.zhihu.com/question/21579465/answer/106995708来源 ...
Solr In Action 笔记(2) 之评分机制(相似性计算)
Solr In Action 笔记(2) 之评分机制(相似性计算) 1 简述我们对搜索引擎进行查询时候,很少会有人进行翻页操作.这就要求我们对索引的内容提取具有高度的匹配性,这就搜索引擎文档的相似性 ...
一个基于特征向量的近似网页去重算法——term用SVM人工提取训练，基于term的特征向量，倒排索引查询相似文档，同时利用cos计算相似度
摘要在搜索引擎的检索结果页面中,用户经常会得到内容相似的重复页面,它们中大多是由于网站之间转载造成的.为提高检索效率和用户满意度,提出一种基于特征向量的大规模中文近似网页检测算法DDW(Det ...
基于MATLAB实现的云模型计算隶属度
”云”或者’云滴‘是云模型的基本单元,所谓云是指在其论域上的一个分布,可以用联合概率的形式(x, u)来表示云模型用三个数据来表示其特征期望:云滴在论域空间分布的期望,一般用符号Εx表示. 熵:不 ...
《Lucene in Action 第二版》第4章节学习总结 -- Lucene中的分析
通过第四章的学习,可以了解lucene的分析过程是怎样的,并且可以学会如何使用lucene内置分析器,以及自定义分析器.下面是具体总结 1. 分析(Analysis)是什么? 在lucene中,分析就 ...
《Lucene in Action 第二版》第三章节的学习总结----IndexSearcher以及Term和QueryParser
本章节告诉我们怎么用搜索.通过这章节的学习,虽然搜索的内部原理不清楚,但是至少应该学会简单的编写搜索程序了本章节,需要掌握如下几个主要API1.IndexSearcher类:搜索索引的门户,发起者. ...

随机推荐

msp430项目编程47
msp430综合项目---有线采集传输平台系统47 1.电路工作原理 2.代码(显示部分) 3.代码(功能实现) 4.项目总结
max file descriptors [4096] for elasticsearch proess is too low, increase to at least [65536]
修改文件 /etc/security/limits.conf 加入以下两行: sonar hard nofile 65536 sonar soft nofile 65536 #备注:sonar这里是 ...
省赛i题/求1~n内所有数对（x,y）,满足最大公约数是质数的对数
求1~n内所有数对(x,y),gcd(x,y)=质数,的对数. 思路:用f[n]求出,含n的对数,最后用sum[n]求和. 对于gcd(x,y)=a(设x<=y,a是质数),则必有gcd(x/a ...
Unable to locate Attribute with the the given name [] on this ManagedType
最近在写Springboot+hibernate的项目,遇到这个问题好坑,查了半天没发现我哪里配置错了,后来发现实体类声明字段大小写敏感,把声明字段改成小写就行了
eclipse软件安装及python工程建立
原文地址:http://www.cnblogs.com/halfacre/archive/2012/07/22/2603848.html 安装python解释器安装PyDev: 首先需要去Eclip ...
Ubuntu 16.04通过Snap安装应用程序
16.04LTS可以说是一个不寻常的5年支持版本,同时也带来了Snap应用,并通过Snap可以安装众多的软件包.需要注意的是,Snap是一个全新的软件包架构,但是同样也比其它的软件包大很多. 简单的安 ...
【hql】spring data jpa中 @Query使用hql查询问题
spring data jpa中 @Query使用hql查询问题使用hql查询, 1.from后面跟的是实体类不是数据表名 2.字段应该用实体类中的字段而不是数据表中的属性实体如下 hql使 ...
Spring Boot 使用Java代码创建Bean并注冊到Spring中
从 Spring3.0 開始,添加了一种新的途经来配置Bean Definition,这就是通过 Java Code 配置 Bean Definition. 与Xml和Annotation两种配置方式 ...
eclipse Kepler tomcat内存溢出解决方式
使用eclipse开发ssh项目,本机8G内存,可是在打开一个表格后再打开一个页面.立即就内存溢出,网上搜到下面解决方式,未解决: 1.改动eclipse.ini參数 -vmargs -Xms1024 ...
plsql 无需配置客户端连接.
plsql 无需配置客户端连接. (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.5)(PORT=1521)))(C ...

Lucene in action 笔记 term vector——针对特定field建立的词频向量空间，不存！不会！影响搜索，其作用是告诉我们搜索结果是“如何”匹配的，用以提供高亮、计算相似度，在VSM模型中评分计算

Lucene in action 笔记 term vector——针对特定field建立的词频向量空间，不存！不会！影响搜索，其作用是告诉我们搜索结果是“如何”匹配的，用以提供高亮、计算相似度，在VSM模型中评分计算的更多相关文章

随机推荐

热门专题