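Lucene - Custom Sorting with CustomScoreQuery

Some scenarios call for a custom sort order that is neither a sort on a single-valued field nor a sort by text relevance. Besides writing your own Collector or Weight, you can use CustomScoreQuery.

Scenario: sort by the number of tags in the tag field (the more tags the field contains, the higher the score).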

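import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CustomScoreTest {

    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        IndexWriter writer = new IndexWriter(dir, conf);

        // Term vectors must be stored: the custom score is computed from them.
        FieldType type1 = new FieldType();
        type1.setIndexed(true);
        type1.setStored(true);
        type1.setStoreTermVectors(true);

        // Document 1: three tags.
        Field field1 = new Field("f1", "fox", type1);
        Field field2 = new Field("tag", "fox1 fox2 fox3 ", type1);
        Document doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);

        // Document 2: one tag (the Field instances are reused with new values).
        field1.setStringValue("fox");
        field2.setStringValue("fox1");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);

        // Document 3: four tags.
        field1.setStringValue("fox");
        field2.setStringValue("fox1 fox2 fox3 fox4");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);

        writer.commit();
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new MatchAllDocsQuery();
        CountingQuery customQuery = new CountingQuery(query);

        // Baseline: search with the plain query.
        // Pass customQuery instead to apply the custom score.
        int n = 10;
        TopDocs tds = searcher.search(query, n);
        ScoreDoc[] sds = tds.scoreDocs;
        for (ScoreDoc sd : sds) {
            System.out.println(searcher.doc(sd.doc));
        }
    }
}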

Test result (default order, before the custom score is applied):

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>

Custom scoring:

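// CountingQuery.java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    // Called once per index segment; the provider rescores that segment's hits.
    @Override
    protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider(context, "tag");
    }
}

// CountingQueryScoreProvider.java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queries.CustomScoreProvider;

public class CountingQueryScoreProvider extends CustomScoreProvider {

    private final String field;

    public CountingQueryScoreProvider(AtomicReaderContext context, String field) {
        super(context);
        this.field = field;
    }

    // The score is the number of distinct terms in the field's term vector;
    // the sub-query score is ignored.
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        IndexReader r = context.reader();
        Terms tv = r.getTermVector(doc, field);
        int numTerms = 0;
        if (tv != null) { // null when the document has no term vector for this field
            TermsEnum termsEnum = tv.iterator(null);
            while (termsEnum.next() != null) {
                numTerms++;
            }
        }
        return (float) numTerms;
    }
}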

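Usage (pass the wrapping query to the searcher instead of the original query):

CountingQuery customQuery = new CountingQuery(query);
TopDocs tds = searcher.search(customQuery, n);

The test result is now: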

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>
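The ordering above can be sanity-checked without Lucene. The following is a minimal plain-Java sketch that models the rule the score provider applies, under the assumption that the term vector holds one entry per distinct whitespace-separated token. The class name TagCountModel and its methods are hypothetical and are not part of Lucene or of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the scoring rule; no Lucene classes are involved.
class TagCountModel {

    // Hypothetical helper: number of distinct whitespace-separated tokens,
    // mirroring what a loop over a term vector's TermsEnum would count.
    static int uniqueTermCount(String tagValue) {
        Set<String> terms = new HashSet<>();
        for (String token : tagValue.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms.size();
    }

    // Returns a copy of the values ordered by descending distinct-term count.
    static List<String> rank(List<String> tagValues) {
        List<String> ranked = new ArrayList<>(tagValues);
        ranked.sort((a, b) -> Integer.compare(uniqueTermCount(b), uniqueTermCount(a)));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("fox1 fox2 fox3 ", "fox1", "fox1 fox2 fox3 fox4");
        for (String tags : rank(docs)) {
            System.out.println(uniqueTermCount(tags) + " <- \"" + tags + "\"");
        }
    }
}
```

Running main lists the three tag values with counts 4, 3 and 1, the same order as the Lucene result. Note that a repeated tag is counted once in this model, which is also what counting term-vector entries would do.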

//-----------------------

weight/score/similarity

collector

Main reference:

http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Snapshot:

One item stands out on that list as a little low-level but not quite as bad as building a custom Lucene query: CustomScoreQuery. When you implement your own Lucene query, you’re taking control of two things:

Matching – what documents should be included in the search results
Scoring – what score should be assigned to a document (and therefore what order should they appear in)
Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.
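The split between matching and scoring can be sketched in plain Java. This is only a conceptual model, not Lucene code: the class RescoreModel and its search method are hypothetical, with a predicate standing in for the wrapped query's matching and a function standing in for the custom score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Conceptual model only: matching is a predicate, scoring is a function.
class RescoreModel {

    // Keeps the matching decision untouched and orders the hits by the
    // supplied rescoring function, highest score first.
    static List<String> search(List<String> docs,
                               Predicate<String> matches,
                               ToDoubleFunction<String> rescore) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (matches.test(doc)) { // matching: decided by the wrapped "query"
                hits.add(doc);
            }
        }
        // scoring: decided only by the custom function
        hits.sort((a, b) -> Double.compare(rescore.applyAsDouble(b), rescore.applyAsDouble(a)));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("star-trek", "star-wars dune", "star-trek tng borg");
        List<String> hits = search(
                docs,
                doc -> Arrays.asList(doc.split("\\s+")).contains("star-trek"),
                doc -> doc.split("\\s+").length);
        System.out.println(hits);
    }
}
```

In this model the "star-wars dune" entry never appears in the hits because it does not match, while the two matching entries are ordered purely by their tag count.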

For example, let’s say you’re searching our favorite dataset – SciFi Stackexchange, a Q&A site dedicated to nerdy SciFi and Fantasy questions. The posts on the site are tagged by topic: “star-trek”, “star-wars”, etc. Let’s say for whatever reason we want to search for a tag and order the results by the number of tags, such that questions with the most tags are sorted to the top.

In this example, a simple TermQuery could be sufficient for matching. To identify the questions tagged Star Trek with Lucene, you’d simply run the following query:

Term termToSearch = new Term("tag", "star-trek");

TermQuery starTrekQ = new TermQuery(termToSearch);

// IndexSearcher.search needs the number of hits to return
searcher.search(starTrekQ, 10);

If we examined the order of the results of this search, they’d come back in default TF-IDF order.

With CustomScoreQuery, we can intercept the matching query and assign a new score to it thus altering the order.

Step 1 Override CustomScoreQuery To Create Our Own Custom Scored Query Class:

(note this code can be found in this github repo)

public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    protected CustomScoreProvider getCustomScoreProvider(
            AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider("tag", context);
    }
}

Notice the code for “getCustomScoreProvider”: this is where we’ll return an object that will provide the magic we need. It takes an AtomicReaderContext, which is a wrapper on an IndexReader. If you recall, this hooks us into all the data structures available for scoring a document: Lucene’s inverted index, term vectors, etc.

Step 2 Create CustomScoreProvider

The real magic happens in CustomScoreProvider. This is where we’ll rescore the document. I’ll show you a boilerplate implementation before we dig in:

public class CountingQueryScoreProvider extends CustomScoreProvider {

    String _field;

    public CountingQueryScoreProvider(String field, AtomicReaderContext context) {
        super(context);
        _field = field;
    }

    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        return 1.0f;
    }
}

This CustomScoreProvider rescores all documents by returning a 1.0 score for them, thus negating their default relevancy sort order.
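To see what a constant score does to the order, here is a small plain-Java sketch. It assumes that hits with equal scores are returned in ascending document id order, which is how Lucene's default top-docs collection breaks ties as far as I know. The class ConstantScoreModel and its Hit type are hypothetical and only model the behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of ranking with and without a constant rescore.
class ConstantScoreModel {

    // Hypothetical hit record: a document id and its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Orders by descending score; equal scores fall back to ascending doc id
    // (assumed tie-break, see the lead-in).
    static List<Hit> topDocs(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        return sorted;
    }

    // Models the boilerplate provider: every hit gets the score 1.0f.
    static List<Hit> rescoreConstant(List<Hit> hits) {
        List<Hit> rescored = new ArrayList<>();
        for (Hit hit : hits) {
            rescored.add(new Hit(hit.doc, 1.0f));
        }
        return rescored;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit(0, 0.3f), new Hit(1, 0.9f), new Hit(2, 0.5f));
        for (Hit hit : topDocs(hits)) {
            System.out.println("relevance order: doc " + hit.doc);
        }
        for (Hit hit : topDocs(rescoreConstant(hits))) {
            System.out.println("constant score:  doc " + hit.doc);
        }
    }
}
```

With the original scores the model ranks the documents 1, 2, 0; once every score is 1.0 the relevance information is gone and the documents fall back to id order 0, 1, 2.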

Step 3 Implement Rescoring

With TermVectors on for our field, we can simply loop through and count the tokens in the field:

public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException
{
    IndexReader r = context.reader();
    Terms tv = r.getTermVector(doc, _field);
    int numTerms = 0;
    if (tv != null) { // no term vector stored for this document/field
        TermsEnum termsEnum = tv.iterator(null);
        while ((termsEnum.next()) != null) {
            numTerms++;
        }
    }
    return (float)(numTerms);
}

And there you have it, we’ve overridden the score of another query! If you’d like to see a full example, see my “lucene-query-example” repository that has this as well as my custom Lucene query examples.
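The implementation above ignores the subQueryScore argument entirely. If the wrapped query's relevance should still influence the order, one possible variant is to blend the two values. The sketch below is an assumption for illustration only: the class ScoreBlend, the method blend and the log-damped formula are hypothetical and do not come from the article or from Lucene.

```java
// Plain-Java sketch of one way to combine a relevance score with a term count.
class ScoreBlend {

    // Hypothetical blend: boosts the sub-query score by a log-damped factor
    // of the term count, so a count of 0 leaves the score unchanged.
    static float blend(float subQueryScore, int numTerms) {
        return subQueryScore * (1.0f + (float) Math.log1p(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(blend(1.0f, 0));
        System.out.println(blend(1.0f, 1));
        System.out.println(blend(1.0f, 4));
    }
}
```

In a real provider the body of customScore could return such a blended value instead of the raw count. Whether damping with a logarithm is appropriate depends on how much weight the tag count should carry relative to relevance.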

CustomScoreQuery Vs A Full-Blown Custom Query

Creating a CustomScoreQuery is a much easier thing to do than implementing a complete query. There are A LOT of ins-and-outs for implementing a full-blown Lucene query. So when creating a custom matching behavior isn’t important and you’re only rescoring another Lucene query, CustomScoreQuery is a clear winner. Considering how frequently Lucene based technologies are used for “fuzzy” analytics, I can see using CustomScoreQuery a lot when the regular tricks don’t pan out.
