Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.

BlockTree terms dictionary

The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreqtotalTermFreq, etc.), as well as the postings (documents, offsets, postings and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.

The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.

In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain especially when the index is cold (the pages are not cached by the the OS's IO cache) since it avoids a costly disk-seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!

Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.

Testing Performance

I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.

Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.

I tested the following identifier sources:

For the UUIDs and Flake IDs I also tested binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads.

UUID K lookups/sec, 1 thread0150300450600Zero-pad sequentialUUID v1 [binary]NanotimeUUID v1SequentialFlake [binary]FlakeUUID v4 [binary]UUID v4

ID Source K lookups/sec, 1 thread
Zero-pad sequential 593.4
UUID v1 [binary] 509.6
Nanotime 461.8
UUID v1 430.3
Sequential 415.6
Flake [binary] 338.5
Flake 231.3
UUID v4 [binary] 157.8
UUID v4 149.4
 

Zero-padded sequential ids, encoded in binary are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.

But for most applications, sequential ids are not practical. The 2nd fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of the 64-bit timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.

I also separately tested varying the base from 10, 16, 36, 64, 256 and in general for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) would be best.

There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as hotspot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk-seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.

I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).

Finally, I used using Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google summer-of-code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!

 
转自:http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html

Choosing a fast unique identifier (UUID) for Lucene——有时间再看下的更多相关文章

  1. A Universally Unique IDentifier (UUID) URN Namespace

    w Network Working Group P. Leach Request for Comments: 4122 Microsoft Category: Standards Track M. M ...

  2. Atitit 深入了解UUID含义是通用唯一识别码 (Universally Unique Identifier),

    Atitit 深入了解UUID含义是通用唯一识别码 (Universally Unique Identifier), UUID1 作用1 组成1 全球唯一标识符(GUID)2 UUID 编辑 UUID ...

  3. java生成UUID通用唯一识别码 (Universally Unique Identifier)

    转自:http://blog.csdn.net/carefree31441/article/details/3998553 UUID含义是通用唯一识别码 (Universally Unique Ide ...

  4. (转)java生成UUID通用唯一识别码 (Universally Unique Identifier)

    (原文链接:http://blog.csdn.net/carefree31441/article/details/3998553)   UUID含义是通用唯一识别码 (Universally Uniq ...

  5. java生成UUID通用唯一识别码 (Universally Unique Identifier) 分类: B1_JAVA 2014-08-22 16:09 331人阅读 评论(0) 收藏

    转自:http://blog.csdn.net/carefree31441/article/details/3998553 UUID含义是通用唯一识别码 (Universally Unique Ide ...

  6. 全局唯一标识符(GUID,Globally Unique Identifier)

    全局唯一标识符(GUID,Globally Unique Identifier)是一种由算法生成的二进制长度为128位的数字标识符.GUID主要用于在拥有多个节点.多台计算机的网络或系统中.在理想情况 ...

  7. android unique identifier

    android get device mac address programmatically http://android-developers.blogspot.jp/2011/03/identi ...

  8. HTML5 + JS 网站追踪技术:帆布指纹识别 Canvas FingerPrinting Universally Unique Identifier,简称UUID

    1 1 1 HTML5 + JS  网站追踪技术:帆布指纹识别 Canvas FingerPrinting 1 一般情况下,网站或者广告联盟都会非常想要一种技术方式可以在网络上精确定位到每一个个体,这 ...

  9. 设备唯一标识方法(Unique Identifier):如何在Windows系统上获取设备的唯一标识 zz

    原文地址:http://www.vonwei.com/post/UniqueDeviceIDforWindows.html 唯一的标识一个设备是一个基本功能,可以拥有很多应用场景,比如软件授权(如何保 ...

随机推荐

  1. 基于【 SpringBoot】一 || QQ授权流程

    一.准备工作 1.qq开放平台应用申请,获取APP ID和APP Key 2.qq开放平台配置回调地址 二.服务器端生成授权链接 1.请求地址 https://graph.qq.com/oauth2. ...

  2. 节日营销!这样搞-App运营日常

    节日送礼需求日益增长,当儿女们有了购买需求的时候,商家如何突出重围,成为孝子们的首选?如何做好节日营销?几个经验分享一下: 1.抓住节日特色 结合节日风格特色,营造节日气氛,如母亲节这种节日,主要体现 ...

  3. Lwip与底层的接口

    Lwip有三套api,分别是: raw api:使用方法为使用回调函数,即先注册一个函数,当接受到数据之后调用这个函数.缺点是对于数据连续处理不好. Lwip api:把接收与处理放在一个线程里面.因 ...

  4. 基于SpringBoot的多模块项目引入其他模块时@Autowired无法注入其他模块stereotype注解类对象的问题解决

    类似问题: 关于spring boot自动注入出现Consider defining a bean of type 'xxx' in your configuration问题解决方案 排查原因总结如下 ...

  5. CentOS7安装CDH 第八章:CDH中对服务和机器的添加与删除操作

    相关文章链接 CentOS7安装CDH 第一章:CentOS7系统安装 CentOS7安装CDH 第二章:CentOS7各个软件安装和启动 CentOS7安装CDH 第三章:CDH中的问题和解决方法 ...

  6. java - day015 - 手写双向链表, 异常(续), IO(输入输出)

    类的内存分配 加载到方法区 对象在堆内存 局部变量在栈内存 判断真实类型,在方法区加载的类 对象.getClass(); 类名.class; 手写双向链表 package day1501_手写双向链表 ...

  7. redis windows安装与liunx安装

    windows安装redis 2.把安装包放在Linux文件系统下,利用WinSCP工具 3.解压缩 tar -zxf redis-4.0.2.tar.gz 4.切换到解压后的目录 cd redis- ...

  8. Light OJ - 1026 - Critical Links(图论-Tarjan算法求无向图的桥数) - 带详细注释

     原题链接   无向连通图中,如果删除某边后,图变成不连通,则称该边为桥. 也可以先用Tajan()进行dfs算出所有点 的low和dfn值,并记录dfs过程中每个 点的父节点:然后再把所有点遍历一遍 ...

  9. 常考JS题笔记

    ### 1. 原始类型有哪几种?null 是对象吗? 答: Null,undefined,Number,String,Blooean,symbol1)[理解和使用ES6中的Symbol][https: ...

  10. BZOJ 3162 / Luogu P4895: 独钓寒江雪 树hash+DP

    题意 给出一棵无根树,求本质不同的独立集数模100000000710000000071000000007的值. n≤500000n\le 500000n≤500000 题解 如果是有根树就好做多了.然 ...