Compact Projection: Simple and Efficient Near Neighbor Search with Practical memory Requirement

Autor:Kerui Min1,2, Linjun Yang2, John Wright2, Lei Wu2,3, Xian-Sheng Hua2, Yi Ma2

Institute: 1 School of Computer Science, Fudan University  krmin@fudan.edu.cn

                  2 Microsoft Research Asia    {linjury,jowrig,xshua,mayi}@microsoft.com

                  3 MOE-MS Keynote Lab of MCC, University of Science and Technology of China  wulei@live.com

《压缩映射:根据实际内存需要的 一个简单而有效的 线性最近邻搜索方法》

Image similarity search is a fundamental problem in   computer vision. Efficient similarity search across large image databases depends critically on the availability of compact image representations and good data structures for indexing them.  Numerous
approaches to the problem of generating and indexing image codes have been presented in the literature, but existing schemes generally lack explicit estimates of the number of bits needed to effectively index a given large image database. We present a very
simple algorithm for generating compact binary representations of imagery data, based on random projections. Our analysis gives the first explicit bound on the number of bits needed to effectively solve the indexing problem. When applied to real image search
tasks, these theoretical improvements translate into practical performance gains: experimental resultsshow that the new method, while using significantly less memory, is several times faster than existing alternatives.

图像相似度搜索是图像处理中的基本问题。对于大数据结构的 有效的相似性检索 严重依赖于图像表达的压缩可行性和 一个好的数据结构索引。生成和索引图像编码的很多方法被提出,但是 现存的算法无法给出 索引一个大数据库 精确的内存需要值。我们提出了 基于随机映射的 一个简单的 图像数据的二进制压缩表达。 我们的分析给出了 开创性的 清晰地陈述 解决索引问题的 内存需要。 当应用于真实图像数据库上时,这些原理有了下面显著的提高:实验结果显示出新方法 比其他现有方法 利用更少内存,并且快几倍。

1. Introduction

The growing popularity of online image and video sharing sites has made efficiently indexing and searching large image databases an extremely active area of research in recent  years [5]. Much of this research has focused on the fundamental problem of retrieving
images that are as similar as possible to a given query image. This similarity search is a key component of many commercial image search engines, helping to organize results and remove near duplicates. Current state-of-the-art systems for similarity search
generally built around approximately invariant image descriptors such as SIFT [10] or GIST [12]. These features are quantized

and their statistics are used to efficiently index the image database, using an inverted file system (see Figure 1). This methodology finds application throughout the computer vision literature, including areas that are not traditionally considered part of
image search, such as 3D reconstruction [20] and face recognition [21].

近几年大量增长的在线视频和图片分享网站产生了大量的有效的图像索引数据库。很多这些研究聚焦于 给出图像的 图像相似性检索,这种相似性检索是很多商业引擎的关键元素,用于帮助组织检索结果和移动相似性拷贝。现存的先进的相似性检索系统 主要围绕 旋转不变性描绘子 比如 SIFT 和全局SIFT。这些特征被量化 并用于来有效地 统计索引图像数据库,基于一个倒排文件系统。这种方法的贯通了 视觉的领域,包括 传统的部分图像检索,并且用于3D重建 和人脸识别。

The practicality of such image retrieval systems depends critically on two factors: the quality of the image representation employed, and the efficiency of the indexing system. Unfortunately, these two factors are often in tension.Research in image descriptors
generally suggests that better discrimination can be achieved with higher dimensional  feature descriptors. However, finding nearest neighbors in the high-dimensional feature space is impractical for webscale image databases. The naive nearest neighbor algorithm
suffers from the “curse of dimensionality”, as do classical alternatives such as KD-trees [13] and its variants [6, 11], R-trees [7] etc.

实际的图像检索系统严重依赖于两个因素:图像表达的质量 和索引系统的质量,不幸的是,这两个因素经常很受压力。在图像特征的研究上普遍显示更好的描述会造成更高维的特征向量,然而,寻找高维特征空间的最近邻方法 对于网站级别的图像数据是 不实际的。实际的近邻算法承受着“维数灾难”诅咒,同样 其他的经典算法如 K决策树和R-K决策树也有此问题。

The difficulty of exact nearest neighbor search has led to the development of many alternative strategies, with varying levels of practicality and theoretical optimality. Weber et al. [18] proposed a vector approximation scheme to accelerate the sequential
scan. Wang et al. [17] proposed a binary hash coding (BHC), which projected images from a high dimensional feature space into a low dimensional hash code space and then filtered similar images by nearest neighbor search. Weiss et al. [19] used a spectral graph
based method to find a best hashing code space in which Hamming distance is correlated with the semantic similarity. Torralba et al. [15] proposed to thumbnail the images into small code to speed up the similarity search in a large database. While many of
the above techniques demonstrate excellent performance on specific image retrieval tasks and databases, they generally lack theoretical guarantees or guidelines as to the number of bits needed to achieve a given level of performance. This lack of foundation
makes their application to new databases mostly a matter of trial-and-error.

随着各种水平的使用化和原理最优化发展,提取最近邻的研究 导致了其他策略的发展。Weber et al. [18] 提出了一个向量近似方案 来加快线性检索。Wang et al. [17]提出了一个二进制哈希编码,其把图像从一个高维特征空间投影到低维 哈希编码空间,然后利用最近邻搜索 过滤相似性图像。Weiss et al. [19]利用基于图谱的方法 在 与语义相似性有关的汉明距离空间 寻找最优哈希编码。Torralba et al. [15]提出一个极小化图像编码加快 大数据库的 相似性检索。
以上很多技巧在特定的数据检索任务和数据集上有良好的表现,但普遍 在理论上不能给出一个 能达到在给出的结果要求上  可计算的内存空间。这种缺陷导致应用于新的数据库 几乎是一个 尝试--试错过程(没有普遍性)。

On the other hand, researchers in the theoretical computer science community have developed approximate nearest neighbor algorithms with excellent performance guarantees, in some cases nearly optimal. The current state-of the-art in this area is arguably
the locality sensitive hashing (LSH) approach of [2, 4], which exploits the distance preserving properties of random projections. However, in spite of their appealing properties, there are practical drawbacks to this and related algorithms, most importantly
intensive memory requirements that hinder their application to real image search tasks. This difficulty manifests itself in theory as well as practice: state of-the-art schemes lack an explicit bound on the number of bits needed to effectively index a given
set of data.

    另一方面,理论计算机科学团体已经提出一个接近最优的近邻算法,在一些情况下接近最优。现有在此领域最好的 公认的是局部敏感哈希(LSH)[2, 4],它利用随机映射同时保留了距离;然而,尽管有吸引人的性能,但仍然有实际的缺陷,最终要的是内存敏感性 阻碍了算法的实际应用。这些困难在理论和实际上也显示出来:对于一个给定的数据集现有的方法无法给出一个内存占有的边界上界。

In this paper, we study a very simple algorithm for generating compact binary representations of a given image database. Like LSH, our algorithm is based on random

projections. However, it achieves much better space complexity. We prove an explicit bound on the number of bits needed to encode the data with a certain guaranteed retrieval performance. Experimental results show that our algorithm, while using significantly
less memory, is several times faster than existing alternatives.

在此文中,我们研究出一个简单的算法 用于一个给定的数据集给出压缩映射。相似于LSH,我们的算法基于随机映射,但在空间复杂度上表现较好。我们给出了一个数据集的映射编码的内存需要边界,实验结果显示内存少,速度快。

2. Problem Statement and Algorithm

Our algorithm takes as its input a large set of highd imensional vectors X = x1 . . . xn ∈ Rd. In the context of image retrieval, these could be feature descriptors representing n images. Given a query vector q ∈ Rd, the core retrieval task is to efficiently
return a vector x¤ ∈ X that is as similar as possible q: the distance d(q, x¤) is as small as possible. The correct distance d(·, ·) may vary depending on application and dataset. However, if the image features

are well-chosen, the ℓp normshave been observed to perform well for retrieval tasks [10,16]. In this paper, we focus on the Euclidean case p = 2.

  我们的算法 如下:

输入为高维特征向量集合:vectors X = x1 . . . xn ∈ Rd.(在基于内容的图像检索里面,为图像特征描述子)

给出一个 查询向量:vector q ∈ Rd .(核心任务是 查询出q的最近邻 x在Rd中,使d(x,q)最小,在此文中 取欧氏距离)

The most straightforward approach to this problem is to simply scan through the database and return the point x¤ ∈ X that minimizes d(q, xi), yielding an excellent

neighbor: d(q, x¤) = minx2X d(q, x). However, this approach requires O(dn) operations. In practical image retrieval scenarios, both n and d could be very large: web image

databases such as Flickr contain millions of images or more, while for excellent retrieval performance is generally believed to require high-dimensional image representations (e.g., d = 512 dimensions for the GIST descriptor [12]). This unfavorable complexity
rules out the trivial algorithm for all but the smallest image databases. Fortunately, in applications such as image search it is not necessary to return the point that is nearest to the query point; a set of well chosen near neighbors may suffice. This gives
the algorithm designer considerable freedom to trade off approximation error for search complexity. This problem can be formalized as follows:

Problem 1 ( c-approximate near neighbor (c-NN)). Given a setX of points in a d-dimensional space Rd, devise a data structure, which for any query point q ∈ Rd finds a point x¤ ∈ X such that d(q, x¤) ≤ c · minx2X d(q, x).

最直接的方法是线性扫描数据库返回距离值,寻找最小距离,产生最优近邻。这种方法要求的时间为(O(dn))。在实际的检索操作中n和d可能很大,并且图像特征必须维度很高才能区分大量图像(e.g., d = 512 dimensions for the GIST descriptor [12]).这种最优策略只能适用于很小的数据集。

幸运的是,诸如此类的图像搜索并不必要返回最近邻查询点,一个被选择的候选集便足够。这给了算法设计者足够的自由在检索错误率和搜索完备性之间一个权衡。可以描述如下:

(c相似性近邻):在D维空间的一系列点集Xi(向量),设计一个数据结构,对任意的查询点q,可以找到一个点使d(q,Xk)为min(q,X) ( X in Xi )

This problem has received a great deal of attention in the theoretical computer science community (see [14] and the references therein). Some of the simplest and best performing algorithms for solving it rely on the fact that in high dimensional spaces,
random projections preserve distances with very high probability. More precisely, if

we sample a random vector r ∈ Rd with components iid N(0, 1), then var(r · (x − y)) ∼ kx − yk22. Locality Sen-sitive Hashing (LSH) builds on this property to give a stateof-the-art approximate nearest neighbor scheme [2]. However, that algorithm relies on an
amplification technique that essentially trades off storage space for query time. The result is a scheme that is extremely memory intensive, requiring superlinear space, which limits its practical applicability to large image databases.

此问题受到理论计算机科学的极大注意(see [14] and the references therein)。许多简单有效的算法依赖于事实:随机映射很大概率上 保持了 样本距离。     .........LSH基于此属性给出了合适的最近邻方法[2]。但是方法本质上依赖于 储存空间和查询时间的折中。结果是此种方法是极度内存敏感的,依赖于超线性空间,影响了其实际应用。

重点:算法 本文的算法:

我们提出一个可以缓解LSH内存瓶颈的方法,利用一个更简单的hash方法。

Algorithm 1 :Compact Projection Generation

Input: Data samples x1 . . . xn ∈ Rd.
Output: Compact projections y1 . . . yn ∈ {0, 1}^k.

   Generate R ∈ R^k×d  with independent normal entries

    Rij ∼ N(0, 1).

     for each i = 1 . . . n do

         Compute Rxi

          Set yi = σ(Rxi).

     end for

压缩映射:简单最邻近搜索-(SLH)Simple Linear Hash的更多相关文章

  1. 机器学习:simple linear iterative clustering (SLIC) 算法

    图像分割是图像处理,计算机视觉领域里非常基础,非常重要的一个应用.今天介绍一种高效的分割算法,即 simple linear iterative clustering (SLIC) 算法,顾名思义,这 ...

  2. 简单选择排序(Simple Selection Sort)的C语言实现

    简单选择排序(Simple Selection Sort)的核心思想是每次选择无序序列最小的数放在有序序列最后 演示实例: C语言实现(编译器Dev-c++5.4.0,源代码后缀.cpp) 原创文章, ...

  3. STA 463 Simple Linear Regression Report

    STA 463 Simple Linear Regression ReportSpring 2019 The goal of this part of the project is to perfor ...

  4. LINEAR HASH Partitioning

    MySQL :: MySQL 8.0 Reference Manual :: 23.2.4.1 LINEAR HASH Partitioning https://dev.mysql.com/doc/r ...

  5. Java设计模式(1)——创建型模式之简单工厂模式(Simple Factory)

    设计模式系列参考: http://www.cnblogs.com/Coda/p/4279688.html 一.概述 工厂模式主要是为创建对象提供过渡接口,以便将创建对象的具体过程屏蔽隔离起来,达到提高 ...

  6. 设计模式之笔记--简单工厂模式(Simple Factory)

    简单工厂模式(Simple Factory) 类图 描述 简单工厂: 一个抽象产品类,可以派生多个具体产品类: 一个具体工厂类: 工厂只能创建一个具体产品. 应用场景 汽车接口 public inte ...

  7. 创建型模式(过渡模式) 简单工厂模式(Simple Factory)

    简单工厂模式(Simple Factory Pattern)属于类的创建型模式,又叫静态工厂方法模式(Static FactoryMethod Pattern) 是通过专门定义一个类来负责创建其他类的 ...

  8. PHP设计模式(一)简单工厂模式 (Simple Factory For PHP)

    最近天气变化无常,身为程序猿的寡人!~终究难耐天气的挑战,病倒了,果然,程序猿还需多保养自己的身体,有句话这么说:一生只有两件事能报复你:不够努力的辜负和过度消耗身体的后患.话不多说,开始吧. 一.什 ...

  9. 深入浅出设计模式——简单工厂模式(Simple Factory)

    介绍简单工厂模式不能说是一个设计模式,说它是一种编程习惯可能更恰当些.因为它至少不是Gof23种设计模式之一.但它在实际的编程中经常被用到,而且思想也非常简单,可以说是工厂方法模式的一个引导,所以我想 ...

随机推荐

  1. NYIST 760 See LCS again

    See LCS again时间限制:1000 ms | 内存限制:65535 KB难度:3 描述There are A, B two sequences, the number of elements ...

  2. (39.2). Spring Boot Shiro权限管理【从零开始学Spring Boot】

    (本节提供源代码,在最下面可以下载) (4). 集成Shiro 进行用户授权 在看此小节前,您可能需要先看: http://412887952-qq-com.iteye.com/blog/229973 ...

  3. eventlet学习笔记

    eventlet学习笔记 标签(空格分隔): python eventlet eventlet是一个用来处理和网络相关的python库函数,且可以通过协程(coroutines)实现并发.在event ...

  4. CODEVS——T 2969 角谷猜想

    http://codevs.cn/problem/2969/  时间限制: 1 s  空间限制: 32000 KB  题目等级 : 黄金 Gold 题解  查看运行结果     题目描述 Descri ...

  5. Mac下OpenCV开发

    1.         环境搭建 a)       安装Homebrew i.            下载地址:http://github.com/mxcl/homebrew/tarball/maste ...

  6. WPF学习笔记——DataContext 与 ItemSource

    作为一个WPF新手,在ListBox控件里,我分不清 DataContext 与 ItemSource的区别. 在实践中,似乎: <ListBox x:Name="Lst" ...

  7. linux(centos)下安装git并上传代码些许步骤(亲自验证过的步骤)

     曾经听说了好多次github,但直到近期才第一次学习使用github来托管自己在linux下的代码! 说实话.我自己在使用的时候从网上查了好多教程.但总认为难以掌握(步骤过于繁琐),自己操作的时候还 ...

  8. hdoj--5612--Baby Ming and Matrix games(dfs)

     Baby Ming and Matrix games Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 65536/65536 K ...

  9. Java数据结构2——深入JCF

    Java集合框架(JCF)参考C++的STL实现的在日常Java开发工作很常用的数据结构容器,有技术追求的人除了要会简单使用JCF之外,也要知道其底层的实现机制,知道它是如何实现的,为什么这样实现.就 ...

  10. 杂项-人物:Alan cooper

    ylbtech-杂项-人物:Alan cooper Alan Cooper ,“VB之父”“交互设计之父”,荣获视窗先锋奖(Microsoft Windows Pioneer)和软件梦幻奖(Softw ...