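Xapian's In-Memory Index: Adding a Document

This post walks through what Xapian's in-memory index does while a document is being added.

It is mostly a trace of the functions involved, in execution order.

Demo code: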

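    Xapian::WritableDatabase db = Xapian::InMemory::open();
    Xapian::Document doc;

    // Add postings: "T" is the field prefix, the term text is 世界, and the position is 1.
    // Positions 2 and 3 are assumed here, following the word order of the text.
    doc.add_posting("T世界", 1);
    doc.add_posting("T体育", 2);
    doc.add_posting("T比赛", 3);

    // Set the document data.
    doc.set_data("世界体育比赛");

    // Add the term that uniquely identifies the document.
    doc.add_boolean_term(K_DOC_UNIQUE_ID);

    // Use replace_document so that at most one document carrying K_DOC_UNIQUE_ID exists in the index.
    Xapian::docid innerId = db.replace_document(K_DOC_UNIQUE_ID, doc);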

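1. Creating and populating the Document

Define a document object and call add_posting to add each term together with its position and wdfinc.

Implementation details:

1.1 need_terms first tries to load the terms the doc already has. If any are found, each term and its positions are recorded in terms.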

void Xapian::Document::Internal::need_terms() const {
    if (terms_here) {
        return;
    }
    if (database.get()) {
        Xapian::TermIterator t(database->open_term_list(did));
        Xapian::TermIterator tend(NULL);
        for ( ; t != tend; ++t) {
            Xapian::PositionIterator p = t.positionlist_begin();
            OmDocumentTerm term(t.get_wdf());
            for ( ; p != t.positionlist_end(); ++p) {
                term.append_position(*p);
            }
            terms.insert(make_pair(*t, term));
        }
    }
    termlist_size = terms.size();
    terms_here = true;
}

1.2 Adding a brand-new term: create a new term object, append the position to it, then insert it into terms.

void Xapian::Document::Internal::add_posting(const string & tname, Xapian::termpos tpos, Xapian::termcount wdfinc) {
    need_terms();
    positions_modified = true;
    std::map<std::string, OmDocumentTerm>::iterator i = terms.find(tname);
    if (i == terms.end()) {
        ++termlist_size;
        OmDocumentTerm newterm(wdfinc);
        newterm.append_position(tpos);
        terms.insert(make_pair(tname, newterm));
    } else {
        // The doc already has this term.
        if (i->second.add_position(wdfinc, tpos)) {
            ++termlist_size;
        }
    }
}

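1.3 Adding a term the doc already has: OmDocumentTerm::add_position adds the new position to the term's positions vector, which must end up in ascending order. Instead of doing a full sorted insert for every out-of-order position, it uses a batched-insertion trick that cuts down the number of comparisons and element moves, and it is worth reading. Note that positions is not necessarily fully sorted once all postings have been added; a final merge runs before the doc is added to the DB.

The trick: the usual way to add an element to a sorted array is a sorted insert: locate the insertion point, shift the later elements back, then write the new element. Each insert costs O(n), so n inserts cost O(n^2). The approach used here has the flavour of a multi-way merge:

(1) The data is split into a historical run and a new run. Both runs must stay in ascending order while data is added; when that cannot be maintained, the two runs are merged.

(2) If the incoming value fits the historical run (it keeps the order ascending and the new run is currently empty), it is appended to the end of that run.

(3) Otherwise, if it fits the new run (it keeps that run ascending), it is appended to the new run.

(4) If it fits neither, the new run is merged into the historical run. This is a merge of two ascending arrays and costs O(n+m). After the merge, steps (2), (3) and (4) apply again.

(5) Once all the data has been added, the new run may still not have been merged into the historical run. That merge is deferred until the doc is added to the db.

In the actual code the two runs are stored together in a single vector, and one variable (split) records where the historical run ends.

With this trick the worst-case time complexity is still O(n^2), but the actual running time is several times lower than inserting one number at a time.

The design idea is the same as the large/small (static/dynamic) index split that is common in search engine index stores.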
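Steps (1) to (5) can be sketched in isolation as follows. This is a simplified illustration, not Xapian's code: the class TwoRunPositions and its member names are hypothetical, and wdf handling and the "deleted term" case are left out. It assumes, as in the description above, a single vector plus a split index, and uses std::inplace_merge for the merge step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the position list kept per term.
// positions[0, split) is the historical sorted run and
// positions[split, end) is the new sorted run; split == 0 means one run only.
class TwoRunPositions {
  public:
    // Returns true if pos was stored, false if it was already present.
    bool add(unsigned pos) {
        if (positions.empty() || pos > positions.back()) {
            // Ascending input: append to whichever run is currently last,
            // unless the value already sits in the historical run.
            if (split != 0 &&
                std::binary_search(positions.begin(),
                                   positions.begin() + split, pos)) {
                return false;
            }
            positions.push_back(pos);
            return true;
        }
        if (pos == positions.back()) {
            return false;  // duplicate of the last entry
        }
        // Out of order: fold any pending run in first, so that a single
        // sorted run remains, then start a fresh run holding just pos.
        merge();
        if (std::binary_search(positions.begin(), positions.end(), pos)) {
            return false;
        }
        split = positions.size();
        positions.push_back(pos);
        return true;
    }

    // Merge the two sorted runs into one.
    void merge() {
        if (split == 0) return;
        std::inplace_merge(positions.begin(), positions.begin() + split,
                           positions.end());
        split = 0;
    }

    // Fully sorted view; performs the deferred merge from step (5).
    const std::vector<unsigned>& sorted() {
        merge();
        assert(std::is_sorted(positions.begin(), positions.end()));
        return positions;
    }

    // Index where the new run starts (0 when there is no pending run).
    std::size_t pending_run_start() const { return split; }

  private:
    std::vector<unsigned> positions;
    std::size_t split = 0;
};
```

Adding 1, 5, 9, 3, 4, 2 in that order leaves the vector as [1, 3, 4, 5, 9 | 2] with one pending run, and sorted() then yields 1, 2, 3, 4, 5, 9. Only two merges happen for three out-of-order values, which is where the saving over per-element insertion comes from.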

bool OmDocumentTerm::add_position(Xapian::termcount wdf_inc, Xapian::termpos tpos) {
    LOGCALL(DB, bool, "OmDocumentTerm::add_position", wdf_inc | tpos);
    if (rare(is_deleted())) {
        wdf = wdf_inc;
        split = 0;
        positions.push_back(tpos);
        return true;
    }
    wdf += wdf_inc;
    // Optimise the common case of adding positions in ascending order.
    if (positions.empty()) {
        positions.push_back(tpos);
        return false;
    }
    if (tpos > positions.back()) {
        if (split) {
            // Check for duplicate before split.
            auto i = lower_bound(positions.cbegin(), positions.cbegin() + split, tpos);
            if (i != positions.cbegin() + split && *i == tpos) {
                return false;
            }
        }
        positions.push_back(tpos);
        return false;
    }
    if (tpos == positions.back()) {
        // Duplicate of last entry.
        return false;
    }
    if (split > 0) {
        // We could merge in the new entry at the same time, but that seems to
        // make things much more complex for minor gains.
        merge();
    }
    // Search for the position the term occurs at.  Use binary chop to
    // search, since this is a sorted list.
    vector<Xapian::termpos>::iterator i = lower_bound(positions.begin(), positions.end(), tpos);
    if (i == positions.end() || *i != tpos) {
        auto new_split = positions.size();
        if (sizeof(split) < sizeof(Xapian::termpos)) {
            if (rare(new_split > numeric_limits<decltype(split)>::max())) {
                // The split point would be beyond the size of the type used to
                // hold it, which is really unlikely if that type is 32-bit.
                // Just insert the old way in this case.
                positions.insert(i, tpos);
                return false;
            }
        } else {
            // This assertion should always be true because we shouldn't have
            // duplicate entries and the split point can't be after the final
            // entry.
            AssertRel(new_split, <=, numeric_limits<decltype(split)>::max());
        }
        split = new_split;
        positions.push_back(tpos);
    }
    return false;
}

1.4 Setting the data

void Xapian::Document::Internal::set_data(const string &data_) {
    data = data_;
    data_here = true;
}

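2. Adding the Document to the in-memory DB

replace_document is used here to keep the document unique.

After basic argument checks, the code determines whether the database is made up of several sub-databases. If it is, it has to work out which sub-database the data is written to, and also delete any docs with the same unique_term that may exist in the other sub-databases.

It then checks whether unique_term has a posting list; if it does not, the call follows the add path.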

Xapian::docid WritableDatabase::replace_document(const std::string & unique_term, const Document & document) {
    LOGCALL(API, Xapian::docid, "WritableDatabase::replace_document", unique_term | document);
    if (unique_term.empty()) {
        throw InvalidArgumentError("Empty termnames are invalid");
    }
    size_t n_dbs = internal.size();
    if (rare(n_dbs == 0)) {
        no_subdatabases();
    }
    if (n_dbs == 1) {
        RETURN(internal[0]->replace_document(unique_term, document));
    }
    Xapian::PostingIterator postit = postlist_begin(unique_term);
    // If no unique_term in the database, this is just an add_document().
    if (postit == postlist_end(unique_term)) {
        // Which database will the next never used docid be in?
        Xapian::docid did = get_lastdocid() + 1;
        if (rare(did == 0)) {
            throw Xapian::DatabaseError("Run out of docids - you'll have to use copydatabase to eliminate any gaps before you can add more documents");
        }
        size_t i = sub_db(did, n_dbs);
        RETURN(internal[i]->add_document(document));
    }
    Xapian::docid retval = *postit;
    size_t i = sub_db(retval, n_dbs);
    internal[i]->replace_document(sub_docid(retval, n_dbs), document);
    // Delete any other occurrences of unique_term.
    while (++postit != postlist_end(unique_term)) {
        Xapian::docid did = *postit;
        i = sub_db(did, n_dbs);
        internal[i]->delete_document(sub_docid(did, n_dbs));
    }
    return retval;
}
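The function above relies on sub_db and sub_docid to translate a combined docid into a sub-database index and a docid local to that sub-database, but their bodies are not shown in this post. The sketch below assumes the round-robin interleaving that Xapian documents for multi-database docids (combined docid 1 goes to the first sub-database, 2 to the second, and so on). The function names with the _sketch suffix are hypothetical and are not the library's own definitions.

```cpp
#include <cassert>
#include <cstddef>

using docid = unsigned;  // stand-in for Xapian::docid

// Assumption: combined docids are dealt out round-robin across sub-databases.
// Index of the sub-database that holds combined docid `did`.
inline std::size_t sub_db_sketch(docid did, std::size_t n_dbs) {
    assert(did > 0 && n_dbs > 0);
    return (did - 1) % n_dbs;
}

// Docid of the same document inside its own sub-database (1-based).
inline docid sub_docid_sketch(docid did, std::size_t n_dbs) {
    assert(did > 0 && n_dbs > 0);
    return static_cast<docid>((did - 1) / n_dbs + 1);
}

// Inverse mapping: rebuild the combined docid from (sub-database, local docid).
inline docid combined_docid_sketch(docid local, std::size_t shard,
                                   std::size_t n_dbs) {
    assert(local > 0 && shard < n_dbs);
    return static_cast<docid>((local - 1) * n_dbs + shard + 1);
}
```

Under this assumption, with three sub-databases the combined docids 1, 2, 3 map to local docid 1 in sub-databases 0, 1, 2, and combined docid 4 maps to local docid 2 in sub-database 0. This also explains why the add path computes get_lastdocid() + 1 first: the next unused combined docid decides which sub-database receives the new document.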

2.1 Adding a new document

Adding a document is split into make_doc and finish_add_doc, presumably so that finish_add_doc can be reused when a document is actually replaced.

Xapian::docid InMemoryDatabase::add_document(const Xapian::Document & document) {
    LOGCALL(DB, Xapian::docid, "InMemoryDatabase::add_document", document);
    if (closed) {
        InMemoryDatabase::throw_database_closed();
    }
    Xapian::docid did = make_doc(document.get_data());
    finish_add_doc(did, document);
    RETURN(did);
}

2.2 Implementation of make_doc

Xapian::docid InMemoryDatabase::make_doc(const string & docdata) {
    termlists.push_back(InMemoryDoc(true));
    doclengths.push_back(0);
    doclists.push_back(docdata);
    AssertEqParanoid(termlists.size(), doclengths.size());
    return termlists.size();
}

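2.3 Implementation of finish_add_doc

It first adds the values, then creates the terms and fills in the termlist and postlist structures.

The termlist is the list of terms of a document. It holds everything about each term: its name, how many times it occurs in the document, and the positions at which it occurs.

The postlist is the list of documents for a term. It holds the per-document information: the docid, the positions at which the term occurs in that doc, and how many times it occurs there.

In other words, the position information is stored twice: once in the termlist and once in the postlist.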
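The double storage described above can be illustrated with a minimal sketch. Everything here is hypothetical (TinyIndex, TermEntry, Posting and their fields are illustration names, not Xapian's types), and for simplicity it assumes wdf equals the number of positions, which matches calling add_posting with a wdfinc of 1 for each position.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One entry of a document's termlist: the term, its wdf, its positions.
struct TermEntry {
    std::string tname;
    unsigned wdf;
    std::vector<unsigned> positions;
};

// One entry of a term's postlist: the doc, the wdf, the positions.
struct Posting {
    unsigned did;
    unsigned wdf;
    std::vector<unsigned> positions;
};

// Hypothetical toy index keeping both views, so positions are stored twice.
class TinyIndex {
  public:
    // `terms` maps each term to its positions in the new document.
    // Returns the 1-based docid assigned to the document.
    unsigned add_document(
            const std::map<std::string, std::vector<unsigned>>& terms) {
        termlists.emplace_back();
        const unsigned did = static_cast<unsigned>(termlists.size());
        for (const auto& kv : terms) {
            // Assumption of this sketch: wdf == number of positions.
            const unsigned wdf = static_cast<unsigned>(kv.second.size());
            termlists[did - 1].push_back(TermEntry{kv.first, wdf, kv.second});
            postlists[kv.first].push_back(Posting{did, wdf, kv.second});
        }
        return did;
    }

    // Terms of one document (forward view).
    const std::vector<TermEntry>& termlist(unsigned did) const {
        return termlists.at(did - 1);
    }

    // Documents containing one term (inverted view).
    const std::vector<Posting>& postlist(const std::string& tname) const {
        return postlists.at(tname);
    }

  private:
    std::vector<std::vector<TermEntry>> termlists;       // indexed by did - 1
    std::map<std::string, std::vector<Posting>> postlists;
};
```

The termlist answers "which terms does this doc contain", which is what deleting or replacing a document needs, while the postlist answers "which docs contain this term", which is what searching needs. Keeping positions in both costs memory but lets each lookup be served from a single structure.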

void InMemoryDatabase::finish_add_doc(Xapian::docid did, const Xapian::Document &document) {
    {
        std::map<Xapian::valueno, string> values;
        Xapian::ValueIterator k = document.values_begin();
        for ( ; k != document.values_end(); ++k) {
            values.insert(make_pair(k.get_valueno(), *k));
            LOGLINE(DB, "InMemoryDatabase::finish_add_doc(): adding value " << k.get_valueno() << " -> " << *k);
        }
        add_values(did, values);
    }
    InMemoryDoc doc(true);
    Xapian::TermIterator i = document.termlist_begin();
    for ( ; i != document.termlist_end(); ++i) {
        make_term(*i);
        LOGLINE(DB, "InMemoryDatabase::finish_add_doc(): adding term " << *i);
        Xapian::PositionIterator j = i.positionlist_begin();
        if (j == i.positionlist_end()) {
            /* Make sure the posting exists, even without a position. */
            make_posting(&doc, *i, did, 0, i.get_wdf(), false);
        } else {
            positions_present = true;
            for ( ; j != i.positionlist_end(); ++j) {
                make_posting(&doc, *i, did, *j, i.get_wdf());
            }
        }
        Assert(did > 0 && did <= doclengths.size());
        doclengths[did - 1] += i.get_wdf();
        totlen += i.get_wdf();
        postlists[*i].collection_freq += i.get_wdf();
        ++postlists[*i].term_freq;
    }
    swap(termlists[did - 1], doc);
    totdocs++;
}

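The handling of position information looks suboptimal in this design: the positions were already sorted once while the doc was being populated, yet when they are added to the termlist and postlist, each position is processed individually all over again.

After a document is added to a DB, commit normally has to be called. The in-memory index is never written to disk, so InMemoryDatabase's commit is an empty function.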
