Sphinx (Coreseek), Part 1: Incremental Indexing

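First, an overview of what ships in the Coreseek/Sphinx distribution:

    indexer: builds full-text indexes.
    search: a simple command-line (CLI) test program for querying full-text indexes.
    searchd: a daemon through which other software (for example, web applications) performs full-text searches.
    sphinxapi: a set of searchd client API libraries for popular web development languages (PHP, Python, Perl, Ruby, Java).
    spelldump: a simple command-line tool that extracts entries from dictionaries in ispell or MySpell (the format bundled with OpenOffice) format; the entries can be used to customize an index when wordforms is in use.
    indextool: a utility that dumps assorted debugging information about an index. It was added in Coreseek 3.1 (Sphinx 0.9.9-rc2).
    mmseg: a utility and library that Coreseek uses for Chinese word segmentation and dictionary handling.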

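The earlier post, "Building a Chinese full-text search engine with Sphinx+LibMMSeg: installation and configuration", covered installing and trying out Sphinx, but several topics were left open:

Dynamic incremental indexing
Ranged queries
Real-time indexes
How matched fields are presented in the front-end interface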

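This post looks at dynamic incremental updates.
When Sphinx is used as a search engine, the indexing setup generally consists of the following parts:

A main index that stays fixed between rebuilds
A delta (incremental) index that is rebuilt frequently
Merging of the delta index data into the main index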

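In practice, building the delta index requires a helper table that records the ID of the last record covered by the main index, so that only the rows added since then go into the delta index.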

CREATE TABLE `sph_counter` (
  `counter_id` int(11) NOT NULL,
  `max_doc_id` int(11) NOT NULL,
  PRIMARY KEY (`counter_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Configure the data source of the main index to fetch its data as follows:

source src
{
        # data source type. mandatory, no default value
        # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
        type                    = mysql
        sql_host                = localhost
        sql_user                = root
        sql_pass                = xxxxx
        sql_db                  = test
        sql_port                = 3306  # optional, default is 3306
        sql_query_pre           = SET NAMES utf8
        sql_query_pre           = SET SESSION query_cache_type=OFF
        # record the current highest id as the boundary between main and delta
        sql_query_pre           = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM cn
        sql_query               = SELECT id,title,content FROM cn WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
        sql_query_info          = SELECT * FROM cn WHERE id=$id
}

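Configure the data source of the delta index as follows. Note that the delta source must declare its own sql_query_pre lines: once a child source defines sql_query_pre, the whole inherited list is dropped, so the encoding setup has to be repeated here. Leaving out the REPLACE INTO sph_counter statement is deliberate. If it ran while the delta was being indexed, the counter would move up to the current MAX(id) and the delta query would return nothing.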

# delta data source
source moresrc : src
{
        sql_query_pre           = SET NAMES utf8
        sql_query_pre           = SET SESSION query_cache_type=OFF
        sql_query               = SELECT id,title,content FROM cn WHERE id>(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
        #sql_ranged_throttle    =
}
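To make the boundary logic concrete, here is a minimal sketch that is not part of the original setup. It replays the main and delta queries against an in-memory SQLite database standing in for MySQL. The sample rows, the function name split_main_and_delta and the counter key 1 are assumptions made purely for illustration.

```python
import sqlite3

# Assumption: counter_id = 1 is an arbitrary key; any constant works as long
# as the pre-query and both sql_query statements agree on it.
MAIN_SQL = ("SELECT id FROM cn WHERE id <= "
            "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 1) ORDER BY id")
DELTA_SQL = ("SELECT id FROM cn WHERE id > "
             "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 1) ORDER BY id")
PRE_SQL = "REPLACE INTO sph_counter SELECT 1, MAX(id) FROM cn"


def insert_rows(db, ids):
    """Insert hypothetical documents with the given ids."""
    db.executemany(
        "INSERT INTO cn (id, title, content) VALUES (?, ?, ?)",
        [(i, f"title {i}", f"content {i}") for i in ids],
    )


def split_main_and_delta():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cn (id INTEGER PRIMARY KEY, title TEXT, content TEXT);
        CREATE TABLE sph_counter (
            counter_id INTEGER PRIMARY KEY,
            max_doc_id INTEGER NOT NULL
        );
    """)

    # Three documents exist when the main index is built.
    insert_rows(db, [1, 2, 3])
    db.execute(PRE_SQL)  # pre-query of the main source: remember MAX(id)

    # Two more documents arrive afterwards.
    insert_rows(db, [4, 5])

    main_ids = [row[0] for row in db.execute(MAIN_SQL)]
    delta_ids = [row[0] for row in db.execute(DELTA_SQL)]

    # What would happen if the delta source also ran the REPLACE pre-query:
    # the counter moves up to 5 and the delta query matches nothing.
    db.execute(PRE_SQL)
    delta_after_replace = [row[0] for row in db.execute(DELTA_SQL)]

    db.close()
    return main_ids, delta_ids, delta_after_replace


if __name__ == "__main__":
    print(split_main_and_delta())
```

The first two lists show how the counter row divides the table between the main and delta sources. The third list is empty, which illustrates why the delta source should not run the REPLACE statement.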

Main index definition:

index src
{
        source                  = src
        path                    = /usr/local/coreseek/var/data/test1
        docinfo                 = extern
        mlock                   = 0
        morphology              = none
        # Enables Chinese word segmentation. The source must deliver its data
        # as UTF-8, otherwise it cannot be processed correctly: for xml, emit
        # UTF-8 output; for MySQL, set the result character set to UTF-8.
        charset_type            = zh_cn.utf-8
        # location of the Chinese segmentation dictionary
        charset_dictpath        = /usr/local/mmseg/etc/
}

Delta index definition:

index moresrc : src
{
        source                  = moresrc
        path                    = /usr/local/coreseek/var/data/moresrc
        morphology              = stem_en
}

indexer
{
        # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
        # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
        mem_limit               = 32M
}

Building or rebuilding all indexes

root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.increment.conf  --all --rotate
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
 using config file '/usr/local/coreseek/etc/csft.increment.conf'...
indexing index 'src'...
WARNING: Attribute count is 0: switching to none docinfo
collected  docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total  docs,  bytes
total 0.011 sec,  bytes/sec, 531.82 docs/sec
indexing index 'moresrc'...
WARNING: Attribute count is 0: switching to none docinfo
collected  docs, 0.0 MB
total  docs,  bytes
total 0.001 sec,  bytes/sec, 0.00 docs/sec
total  reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total  writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=).

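Updating the delta index

Add some rows to the database, then run the command below. When updating the delta index, remember to pass -c with the configuration file to indexer and search, and do the same when starting searchd.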

root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# /usr/local/coreseek/bin/indexer moresrc -c /usr/local/coreseek/etc/csft.increment.conf --rotate
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
 using config file '/usr/local/coreseek/etc/csft.increment.conf'...
indexing index 'moresrc'...
WARNING: Attribute count is 0: switching to none docinfo
collected  docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total  docs,  bytes
total 0.008 sec,  bytes/sec, 115.24 docs/sec
total  reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total  writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=).

Merging the delta index into the main index

root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# /usr/local/coreseek/bin/indexer --merge src moresrc  -c /usr/local/coreseek/etc/csft.increment.conf --rotate
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
 using config file '/usr/local/coreseek/etc/csft.increment.conf'...
read 0.0 of 0.0 MB, 100.0% done
merged 0.0 Kwords
merged in 0.000 sec
total  reads, 0.000 sec, 6.4 kb/call avg, 0.0 msec/call avg
total  writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=).
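The three indexer invocations used so far (full rebuild, delta rebuild, merge) can be gathered into one small helper. This is only a sketch: the paths and index names are copied from the commands above, while the function names are hypothetical and the helper only builds the argument lists without running anything.

```python
import shlex

# Paths and index names taken from the commands shown in this post.
INDEXER = "/usr/local/coreseek/bin/indexer"
CONFIG = "/usr/local/coreseek/etc/csft.increment.conf"
MAIN_INDEX = "src"
DELTA_INDEX = "moresrc"


def indexer_command(*args):
    """Build an indexer argument list that always names the config file and
    asks the running searchd to rotate in the new index files."""
    return [INDEXER, "-c", CONFIG, *args, "--rotate"]


def full_rebuild():
    """Rebuild every index defined in the config file."""
    return indexer_command("--all")


def delta_rebuild():
    """Rebuild only the delta index."""
    return indexer_command(DELTA_INDEX)


def merge_delta():
    """Merge the delta index into the main index."""
    return indexer_command("--merge", MAIN_INDEX, DELTA_INDEX)


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in (full_rebuild(), delta_rebuild(), merge_delta()):
        print(shlex.join(command))
```

To execute a command for real, each list could be passed to subprocess.run(command, check=True), for example from a scheduled job that rebuilds the delta often and merges it less frequently. How often to run each step is a deployment decision that the original post does not specify.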

The test results are shown below:

root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# ./search -c /usr/local/coreseek/etc/csft.increment.conf  测试
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
 using config file '/usr/local/coreseek/etc/csft.increment.conf'...
index 'src': query '测试 ': returned  matches of  total in 0.000 sec
displaying matches:
. document=, weight=
    id=
    title=?????
    content=?? ?????? ????  ??
    addtime=
. document=, weight=
    id=
    title=??
    content=????????
    addtime=
. document=, weight=
    id=
    title=???
    content=????????    ?????? ????
    addtime=
words:
. '测试':  documents,  hits
index 'moresrc': query '测试 ': returned  matches of  total in 0.000 sec
displaying matches:
. document=, weight=
    id=
    title=??
    content=????????
    addtime=
words:
. '测试':  documents,  hits

The test data shows that the delta index moresrc matched the keyword "测试" ("test") 5 times. After the merge, the main index has 14 matches, up from 9 before.
