Coreseek Installation Guide

Sphinx 0.9.9 manual (Chinese): http://www.coreseek.cn/docs/coreseek_3.2-sphinx_0.9.9.html

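1. Introduction

Coreseek is a Chinese full-text search engine built on top of Sphinx. More people have probably heard of Apache Lucene; I won't go into a comparison of the two here.

For a comparison of Lucene and Sphinx, see: http://sg552.iteye.com/blog/1560834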

2. Download

Official download page: http://www.coreseek.cn/products/ft_down/

wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.13.tar.gz

After extraction there are three folders (plus a README):

coreseek-3.2.13/

├── csft-3.2.13       # main program

├── mmseg-3.2.13      # Chinese word segmentation package

├── README.txt

└── testpack          # test data

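3. Installing mmseg

Running ./configure directly usually fails with: config.status: error: cannot find input file: src/Makefile.in

When that happens, regenerate the build files first. (aclocal ships with the automake package, so it does not need to be installed separately.)

# sudo apt-get install libtool automake autoconf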

aclocal

libtoolize --force

automake --add-missing

autoconf

autoheader

make clean
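
The regeneration steps above are only needed when src/Makefile.in is missing. A minimal sketch that wraps them in a guard, assuming a POSIX shell; the function names needs_regen and regen_build_files and the example path are illustrative and not part of Coreseek:

```shell
#!/bin/sh
# Succeed (exit status 0) when src/Makefile.in is missing, i.e. when the
# autotools output has to be regenerated before ./configure can work.
needs_regen() {
    # $1: path to the mmseg source directory
    [ ! -f "$1/src/Makefile.in" ]
}

# Run the regeneration commands from this guide inside the given directory.
# The subshell keeps the caller's working directory unchanged.
regen_build_files() {
    (
        cd "$1" &&
        aclocal &&
        libtoolize --force &&
        automake --add-missing &&
        autoconf &&
        autoheader
    )
}

# Example (hypothetical path to the extracted sources):
# src_dir=coreseek-3.2.13/mmseg-3.2.13
# if needs_regen "$src_dir"; then regen_build_files "$src_dir"; fi
```

With the guard in place, the commands can be re-run safely on a tree that already has its Makefile.in.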

Then the usual three steps:

mkdir -p /usr/local/mmseg

./configure --prefix=/usr/local/mmseg

make && make install

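4. Installing coreseek

During make you will probably hit a declaration error for ExprEval in sphinxexpr.cpp; the original bugfix post does not explain the cause clearly. According to widespread reports online, gcc 4.7 almost always triggers it, some people report that 4.8 fails as well, and builds on Mac OS reportedly succeed more often.

The fix provided by the bugfix post: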

wget http://vifix.cn/blog/wp-content/uploads/2012/04/sphinxexpr.cpp_.patch_.zip

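Unzipping it yields two patch files; pick the one that matches your Coreseek version. (Look up how to use patch if you are unfamiliar with it.)

Open the chosen patch in an editor, change the paths on the first two lines to src/sphinxexpr.cpp, and put the patch in the csft-3.2.13 directory. (Skip this paragraph if you already know how to use patch.)

For example, for my 3.2.13 build:

patch -p0 < sphinxexpr.cpp-csft-3.2.13.patch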
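
The manual edit of the patch header described above can also be done with sed. This is a rough sketch, assuming the first two lines of the file are the usual unified-diff header lines starting with "--- " and "+++ "; fix_patch_paths and the file names in the example are made up for illustration:

```shell
#!/bin/sh
# Print a copy of a unified diff with the file paths on its first two
# header lines ("--- ..." and "+++ ...") replaced by src/sphinxexpr.cpp.
# Anything after the path (such as a timestamp) is kept as it is.
fix_patch_paths() {
    # $1: path to the original patch file; the result goes to stdout
    sed -e '1s|^--- [^[:space:]]*|--- src/sphinxexpr.cpp|' \
        -e '2s|^+++ [^[:space:]]*|+++ src/sphinxexpr.cpp|' "$1"
}

# Example (file names are hypothetical):
# fix_patch_paths downloaded.patch > fixed.patch
# patch -p0 < fixed.patch
```

Writing the result to a new file keeps the downloaded patch untouched in case the header looks different from what is assumed here.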

Then it is the same three steps again. configure has many options; the main ones are --with-python, --with-mysql, and --with-mmseg. Choose according to your needs.

mkdir -p /usr/local/coreseek

./configure --prefix=/usr/local/coreseek --with-mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/ --with-mysql

make && make install

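That completes the installation.

5. Configuration

Coreseek consists of two main programs:

indexer: builds the indexes

searchd: the search daemon (the trailing d stands for daemon)

By default Coreseek reads its configuration from etc/csft.conf under the installation directory; in the example above that is /usr/local/coreseek/etc/csft.conf.

Initially etc contains only sphinx.conf.dist, sphinx-min.conf.dist, and an example.sql for testing.

sphinx-min.conf.dist contains the minimum set of options needed to run Sphinx, but we usually want more custom options than that.

So:

1. Copy sphinx.conf.dist to csft.conf (any other name works too, but then you must pass the configuration file explicitly with the -c option on every run):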

cp sphinx.conf.dist csft.conf

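2. The configuration file consists of four main parts: source, index, indexer, and searchd.

source: the data source. The main types are MySQL, PostgreSQL (type pgsql), xmlpipe, and xmlpipe2. For the database types you also need to configure the connection options and the SQL query.
index: the index. Mostly options used when building the index, such as encoding and filtering.
indexer and searchd: options for the corresponding programs; the defaults usually work unchanged.

First, the source configuration: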

source xiaotao_src

{

    type                    = mysql

    #####################################################################

    ## SQL settings (for 'mysql' and 'pgsql' types)

    #####################################################################

    # some straightforward parameters for SQL source types

    sql_host                = localhost

    sql_user                = root

    sql_pass                = root

    sql_db                  = test

    sql_port                = 3306    # optional, default is 3306

    # UNIX socket name

    # optional, default is empty (reuse client library defaults)

    # usually '/var/lib/mysql/mysql.sock' on Linux

    # usually '/tmp/mysql.sock' on FreeBSD

    #

    sql_sock                = /tmp/mysql.sock

    # pre-query, executed before the main fetch query

    # multi-value, optional, default is empty list of queries

    #

    sql_query_pre              = SET NAMES utf8

    sql_query_pre              = REPLACE INTO xt_index_counter SELECT 1, MAX(pub_id) FROM xt_pub    # used for incremental indexing

    # sql_query_pre            = SET SESSION query_cache_type=OFF

    # main document fetch query

    # mandatory, integer document ID field MUST be the first selected column

    sql_query                = \

        SELECT pub_id, title, content, saler, ctime \

        FROM xt_pub \

        WHERE pub_id <= ( \

            SELECT max_id \

            FROM xt_index_counter \

            WHERE index_id = 1 \

        )

    #sql_attr_str2ordinal        = title

    #sql_attr_str2ordinal        = content

    #sql_attr_str2ordinal        = saler

    sql_attr_timestamp           = ctime

}

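The above is a MySQL data source configuration. Most options are described in detail in the manual; a few points to note:

The name after source can be anything you choose, but it must match the source setting in the index section.
For Chinese search, keep everything in UTF-8 from end to end, and be sure to add sql_query_pre = SET NAMES utf8.
The sql_attr_* options declare attributes; declaring a column as an attribute does not mean the search result returns that column's original value. TEXT columns cannot be used as attributes, and a string column declared as an attribute (sql_attr_str2ordinal) comes back as an integer representing its sort order.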
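
The first note above (source names must match between the source and index sections) can be checked mechanically. The following is a rough sketch, not a Coreseek tool: check_sources is a hypothetical helper, and it assumes one declaration per line with whitespace around "=" as in the examples in this guide:

```shell
#!/bin/sh
# Report index "source = NAME" references that have no matching
# "source NAME" block in a csft.conf-style file.
# Returns 0 when every reference is defined, 1 otherwise.
check_sources() {
    conf=$1
    # Block headers look like "source NAME" or "source NAME : PARENT".
    defined=$(awk '$1 == "source" && $2 != "=" { sub(/:.*/, "", $2); print $2 }' "$conf")
    # References inside index blocks look like "source = NAME".
    used=$(awk '$1 == "source" && $2 == "=" { print $3 }' "$conf")
    status=0
    for name in $used; do
        if ! printf '%s\n' "$defined" | grep -Fqx -- "$name"; then
            echo "undefined source: $name"
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_sources /usr/local/coreseek/etc/csft.conf
```

Comment lines are skipped automatically because their first field is "#", not "source".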

The index configuration is as follows:

index xiaotao_item_index

{

    # document source(s) to index

    # multi-value, mandatory

    # document IDs must be globally unique across all sources

    source            = xiaotao_src

    # index files path and file name, without extension

    # mandatory, path must be writable, extensions will be auto-appended

    path            = /usr/local/coreseek/var/data/xiaotao_item_index

    # document attribute values (docinfo) storage mode

    # optional, default is 'extern'

    # known values are 'none', 'extern' and 'inline'

    docinfo            = extern

    # memory locking for cached data (.spa and .spi), to prevent swapping

    # optional, default is 0 (do not mlock)

    # requires searchd to be run from root

    mlock            = 0

    morphology        = none

    # stopword files list (space separated)

    # optional, default is empty

    # contents are plain text, charset_table and stemming are both applied

    #

    stopwords            = /usr/local/mmseg/etc/stopword.txt

    # wordforms file, in "mapfrom > mapto" plain text format

    # optional, default is empty

    #

    wordforms            = 

    # tokenizing exceptions file

    # optional, default is empty

    #

    # plain text, case sensitive, space insensitive in map-from part

    # one "Map Several Words => ToASingleOne" entry per line

    #

    exceptions        = 

    # minimum indexed word length

    # default is 1 (index everything)

    min_word_len        = 1

    # charset encoding type

    # optional, default is 'sbcs'

    # known types are 'sbcs' (Single Byte CharSet) and 'utf-8'

    charset_type        = zh_cn.utf-8

    charset_dictpath        = /usr/local/mmseg/etc

    # n-gram length to index, for CJK indexing

    # only supports 0 and 1 for now, other lengths to be implemented

    # optional, default is 0 (disable n-grams)

    #

    ngram_len                = 0

    # n-gram characters list, for CJK indexing

    # optional, default is empty

    #

    ngram_chars            = U+3000..U+2FA1F

    html_strip                = 0

}

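The above is an index configuration for Chinese search. The main points:

Chinese search requires a segmentation dictionary directory, charset_dictpath; /usr/local/mmseg/etc already contains a uni.lib file. You can also generate your own: create a text file in the same format as mmseg-3.2.13/data/unigram.txt in the mmseg source directory, make sure it is UTF-8 encoded, run mmseg -u unigram.txt, then rename the generated file to uni.lib and use it to replace /usr/local/mmseg/etc/uni.lib.
For Chinese search, set:
```
morphology = none
ngram_len = 0
```
Some tutorials recommend setting charset_table; if indexing fails with it, try commenting it out.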
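
Since a custom dictionary source has to be UTF-8, it can help to verify the encoding before running mmseg -u. A small sketch, assuming the iconv utility is installed; is_utf8 is a made-up helper name:

```shell
#!/bin/sh
# Succeed only if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_utf8() {
    # $1: path to the text file to check
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example (unigram.txt is the dictionary source file mentioned above):
# is_utf8 unigram.txt && mmseg -u unigram.txt
```

This only checks that the bytes are valid UTF-8; it does not validate the dictionary format itself.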

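6. Incremental indexing

First, the concepts of full and incremental index updates.

Full index update: rebuild the index from all existing data.

Incremental index update: index only the data added during a recent period.

Why incremental? With large data volumes a full rebuild takes a long time, so it is impractical to reindex every time new data arrives; high-traffic sites typically run a full rebuild once a day. But we still want data added during the day to become searchable quickly, hence incremental indexing: the newly added data is indexed frequently into a small delta index, which is merged into the main index on a schedule.

So the configuration file needs the following additions: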

source delta_src : xiaotao_src

{

    sql_query_pre                = SET NAMES utf8

    sql_query_pre               =

    sql_query                    = \

        SELECT pub_id, title, content, saler, ctime \

        FROM xt_pub \

        WHERE pub_id > ( \

            SELECT max_id \

            FROM xt_index_counter \

            WHERE index_id = 1 \

            )

}

index delta : xiaotao_item_index

{

    source          = delta_src

    path            = /usr/local/coreseek/var/data/delta

}

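As shown above, Sphinx configuration supports an inheritance-like relationship: a child section inherits the settings of its parent.
In the child source I kept as many sql_query_pre and sql_query_post entries as the parent has, and I would guess other multi-value options behave the same way. My reasoning is that if the child lists fewer entries than the parent, the config parser cannot tell which one you meant to override. (According to the Sphinx manual, declaring sql_query_pre in a child source discards all inherited values, which is why SET NAMES utf8 is repeated in delta_src while the REPLACE query is not.)
Note the WHERE conditions in the sql_query of the main source and the delta source.
You also need to create a table that records the id boundary between the main index and the delta index; in my example it is xt_index_counter.
```
CREATE TABLE xt_index_counter
(
    index_id INTEGER PRIMARY KEY NOT NULL,
    max_id INTEGER NOT NULL
);
```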

Next, write a full-rebuild script and an incremental script:

build_main_index.sh:

/usr/local/coreseek/bin/searchd --stop

/usr/local/coreseek/bin/indexer xiaotao_item_index

/usr/local/coreseek/bin/searchd &

build_delta_index.sh:

/usr/local/coreseek/bin/searchd --stop

/usr/local/coreseek/bin/indexer delta

/usr/local/coreseek/bin/indexer --merge xiaotao_item_index delta

/usr/local/coreseek/bin/searchd &

Schedule them with crontab:

crontab -e

*/ * * * * /usr/local/coreseek/build_delta_index.sh >> /usr/local/coreseek/var/log/build_delta_index.log

  * * * /usr/local/coreseek/build_main_index.sh >> /usr/local/coreseek/var/log/build_main_index.log

That is essentially it. One small flaw is that searchd is stopped on every index update; the --rotate option avoids this:

/usr/local/coreseek/bin/indexer xiaotao_item_index --rotate
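
Applied to both scripts from section 6, the --rotate variant might look like the sketch below. The index names come from this guide; the CSFT_BIN variable and the function names are assumptions added for illustration, and the ordering of the delta steps should be verified on your own setup:

```shell
#!/bin/sh
# Directory holding indexer; CSFT_BIN is a hypothetical override for
# installs that use a different prefix than the one in this guide.
CSFT_BIN=${CSFT_BIN:-/usr/local/coreseek/bin}

# Rebuild the delta index and merge it into the main index while searchd
# keeps running; --rotate asks searchd to pick up the new index files.
# Note: searchd rotates files asynchronously, so the merge may still read
# the previous delta files; a pause between the steps may be needed.
rebuild_delta() {
    "$CSFT_BIN/indexer" delta --rotate &&
    "$CSFT_BIN/indexer" --merge xiaotao_item_index delta --rotate
}

# Rebuild the main index, again without stopping searchd.
rebuild_main() {
    "$CSFT_BIN/indexer" xiaotao_item_index --rotate
}

# Example: call rebuild_delta from build_delta_index.sh and
# rebuild_main from build_main_index.sh.
```

Because neither function stops searchd, searches keep being served from the old index files until the new ones are rotated in.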

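7. Common problems

So far I have only run into two (I haven't bothered to reproduce them, so a description will have to do):

1. The .spa file has size 0: index building failed. Check your configuration, for example a sql_attr option pointing at a TEXT column, or a child section overriding an option with a different number of entries than its parent.

2. The .spl file is locked: searchd is still running. Either run bin/searchd --stop before updating, or use the --rotate switch.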
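
The first symptom can be detected right after an indexer run. A minimal sketch; spa_ok is a hypothetical helper, and its argument is the index path without extension as set by path in the index section:

```shell
#!/bin/sh
# Succeed if the index has a non-empty .spa (attribute) file.
# A missing or zero-size file is treated as a failed build.
spa_ok() {
    # $1: index path without extension
    [ -s "$1.spa" ]
}

# Example using the path from the index configuration in this guide:
# spa_ok /usr/local/coreseek/var/data/xiaotao_item_index || echo "index build probably failed"
```

The check relies on the index declaring at least one attribute, as the example configuration does with sql_attr_timestamp.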
