Building full-text search over tens of millions of rows with PHP + SCWS Chinese word segmentation + Sphinx + MySQL

2024-10-27 09:54:47

Reposted from: http://blog.csdn.net/nuli888/article/details/51892776

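Sphinx is a full-text search engine developed by Andrew Aksyonoff of Russia. It aims to give other applications fast, compact, and highly relevant full-text search. Sphinx integrates easily with SQL databases and scripting languages. It has built-in data source support for MySQL and PostgreSQL, and it can also read XML data in a specific format from standard input.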

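Sphinx indexing speed: indexing 1 million records takes only 3 to 4 minutes, indexing 10 million records finishes within 50 minutes, and a delta index holding only the newest 100,000 records can be rebuilt in a few dozen seconds.
Sphinx's features are:
a) Fast indexing (peak performance of up to 10 MB/s on modern CPUs);
b) Fast searching (average response time under 0.1 s on 2 to 4 GB of text);
c) Scales to large data sets (known to handle more than 100 GB of text, and 100 M documents on a single-CPU system);
d) Good relevance ranking that combines phrase proximity with statistical (BM25) ranking;
e) Distributed search;
f) Phrase search;
g) Document excerpt (snippet) generation;
h) Can serve searches as a MySQL storage engine;
i) Multiple query modes, including boolean, phrase, and word proximity;
j) Multiple full-text fields per document (up to 32);
k) Multiple additional attributes per document (for example, group information and timestamps);
l) Word breaking (tokenization) support.

MySQL's MyISAM engine does offer full-text indexes, but their performance is poor.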

Getting started

Environment: CentOS 6.5 + PHP 5.6 + Apache + MySQL

1. Install dependencies

yum -y install make gcc gcc-c++ libtool autoconf automake imake php-devel mysql-devel libxml2-devel expat-devel

2. Install Sphinx

yum install expat expat-devel
wget -c http://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz
tar zxvf sphinx-2.0.7-release.tar.gz
cd sphinx-2.0.7-release
./configure --prefix=/usr/local/sphinx --with-mysql --with-libexpat --enable-id64
make && make install

3. Install libsphinxclient, which the PHP extension depends on

cd api/libsphinxclient
./configure --prefix=/usr/local/sphinx/libsphinxclient
make && make install

4. Install the Sphinx PHP extension. I run PHP 5.6, which needs sphinx-1.3.3.tgz; for PHP 5.4 and below, sphinx-1.3.0.tgz can be used.

wget -c http://pecl.php.net/get/sphinx-1.3.3.tgz
tar zxvf sphinx-1.3.3.tgz
cd sphinx-1.3.3
phpize
./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/bin/php-config
make && make install
# On success, make install prints:
#   Installing shared extensions: /usr/lib64/php/modules/
echo "[Sphinx]" >> /etc/php.ini
echo "extension = sphinx.so" >> /etc/php.ini
# Restart Apache
service httpd restart

5. Create test data

CREATE TABLE IF NOT EXISTS `items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`content` text NOT NULL,
`created` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Test table for full-text search' AUTO_INCREMENT=11 ;
INSERT INTO `items` (`id`, `title`, `content`, `created`) VALUES
(1, 'linux mysql集群安装', 'MySQL Cluster 是MySQL 适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本', '2016-09-07 00:00:00'),
(2, 'mysql主从复制', 'mysql主从备份(复制)的基本原理 mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器', '2016-09-06 00:00:00'),
(3, 'hello', 'can you search me?', '2016-09-05 00:00:00'),
(4, 'mysql', 'mysql is the best database?', '2016-09-03 00:00:00'),
(5, 'mysql索引', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '2016-09-01 00:00:00'),
(6, '集群', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '0000-00-00 00:00:00'),
(9, '复制原理', 'redis也有复制', '0000-00-00 00:00:00'),
(10, 'redis集群', '集群技术是构建高性能网站架构的重要手段，试想在网站承受高并发访问压力的同时，还需要从海量数据中查询出满足条件的数据，并快速响应，我们必然想到的是将数据进行切片，把数据根据某种规则放入多个不同的服务器节点，来降低单节点服务器的压力', '0000-00-00 00:00:00');
CREATE TABLE IF NOT EXISTS `sph_counter` (
`counter_id` int(11) NOT NULL,
`max_doc_id` int(11) NOT NULL,
PRIMARY KEY (`counter_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='Counter table that marks the delta index boundary';

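The setup below uses the "Main + Delta" (main index + incremental index) strategy, together with Sphinx's built-in unigram (single-character) tokenization.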
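
To make the Main + Delta split concrete, here is a minimal Python sketch (not part of the original setup) of how rows are divided between the two sources configured in step 6: the main source takes ids up to the max_doc_id stored in sph_counter, and the delta source takes everything above it. The function name split_main_delta and the ids 11 and 12 are hypothetical, added only for illustration.

```python
def split_main_delta(doc_ids, max_doc_id):
    """Partition ids the way the WHERE clauses of the two sources do."""
    main = [i for i in doc_ids if i <= max_doc_id]    # source items
    delta = [i for i in doc_ids if i > max_doc_id]    # source items_delta
    return main, delta


# ids 1..10 mirror the test table; 11 and 12 are hypothetical rows added later
ids = [1, 2, 3, 4, 5, 6, 9, 10, 11, 12]
main, delta = split_main_delta(ids, max_doc_id=10)
print("main :", main)
print("delta:", delta)
```

Because the delta only ever holds rows newer than the counter, rebuilding it stays cheap no matter how large the main index grows.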

6. Configure Sphinx (change the data source settings to match your environment)

vi /usr/local/sphinx/etc/sphinx.conf
source items {
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = 123456
    sql_db = sphinx_items
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type = OFF
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items
    sql_query_range = SELECT MIN(id), MAX(id) FROM items \
        WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
    sql_range_step = 1000
    sql_ranged_throttle = 1000
    sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
        WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) \
        AND id >= $start AND id <= $end
    sql_attr_timestamp = created
    sql_attr_bool = deleted
}
source items_delta : items {
    sql_query_pre = SET NAMES utf8
    sql_query_range = SELECT MIN(id), MAX(id) FROM items \
        WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
    sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) \
        AND id >= $start AND id <= $end
    sql_query_post_index = set @max_doc_id :=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
    sql_query_post_index = REPLACE INTO sph_counter SELECT 2, IF($maxid, $maxid, @max_doc_id)
}
# Main index
index items {
    source = items
    path = /usr/local/sphinx/var/data/items
    docinfo = extern
    morphology = none
    min_word_len = 1
    min_prefix_len = 0
    html_strip = 1
    html_remove_elements = style, script
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
    charset_type = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
    preopen = 1
    min_infix_len = 1
}
# Delta (incremental) index
index items_delta : items {
    source = items_delta
    path = /usr/local/sphinx/var/data/items-delta
}
# Distributed index
index master {
    type = distributed
    local = items
    local = items_delta
}
indexer {
    mem_limit = 256M
}
searchd {
    listen = 9312
    listen = 9306:mysql41 #Used for SphinxQL
    log = /usr/local/sphinx/var/log/searchd.log
    query_log = /usr/local/sphinx/var/log/query.log
    compat_sphinxql_magics = 0
    attr_flush_period = 600
    mva_updates_pool = 16M
    read_timeout = 5
    max_children = 0
    dist_threads = 2
    pid_file = /usr/local/sphinx/var/log/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
    binlog_path = /usr/local/sphinx/var/data
}

Save and exit.
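
The ranged-query settings in the configuration (sql_query_range, sql_range_step, and the $start / $end macros) can be hard to picture. The following Python sketch is an illustration only, not Sphinx code: it shows how an id range is walked in inclusive windows of sql_range_step rows, which is how the Sphinx documentation describes ranged queries. The function name ranged_windows and the id range 1..2345 are assumptions made for the example.

```python
def ranged_windows(min_id, max_id, step):
    """Yield inclusive (start, end) pairs that cover min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


# With sql_range_step = 1000 and ids 1..2345, sql_query runs three times
for start, end in ranged_windows(1, 2345, 1000):
    print(f"... AND id >= {start} AND id <= {end}")
```

Fetching in windows keeps each SELECT short, and sql_ranged_throttle (in milliseconds) adds a pause between windows so indexing does not starve the database.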

7. Build the Sphinx indexes

# The first run must build all indexes:
[root@localhost bin]# ./indexer -c /usr/local/sphinx/etc/sphinx.conf --all
Sphinx 2.0.7-id64-release (r3759)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
indexing index 'items'...
collected 8 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 8 docs, 1121 bytes
total 1.017 sec, 1101 bytes/sec, 7.86 docs/sec
indexing index 'items_delta'...
collected 0 docs, 0.0 MB
total 0 docs, 0 bytes
total 1.007 sec, 0 bytes/sec, 0.00 docs/sec
skipping non-plain index 'master'...
total 4 reads, 0.000 sec, 0.7 kb/call avg, 0.0 msec/call avg
total 14 writes, 0.001 sec, 0.5 kb/call avg, 0.1 msec/call avg
# Start Sphinx (searchd)
[root@localhost bin]# ./searchd -c /usr/local/sphinx/etc/sphinx.conf
Sphinx 2.0.7-id64-release (r3759)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
listening on all interfaces, port=9312
listening on all interfaces, port=9306
precaching index 'items'
precaching index 'items_delta'
rotating index 'items_delta': success
precached 2 indexes in 0.012 sec
# Check the process
[root@localhost bin]# ps -ef | grep searchd
root 30431 1 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf
root 30432 30431 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf
root 30437 1490 0 23:59 pts/0 00:00:00 grep searchd
# Stop searchd:
./searchd -c /usr/local/sphinx/etc/sphinx.conf --stop
# Check searchd status:
./searchd -c /usr/local/sphinx/etc/sphinx.conf --status

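Index updates and usage notes
The delta index is rebuilt every N minutes. Usually, once a night when load is low, the delta index is merged into the main index and then rebuilt. If the main index does not hold much data, you can simply rebuild the main index instead.
When searching through the API, query both the main index and the delta index to get near-real-time results. The configuration in this article places both of them in the distributed index master, so querying master alone returns every match, including the newest data.

Index updates and merges can be run as cron jobs: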

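crontab -e
# Rebuild the delta index every minute
*/1 * * * * /usr/local/sphinx/shell/delta_index_update.sh
# Merge the delta index into the main index every day at 03:00
0 3 * * * /usr/local/sphinx/shell/merge_daily_index.sh
crontab -l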

Example shell scripts for the cron jobs:

delta_index_update.sh:

#!/bin/bash
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1

merge_daily_index.sh:

#!/bin/bash
# Use an absolute path for indexer: cron's PATH usually lacks /usr/local/sphinx/bin
indexer=/usr/local/sphinx/bin/indexer
mysql=$(which mysql)

# Highest document id recorded by the last delta build (counter 2)
QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;"
index_counter=$($mysql -h192.168.1.198 -uroot -p123456 -sN -e "$QUERY")

# Merge "main + delta" indexes
if $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1; then
    # Update the main-index counter (counter 1)
    if [ -n "$index_counter" ]; then
        $mysql -h192.168.1.198 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"
    fi
    # Rebuild the delta index so it does not overlap the main index
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1
fi
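
The bookkeeping in sph_counter is easy to lose track of, so here is a small Python simulation of it. This is a sketch under stated assumptions, not code from the original setup: a dict stands in for the sph_counter table, and the function names after_delta_build and after_merge are hypothetical. Counter 1 is the main index boundary and counter 2 is what the delta build's sql_query_post_index records.

```python
def after_delta_build(counters, delta_ids):
    """Mirror sql_query_post_index: counter 2 becomes the highest id the
    delta indexed, or falls back to counter 1 when the delta was empty."""
    counters[2] = max(delta_ids) if delta_ids else counters[1]


def after_merge(counters):
    """Mirror merge_daily_index.sh: after a successful merge, promote
    counter 2 to counter 1 so the main index boundary moves forward."""
    if counters.get(2):
        counters[1] = counters[2]


counters = {1: 10}                      # main index covers ids <= 10
after_delta_build(counters, [11, 12])   # hypothetical new rows 11 and 12
after_merge(counters)                   # nightly merge moves the boundary
after_delta_build(counters, [])         # the rebuilt delta is now empty
print(counters)
```

After the merge, both counters agree, so the rebuilt delta starts empty and the same rows are not indexed twice.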

8. Install SCWS, the Chinese word segmenter for PHP (make sure the extension version matches your PHP version)

wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
tar jxvf scws-1.2.3.tar.bz2
cd scws-1.2.3
./configure --prefix=/usr/local/scws
make && make install

9. Install the SCWS PHP extension:

cd ./phpext
phpize
./configure --with-scws=/usr/local/scws
make && make install
echo "[scws]" >> /etc/php.ini
echo "extension = scws.so" >> /etc/php.ini
echo "scws.default.charset = utf-8" >> /etc/php.ini
echo "scws.default.fpath = /usr/local/scws/etc/" >> /etc/php.ini
# Restart Apache so the new extension is loaded
service httpd restart

10. Install the dictionary:

wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
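# Let the web server user read the dictionary (www here; a stock CentOS Apache usually runs as apache)
chown www:www /usr/local/scws/etc/dict.utf8.xdb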
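
For readers new to word segmentation, the toy Python sketch below shows the general idea of dictionary-based segmentation using forward maximum matching. It is deliberately simplistic and is not SCWS's actual algorithm, which also uses word frequencies; the function name forward_max_match and the tiny dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy segmentation: at each position take the longest dictionary
    word, falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_dict = {"mysql", "集群", "主从", "复制"}   # hypothetical mini dictionary
print(forward_max_match("mysql集群", toy_dict))
print(forward_max_match("主从复制", toy_dict))
```

Compared with Sphinx's unigram tokenization, which indexes every Chinese character separately, a segmenter keeps multi-character words such as 集群 together, which is why the query is segmented with SCWS before it is sent to Sphinx.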

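11. Test example: using Sphinx + SCWS from PHP
The Sphinx source tree ships API clients for several languages, one of which is sphinxapi.php.
The test below, however, uses the Sphinx PHP extension; see the Sphinx installation steps earlier in this article.
The test relies on a search class, Search.php, whose listing is not reproduced here; change the connection details in its getDBConnection() to your own.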

Test file test.php:

<?php
require('Search.php');

$s = new Search([
    'snippet_fields' => ['title', 'content'],
    'field_weights' => ['title' => 20, 'content' => 10],
]);
$s->setSortMode(SPH_SORT_EXTENDED, 'created desc,@weight desc');
//$s->setSortBy('created desc,@weight desc');

// Segment the query first; the result here is: (mysql)|(mysql集群)
$words = $s->wordSplit("mysql集群");
//print_r($words);exit;

// Query the distributed index "master"
$res = $s->query($words, 0, 10, 'master');
echo '<pre>';
print_r($res);
echo '</pre>';
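
The comment in test.php shows that wordSplit() returns an extended-syntax query such as (mysql)|(mysql集群). Since the Search class itself is not listed, the Python sketch below only illustrates one plausible way to assemble such a string from already-segmented terms; it is an assumption, not the author's implementation. The names escape_term and build_or_query are hypothetical, and the escaped character set is a conservative guess at the operators used by Sphinx's extended query syntax.

```python
import re

# Characters treated as operators in extended query syntax (conservative list)
_SPECIAL = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_term(term):
    """Backslash-escape operator characters inside a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def build_or_query(terms):
    """Wrap each non-empty term in parentheses and OR the terms together."""
    return "|".join(f"({escape_term(t)})" for t in terms if t)


print(build_or_query(["mysql", "mysql集群"]))
```

Escaping matters because user input containing characters such as - or | would otherwise be parsed as query operators.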

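12. SphinxQL test

To use SphinxQL, searchd must listen on an additional port that speaks the MySQL protocol (see listen = 9306:mysql41 in the configuration above).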

[root@localhost bin]# mysql -h127.0.0.1 -P9306 -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.7-id64-release (r3759)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show global variables;
+----------------------+---------+
| Variable_name | Value |
+----------------------+---------+
| autocommit | 1 |
| collation_connection | libc_ci |
| query_log_format | plain |
| log_level | info |
+----------------------+---------+
4 rows in set (0.00 sec)
mysql> desc items;
+---------+-----------+
| Field | Type |
+---------+-----------+
| id | bigint |
| title | field |
| content | field |
| created | timestamp |
| deleted | bool |
+---------+-----------+
5 rows in set (0.00 sec)
mysql> select * from master where match ('mysql集群') limit 10;
+------+---------+---------+
| id | created | deleted |
+------+---------+---------+
| 1 | 2016 | 0 |
| 6 | 0 | 0 |
+------+---------+---------+
2 rows in set (0.00 sec)
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.006 |
| keyword[0] | mysql |
| docs[0] | 5 |
| hits[0] | 15 |
| keyword[1] | 集 |
| docs[1] | 3 |
| hits[1] | 4 |
| keyword[2] | 群 |
| docs[2] | 3 |
| hits[2] | 4 |
+---------------+-------+
12 rows in set (0.00 sec)
mysql>
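
The SHOW META output above lists the keywords mysql, 集, and 群: with ngram_len = 1, each character in the ngram_chars range is indexed as its own token, while Latin letters and digits are grouped into lowercase words. The Python sketch below imitates that behavior as a rough illustration only. It is not Sphinx's tokenizer, it ignores the Cyrillic part of charset_table, and the name unigram_tokens is hypothetical.

```python
import re

# ASCII words (lowercased first), or one character from U+3000..U+2FA1F
_TOKEN = re.compile(r"[0-9a-z_]+|[\u3000-\U0002FA1F]")


def unigram_tokens(text):
    """Split text into lowercase ASCII words and single CJK characters."""
    return _TOKEN.findall(text.lower())


print(unigram_tokens("mysql集群"))
```

This also explains why the unsegmented query still matches documents 1 and 6: Sphinx matches on the individual characters, whereas SCWS segmentation in the PHP example controls which multi-character words are searched.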
