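I recently set up full-text search with Sphinx and am writing the steps down here for future reference.
Reference manual: http://www.coreseek.cn/docs/coreseek_3.2-sphinx_0.9.9.html
This setup uses:
Software: coreseek  Server: Linux  Language: PHP  Database: MySQL
1. Set up the coreseek service on the server:
Switch to the root user so that you have full permissions to install software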
$ su root
Installation steps:
Reference: http://www.coreseek.cn/products-install/install_on_bsd_linux/
(1). Download coreseek 4.1:
wget -c http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
(2). Extract the coreseek archive:
tar xzvf coreseek-4.1-beta.tar.gz
(3). Install mmseg, the tokenizer developed by coreseek, which provides Chinese word segmentation
①. Enter the directory:
cd coreseek-4.1-beta
cd mmseg-3.2.14
[root@localhost mmseg-3.2.14]# ./bootstrap
Output:
+ aclocal -I config
config/sys_siglist.m4:20: warning: underquoted definition of SIC_VAR_SYS_SIGLIST
config/sys_siglist.m4:20: run info '(automake)Extending aclocal'
config/sys_siglist.m4:20: or see http://sources.redhat.com/automake/automake.html#Extending-aclocal
+ libtoolize --force --copy
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config'.
libtoolize: copying file `config/ltmain.sh'
libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.in and
libtoolize: rerunning libtoolize, to keep the correct libtool macros in-tree.
libtoolize: Consider adding `-I m4' to ACLOCAL_AMFLAGS in Makefile.am.
+ autoheader
+ automake --add-missing --copy
+ autoconf
②. Build and install:
[root@localhost mmseg-3.2.14]#./configure --prefix=/usr/local/webserver/mmseg3
[root@localhost mmseg-3.2.14]#make
[root@localhost mmseg-3.2.14]#make install
(4). Test Chinese word segmentation
[root@localhost mmseg-3.2.14]#/usr/local/webserver/mmseg3/bin/mmseg -d /usr/local/webserver/mmseg3/etc src/t1.txt
Output:
中文/x 分/x 词/x 测试/x
中国人/x 上海市/x
Word Splite took: 0 ms.
This means segmentation is working
(5). Install coreseek
[root@localhost mmseg-3.2.14]#cd ..
[root@localhost coreseek-4.1-beta]#cd csft-4.1
(6). Run configure to set up the build
#sh buildconf.sh
#./configure --prefix=/usr/local/webserver/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/webserver/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/webserver/mmseg3/lib/ --with-mysql=/usr/local/webserver/mysql
Output:
configuration done
------------------
You can now run 'make install' to build and install Sphinx binaries.
On a multi-core machine, try 'make -j4 install' to speed up the build.
Updates, articles, help forum, and commercial support, consulting, training,
and development services are available at http://sphinxsearch.com/
Thank you for choosing Sphinx!
Build:
[root@localhost csft-4.1]#make
Output:
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22292: undefined reference to `libiconv_open'
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22310: undefined reference to `libiconv'
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22316: undefined reference to `libiconv_close'
This is a link error: libiconv is not being linked in. To fix it, edit ./src/Makefile and change line 157 from
LIBS = -ldl -lm -lz -lexpat -L/usr/local/lib -lrt -lpthread
to
LIBS = -ldl -lm -lz -lexpat -liconv -L/usr/local/lib -lrt -lpthread
Then run make again.
Fallback method:
make ZEND_EXTRA_LIBS='-liconv'
Then install:
make install
cd ..
yum whatprovides lexpat
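The manual Makefile edit from step (6) could also be scripted. The following is only a sketch: the function name add_iconv_to_libs is made up for this example, and it assumes the LIBS line looks like the one quoted above (it may sit on a different line number or have different contents on other systems).

```shell
#!/bin/sh
# Hypothetical helper: add -liconv to the LIBS line of a Makefile.
# Assumes the line starts with "LIBS = " and contains -lexpat, as quoted above.
add_iconv_to_libs() {
    makefile="$1"
    # Nothing to do if the LIBS line already links libiconv.
    if grep -q '^LIBS = .*-liconv' "$makefile"; then
        return 0
    fi
    tmp="$makefile.tmp.$$"
    # Insert -liconv right after -lexpat, mirroring the manual edit.
    sed 's/^\(LIBS = .*-lexpat\)/\1 -liconv/' "$makefile" > "$tmp" &&
        mv "$tmp" "$makefile"
}

# Example call, run from the csft-4.1 directory:
#   add_iconv_to_libs ./src/Makefile
```

Running it a second time leaves the file unchanged, so it is safe to call before every make.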
(7). Test the configuration:
[root@localhost csft-4.1]# /usr/local/webserver/coreseek/bin/indexer -c /usr/local/webserver/coreseek/etc/sphinx-min.conf.dist
Output:
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
ERROR: nothing to do.
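This is expected (no index was named on the command line), so the installation is working
(8). Edit the configuration file
①. Enter the etc directory: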
cd /usr/local/webserver/coreseek/etc/
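②. Rename the configuration file to csft.conf:
mv sphinx.conf csft.conf
③. Edit the configuration file:
vi csft.conf
(Useful vi keys: a enters insert mode; Esc leaves insert mode; to save and quit, type :wq)
Two indexes are needed: a main index named library and a delta (incremental) index named delta
The file contents are as follows: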
# Index sources #
source library_src {
# data source type
type = mysql
# MySQL host
sql_host = 192.168.1.206
# MySQL user name
sql_user = root
# MySQL password
sql_pass = 123
# MySQL database name
sql_db = pp_library
# MySQL port
sql_port = 3306
# MySQL query encoding. Pay special attention here: many people cannot find Chinese text because the database encoding is GBK or something other than UTF-8
sql_query_pre = SET NAMES UTF8
sql_query_pre = SET SESSION query_cache_type=OFF
# SQL that fetches the data
sql_query = SELECT cid,cid AS id,top_status,status,tid,totalnum,pubtime,lasttime, creatdate,content FROM wb_library WHERE status = 0 ORDER BY cid asc
# attribute settings, used for filtering and sorting
sql_attr_uint = id
sql_attr_uint = top_status
sql_attr_uint = status
sql_attr_uint = tid
sql_attr_uint = totalnum
sql_attr_timestamp = pubtime
sql_attr_timestamp = lasttime
}
source delta_src : library_src
{
type = mysql
sql_host = 192.168.1.206
sql_user = root
sql_pass = 123
sql_db = pp_library
sql_port = 3306
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION query_cache_type=OFF
# update the marker position before building the delta index
sql_query_pre = REPLACE INTO search_counter (counterid, max_doc_id) SELECT 1,MAX(cid) FROM wb_library
# update the marker position after building the delta index
sql_query_post = UPDATE search_counter SET min_doc_id=max_doc_id WHERE counterid=1
sql_query = SELECT cid,cid as id,top_status,status,tid,totalnum,pubtime,lasttime,creatdate,content FROM wb_library WHERE cid > (select min_doc_id FROM search_counter) AND cid <= (select max_doc_id FROM search_counter)
sql_attr_uint = id
sql_attr_uint = top_status
sql_attr_uint = status
sql_attr_uint = tid
sql_attr_uint = totalnum
sql_attr_timestamp = pubtime
sql_attr_timestamp = lasttime
sql_range_step = 1000
sql_ranged_throttle = 1000
}
# Indexes #
index library {
# source this index is built from
source = library_src
# directory and base file name of the index files
path = /home/sphinxdata/indexer/library
# how document attributes are stored
docinfo = extern
# lock cached data in memory
mlock = 0
# morphology (has no effect on Chinese)
morphology = none
# minimum length of an indexed word
min_word_len = 1
# data encoding
charset_type = utf-8
charset_dictpath = /usr/local/webserver/mmseg3/etc
# Character table. Note: with this approach Sphinx splits Chinese text into single characters,
# i.e. it builds a per-character index; for real Chinese word segmentation, a segmentation plugin such as coreseek or sfc is required
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\
U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\
U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\
U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
U+0430..U+044F,U+05D0..U+05EA, \
U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \
U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \
U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \
U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \
U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \
U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF
# minimum prefix length
min_prefix_len = 0
# minimum infix length
min_infix_len = 1
# n-gram length used to split non-alphabetic data
ngram_len = 1
}
# Delta index #
index delta : library
{
source = delta_src
path = /home/sphinxdata/indexer/delta
charset_type = utf-8
charset_dictpath = /usr/local/webserver/mmseg3/etc
docinfo = extern
ngram_len = 0
}
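# Indexer settings #
indexer{
mem_limit = 5120000k # memory limit
}
# Sphinx service process #
searchd {
# Listening address and port. Starting with this version, 9312 is the port officially assigned by IANA; earlier versions defaulted to 3312
listen = 192.168.1.239:9312
# Daemon log. When Sphinx misbehaves, useful information can usually be found here; problems with index rotation (rotate) generally show up here as well
log = /usr/local/webserver/coreseek/var/log/searchd.log
# Client query log. Author's note: to gather statistics on search keywords, analyze this log file
query_log = /usr/local/webserver/coreseek/var/log/query.log
# Request read timeout (seconds)
read_timeout = 5
# Maximum number of searchd children that may run concurrently
max_children = 50
# PID file
pid_file = /usr/local/webserver/coreseek/var/log/searchd.pid
# Maximum number of matches returned for a query
max_matches = 2000000
# Whether to rotate indexes seamlessly; usually needed when using a delta index
seamless_rotate = 1
preopen_indexes = 0
# SphinxQL compatibility mode
compat_sphinxql_magics = 0
}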
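(9). Build the index files
/usr/local/webserver/coreseek/bin/indexer -c /usr/local/webserver/coreseek/etc/csft.conf --all
(10). Start the service
/usr/local/webserver/coreseek/bin/searchd -c /usr/local/webserver/coreseek/etc/csft.conf
Because a delta index is used, scheduled jobs are needed to keep the indexes up to date:
Rebuild the main index once a day:
/usr/local/webserver/coreseek/bin/indexer library --config /usr/local/webserver/coreseek/etc/csft.conf
Rebuild the delta index every 10 minutes:
/usr/local/webserver/coreseek/bin/indexer delta --config /usr/local/webserver/coreseek/etc/csft.conf
At the same time, merge the delta index into the main index:
/usr/local/webserver/coreseek/bin/indexer --merge library delta --config /usr/local/webserver/coreseek/etc/csft.conf
The coreseek service on Linux is now set up; next, it is called through the API.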
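The scheduled re-indexing described above could be wrapped in one small script that cron calls with either main or delta. This is a sketch under stated assumptions: the paths assume the install prefix and csft.conf used earlier in this post; the function names, the DRY_RUN switch, the EXTRA_ARGS variable and the script path in the crontab comments are all made up for illustration. When searchd is already running, indexer generally needs its --rotate option to swap in the new files; it is left out by default to mirror the commands above and can be passed through EXTRA_ARGS.

```shell
#!/bin/sh
# Assumed locations, based on the install prefix used earlier in this post.
INDEXER="${INDEXER:-/usr/local/webserver/coreseek/bin/indexer}"
CONF="${CONF:-/usr/local/webserver/coreseek/etc/csft.conf}"
# Optional extra indexer flags, e.g. EXTRA_ARGS=--rotate (empty by default).
EXTRA_ARGS="${EXTRA_ARGS:-}"

# Run a command, or only print it when DRY_RUN=1 (useful for checking cron entries).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rebuild the main index (intended to run once a day).
rebuild_main() {
    run "$INDEXER" library --config "$CONF" $EXTRA_ARGS
}

# Rebuild the delta index, then merge it into the main index (every 10 minutes).
refresh_delta() {
    run "$INDEXER" delta --config "$CONF" $EXTRA_ARGS &&
        run "$INDEXER" --merge library delta --config "$CONF" $EXTRA_ARGS
}

# Dispatch on the first argument so one script can serve both cron entries.
case "${1:-}" in
    main)  rebuild_main ;;
    delta) refresh_delta ;;
esac

# Example crontab entries (the script path is hypothetical):
#   0 3 * * *     /path/to/reindex.sh main
#   */10 * * * *  /path/to/reindex.sh delta
```

Running it once with DRY_RUN=1 prints the exact indexer command lines without touching the indexes, which is a cheap way to check the paths before installing the cron entries.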
2. Call the API to perform full-text search
PHP API documentation: http://docs.php.net/manual/zh/book.sphinx.php
// Sphinx test
require WEIBO_ROOT . 'source/class/class_sphinx.php' ;
$cl = new SphinxClient ();
$sphinx = getglobal('config/sphinx' );
$cl->SetServer ( $sphinx ['host' ], $sphinx ['port' ]);
$cl->SetFilter ( "id" , array (27,28), true); // id is neither 27 nor 28
$cl->SetFilter ( "tid" , array (68, 69)); // tid is 68 or 69
$cl->SetFilter ( "status" , array (0));
$cl->SetFilter ( "top_status" , array (2));
$cl->setFilterRange( 'pubtime' , 0, $_G ['timestamp' ]); // pubtime range
// Sort by top_status descending, then by totalnum ascending. A second SetSortMode call
// would replace the first one, so both keys go into a single extended sort clause.
$cl->SetSortMode ( SPH_SORT_EXTENDED, "top_status DESC, totalnum ASC" );
$cl->setLimits(0, 5 ); // used for paging; a maximum number of matches can also be set (default 1000), see: http://docs.php.net/manual/zh/sphinxclient.setlimits.php
$res = $cl ->Query ( '我', "library" );
echo '<pre>';
print_r( $res);exit ;
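In the result, total is the total count and matches holds the matched items; the detailed records can then be looked up using the set of cid values obtained.
By default, an empty keyword returns no results. To show all records, change the match mode to SPH_MATCH_FULLSCAN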
$cl->setMatchMode(SPH_MATCH_FULLSCAN);
$res = $cl ->Query ('', "library");
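That is about it for basic coreseek usage; the main work is building the indexes and keeping them updated on a schedule. The bundled word segmentation works fine for most purposes. For more demanding requirements, its dictionary would probably need to be modified, which I have not looked into yet.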