Full-Text Search

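While researching full-text search under Rails 3 a few days ago, I found two good candidates:

1. Lucene (Solr and Elasticsearch are both built on it)

2. Sphinx

Both have a very good reputation, so today I dug further. I am recording the valuable articles I came across here: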

1. http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql 

------------ 
Answer 1.  Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
Relevance is the default sort order. You can define your own sorting, and you can give individual fields higher weights.
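
To make the idea of per-field weighting concrete, here is a minimal Python sketch. It is not Sphinx's ranking algorithm: the scoring (weighted term counts), the field names and the weights are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical weights: a hit in the title counts three times as much as one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def score(doc, query, weights=FIELD_WEIGHTS):
    """Sum the weighted term frequencies of the query terms across fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        counts = Counter(doc.get(field, "").lower().split())
        total += weight * sum(counts[t] for t in terms)
    return total

def search(docs, query):
    """Return matching docs, highest score first (relevance as the default order)."""
    scored = [(score(d, query), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored if s > 0]

docs = [
    {"id": 1, "title": "cooking pasta", "body": "search for a good sauce"},
    {"id": 2, "title": "full text search", "body": "engines compared"},
]
ranked = search(docs, "search")
```

Document 2 ranks first here because its match is in the more heavily weighted title field.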

Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either. 
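Because it talks directly to the database, indexing is very fast. Any slowness comes from complex SQL queries or un-indexed foreign keys, and the answerer never noticed slow searches either.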

The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too. 
The search daemon (searchd) uses very little memory, and the memory used by the indexer process can be capped as well.

Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with. 
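Scalability: the answerer knows less about this. Index files are easy to copy to several machines, each running its own searchd. By other people's accounts a single machine holds up well under high load, so scaling out across machines is rarely needed.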

There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches. 
There is no "did-you-mean" query correction. Sphinx does stem words using dictionaries, so 'driving' and 'drive' are treated as the same term in searches.
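
As a rough illustration of what stemming buys you, here is a toy Python stemmer. Note the swap: the answer says Sphinx stems with dictionaries, while this sketch uses crude suffix stripping instead. The function names and suffix list are made up for the example.

```python
def toy_stem(word):
    """Very crude suffix stripping; real stemmers handle far more cases."""
    word = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Terms stored in the index are the stems, not the surface forms."""
    return {toy_stem(w) for w in text.split()}

def matches(query_word, text):
    """A query word matches if its stem is among the indexed stems."""
    return toy_stem(query_word) in index_terms(text)
```

With this, `matches("drive", "Driving home")` is true because both words reduce to the same stem.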

------------ 
Answer 2.
I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly. 
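The answerer has not used Sphinx, but compared with database full-text search, Lucene's performance is unmatched: almost any search finishes in under 10 ms regardless of the number of records, provided the index is set up correctly.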

Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene. 
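The biggest hurdle: integrating Lucene into a project is not easy. A prototype with basic search is simple enough, but getting optimal performance out of it takes a good book on Lucene.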

As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle. 
Searching puts little load on the CPU. Indexing puts more, but since reindexing probably happens only once or twice a day, that is not much of a problem.

http://stackoverflow.com/a/2288211/445908 
The answer from the author of ElasticSearch:

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place
As the author of ElasticSearch, he explains why he started the project.

Using pure Lucene is challenging. There are many things that you need to take care for if you want it to really perform well, and also, its a library, so no distributed support, its just an embedded Java library that you need to maintain. 
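Using plain Lucene is challenging: there is a lot to take care of to make it perform well, and it is only an embedded Java library, so it has no distributed support.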

In terms of Lucene usability, way back when (almost 6 years now), I created Compass. Its aim was to simplify using Lucene and make everyday Lucene simpler. What I came across time and time again is the requirement to be able to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence and Terracotta, but its not enough. 
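On usability: almost six years earlier he created Compass to make everyday use of Lucene simpler. The requirement he kept running into was making Compass distributed. He started on that from within Compass by integrating data grid solutions (GigaSpaces, Coherence, Terracotta), but that was not enough.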

At its core, a distributed Lucene solution needs to be sharded. Also, with the advancement of HTTP and JSON as ubiquitous APIs, it means that a solution that many different systems with different languages can easily be used. 
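At its core, a distributed Lucene solution needs to be sharded (partitioned horizontally). And with HTTP and JSON now ubiquitous as APIs, a solution that exposes them can be called easily from systems written in any language.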
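
A minimal Python sketch of what "sharded" means in practice: documents are routed to a shard by hashing their id, and a query is sent to every shard and the hits are merged. This is not how ElasticSearch is implemented; the class, the shard count and the substring-free term matching are assumptions for illustration.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard using a stable hash of its id."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards

class ShardedIndex:
    def __init__(self, num_shards=NUM_SHARDS):
        # Each shard is a plain dict here; in a real system it would be a separate index.
        self.shards = [dict() for _ in range(num_shards)]

    def add(self, doc_id, text):
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = text

    def search(self, term):
        """Scatter the query to every shard, then gather and merge the hits."""
        hits = []
        for shard in self.shards:
            hits.extend(
                doc_id for doc_id, text in shard.items()
                if term in text.lower().split()
            )
        return sorted(hits)

index = ShardedIndex()
for doc_id, text in [
    (1, "lucene is a library"),
    (2, "sphinx talks to mysql"),
    (3, "solr wraps lucene"),
]:
    index.add(doc_id, text)
```

The caller never needs to know which shard holds which document; `index.search("lucene")` looks the same as a query against a single index.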


This is why I went ahead and created ElasticSearch. It has a very advance distributed model, speaks natively JSON, and exposes many advance search features, all seamlessly expressed through JSON DSL. 
This is why he created ElasticSearch: it has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all expressed through a JSON DSL.
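
To show what a JSON DSL request looks like, here is a small Python sketch that only builds the request path and body for a `match` query against the `_search` endpoint. Nothing is sent over the network. The index name `articles` and the field `title` are hypothetical, and the exact query syntax may differ between ElasticSearch versions.

```python
import json

def build_search_request(index, field, text, size=10):
    """Return (path, body) for a search request; nothing is sent anywhere."""
    path = "/{}/_search".format(index)
    body = {
        "query": {"match": {field: text}},  # full-text match on a single field
        "size": size,                       # maximum number of hits to return
    }
    return path, json.dumps(body)

# "articles" and "title" are made-up names for this example.
path, body = build_search_request("articles", "title", "full text search")
```

Because the request is plain HTTP plus JSON, any language with an HTTP client could send the same body, which is the point the answer makes.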

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (though currently lacking on some of the search features, but not for long, and in any case, the plan is to get all Compass features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself. 
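Solr also exposes an indexing/search server over HTTP, but he argues that ElasticSearch has a much better distributed model and is easier to use (it still lacked some search features at the time, though the plan was to bring all Compass features into ElasticSearch). As the author he is naturally biased, so it is best to check for yourself.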

As for Sphinx, I have not used it, so I can't comment. What I can refer you is to this thread at Sphinx forum which I think proves the superior distributed model of ElasticSearch. 
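He has not used Sphinx, so he does not comment on it. He does point to a thread on the Sphinx forum which, in his view, shows that ElasticSearch has the better distributed model.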

Of course, ElasticSearch has many more features then just being distributed. It is actually built with cloud in mind. You can check the feature list on the site. 
ElasticSearch has many more features than just being distributed. It was built with the cloud in mind, and the feature list on its site has the details.

http://stackoverflow.com/q/1284083/445908 
I've been using Solr successfully for almost 2 years now, and have never used Sphinx, so I'm obviously biased. However, I'll try to keep it objective by quoting the docs or other people. I'll also take patches to my answer :-) 
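The answerer has used Solr successfully for almost two years and has never used Sphinx, so he admits he is biased. He tries to stay objective by quoting the docs and other people, and welcomes corrections to his answer.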

Similarities:

Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
Both meet the requirements: they are fast and built to index and search large bodies of data efficiently.
Both have a long list of high-traffic sites using them (Solr, Sphinx)
Both list many high-traffic sites among their users.
Both offer commercial support. (Solr, Sphinx)
Both have commercial support.
Both offer client API bindings for several platforms/languages (Sphinx, Solr)
Both provide client API bindings for several languages.
Both can be distributed to increase speed and capacity (Sphinx, Solr)
Both can be distributed.

Here are some differences:

Solr, being an Apache project, is obviously Apache2-licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just "use") Sphinx in a commercial application, you'll have to buy a commercial license (rationale) 
Solr is an Apache project under the Apache 2 license. Sphinx is GPLv2, so embedding or extending it (not merely using it) in a commercial application means buying a commercial license.

Solr is easily embeddable in Java applications. 
Solr is easy to embed in Java applications.

Solr is built on top of Lucene, which is a proven technology over 8 years old with a huge user base (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers. 
Solr is built on Lucene, which is over 8 years old and has a huge user base. Whatever feature or speedup Lucene gains, Solr gains too, and many Solr developers are also Lucene committers.

Sphinx integrates more tightly with RDBMSs, especially MySQL. 
Solr can be integrated with Hadoop to build distributed applications 
Solr can be integrated with Nutch to quickly build a fully-fledged web search engine with crawler. 
Sphinx integrates more tightly with RDBMSs, especially MySQL. Solr can be integrated with Hadoop to build distributed applications, and with Nutch to build a full web search engine with a crawler.

Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can't. 
Solr can index Word, PDF and similar formats. Sphinx cannot.

Solr comes with a spell-checker out of the box. 
Solr ships with a spell-checker.

Solr comes with facet support out of the box. Faceting in Sphinx takes more work. 
Solr supports faceting out of the box. In Sphinx it takes extra work.
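
For readers who have not met the term, faceting means counting how the matching documents break down by some field, such as category. A minimal Python sketch follows; the `category` field and the sample data are made up, and this is not Solr's implementation.

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many matching documents fall under each value of `field`."""
    return Counter(doc[field] for doc in results if field in doc)

# Pretend these are the hits returned for some query.
results = [
    {"id": 1, "category": "book"},
    {"id": 2, "category": "video"},
    {"id": 3, "category": "book"},
    {"id": 4},  # documents without the field are simply skipped
]
facets = facet_counts(results, "category")
```

A search UI would typically show these counts next to filter links, for example "book (2)" and "video (1)".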

Sphinx doesn't allow partial index updates for field data. 
Sphinx does not support partial index updates for field data.

In Sphinx, all document ids must be unique unsigned non-zero integer numbers. Solr doesn't even require an unique key for many operations, and unique keys can be either integers or strings. 
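In Sphinx, every document id must be a unique, unsigned, non-zero integer (the wording presumably borrows C type terminology). Solr does not even require a unique key for many operations, and its unique keys can be integers or strings.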

Solr supports field collapsing (currently as an additional patch only) to avoid duplicating similar results. Sphinx doesn't seem to provide any feature like this. 
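Solr supports field collapsing (at the time only as an additional patch) to avoid showing several near-duplicate results. Sphinx does not seem to offer anything similar.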
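
Field collapsing can be pictured as keeping only the best hit for each value of some field. A minimal Python sketch, assuming the results are already sorted by relevance; the `site` field and sample data are hypothetical, and Solr's actual feature has more options than this.

```python
def collapse(results, field):
    """Keep only the first (highest-ranked) hit for each value of `field`."""
    seen = set()
    collapsed = []
    for doc in results:  # results are assumed to be sorted by relevance already
        key = doc.get(field)
        if key not in seen:
            seen.add(key)
            collapsed.append(doc)
    return collapsed

results = [
    {"id": 1, "site": "a.example"},
    {"id": 2, "site": "a.example"},
    {"id": 3, "site": "b.example"},
]
top_per_site = collapse(results, "site")
```

Here the second hit from `a.example` is dropped, so one site cannot fill the whole first page of results.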

While Sphinx is designed to only retrieve document ids, in Solr you can directly get whole documents with pretty much any kind of data, making it more independent of any external data store and it saves the extra roundtrip. 
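Sphinx only retrieves document ids, whereas Solr can return whole documents directly, which makes it less dependent on an external data store and saves a round trip.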
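
The difference shows up in how the calling code looks. In the Python sketch below, a dict stands in for the external database and a second dict stands in for the search index; both, along with the function names, are assumptions for illustration rather than either engine's API.

```python
# Stand-in for the external data store (for example, a MySQL table).
DATABASE = {
    1: {"id": 1, "title": "Lucene in practice"},
    2: {"id": 2, "title": "Sphinx and MySQL"},
    3: {"id": 3, "title": "Solr on top of Lucene"},
}

# Stand-in for the search index: term -> matching document ids.
INDEX = {"lucene": [1, 3], "sphinx": [2], "solr": [3]}

def search_ids(term):
    """Ids-only style: the engine returns document ids and nothing else."""
    return INDEX.get(term.lower(), [])

def search_with_lookup(term):
    """Ids-only style, step two: a second round trip to fetch the rows."""
    return [DATABASE[doc_id] for doc_id in search_ids(term)]

def search_documents(term, stored=DATABASE):
    """Whole-document style: the engine stores the fields and returns them directly."""
    return [stored[doc_id] for doc_id in INDEX.get(term.lower(), [])]
```

Both styles return the same rows; the ids-only style just needs the extra lookup against the database.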

Solr, except when used embedded, runs in a Java web container such as Tomcat or Jetty, which require additional specific configuration and tuning (or you can use the included Jetty and just launch it with java -jar start.jar). Sphinx has no additional configuration. 
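Unless it is embedded, Solr runs in a Java web container such as Tomcat or Jetty, which needs its own configuration and tuning (or you can use the bundled Jetty and launch it with java -jar start.jar). Sphinx needs no such additional configuration.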


http://www.wikivs.com/wiki/Lucene_vs_Sphinx 
One important point: Sphinx does not support live index updates, or supports them only in a very limited way.

There is also a slide deck worth reading:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

posted @ 2015-10-27 08:32  caijinhao