SOLR (Full-Text Search)

 

http://sinykk.iteye.com/

1.   What is SOLR

Official sites

http://wiki.apache.org/solr

http://wiki.apache.org/solr/DataImportHandler

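This document names Solr 3.4, Tomcat 6.3, and IKAnalyzer 3.2.5 Stable as its reference versions; the paths in the examples use apache-solr-3.3.0, apache-tomcat-6.0.33, and IKAnalyzer3.2.8.jar.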

 

 

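1.1. What is SOLR

Solr is a high-performance full-text search server built on Lucene and written in Java 5. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a full-featured administration interface.

Documents are added to a search collection as XML sent over HTTP. Queries are also sent over HTTP, and the response comes back as XML or JSON. Its main features include efficient, flexible caching and vertical search.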
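As a rough illustration of the XML-over-HTTP exchange described above, the Python sketch below builds an `<add>` update message and a query URL that asks for a JSON response. The base URL and the field names (`id`, `title`) are assumptions made for the example; `/update` and `/select` are the handler paths used by the Solr example configuration. Nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base URL; adjust to your own Tomcat host and port.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query, rows=10, wt="json"):
    """Build a query URL; wt selects the response format (xml or json)."""
    params = urlencode({"q": query, "rows": rows, "wt": wt})
    return f"{SOLR_BASE}/select?{params}"


if __name__ == "__main__":
    print(build_add_xml([{"id": "doc-1", "title": "hello solr"}]))
    print(build_select_url("title:hello"))
```

The XML string would be POSTed to `SOLR_BASE + "/update"` with a `text/xml` content type, followed by a commit, before the document shows up in query results.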

 

 

 

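1.2. When to use it

1. You search database data whose primary key is not an integer, for example a UUID.

2. You search any kind of text document, including RSS feeds, email, and so on.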

 

 

2.   How to use Solr

Install a Solr server on Windows or Linux, configure the indexing rules for it, and then call and query it from Java, PHP, or another scripting language.

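2.1. Installing Solr on Windows

1. Download the required software, then install and configure Tomcat.

The software needed is Tomcat, Solr, and JDK 1.6; all are free downloads from their official sites.

2. Edit the Tomcat configuration file conf\server.xml.

Add the encoding setting URIEncoding="UTF-8" (without it, Chinese queries arrive garbled and return no results).
After the change the Connector reads:
<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />

3. Unzip the Solr distribution to D:\solr\apache-solr-3.3.0.

4. Create the Solr home directory d:/solr/home (pick whatever location suits you) and copy the contents of D:\solr\apache-solr-3.3.0\example\solr into it.

5. Create the solr.home environment variable and set it to d:/solr/home.

6. Copy solr.war into Tomcat's webapps directory; it is unpacked automatically when Tomcat starts.

7. Edit D:\resouce\java\tomcat\webapps\solr\WEB-INF\web.xml and add:

<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>d:\solr\home</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
8. Start Tomcat and open http://localhost:8080/solr/ in a browser (use the port set on your Connector; the snippet in step 2 uses 8983).
9. If the Solr page appears, the deployment succeeded.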
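To see why the URIEncoding="UTF-8" setting in server.xml matters, the following Python sketch percent-encodes a Chinese query the way a UTF-8 client would, then decodes it with two different charsets. Tomcat 6 decodes request URIs as ISO-8859-1 unless told otherwise, so the second decoding mimics what Solr would receive without the setting. The sample query string is only an illustration.

```python
from urllib.parse import quote, unquote

query = "全文检索"  # sample Chinese search term

# A UTF-8 client percent-encodes each of these characters as three bytes.
encoded = quote(query, encoding="utf-8")

# Decoding with the same charset restores the query ...
decoded_utf8 = unquote(encoded, encoding="utf-8")

# ... while decoding as ISO-8859-1 turns every byte into its own character,
# so the search term no longer matches anything in the index.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

if __name__ == "__main__":
    print(encoded)
    print(decoded_utf8 == query)
    print(decoded_latin1 == query)
```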

 

      

2.2. Installing Solr on Linux

This Linux installation sets up Chinese word segmentation at the same time.

1. Unpack Tomcat to /usr/local/apache-tomcat-6.0.33/

2. Copy the /solr/apache-solr-3.3.0/example/solr directory to /usr/local/apache-tomcat-6.0.33/

3. Edit Tomcat's /usr/local/apache-tomcat-6.0.33/conf/server.xml [adds Chinese support]:

<Connector port="8983" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" URIEncoding="UTF-8"/>


4. Create the file /usr/local/apache-tomcat-6.0.33/conf/Catalina/localhost/solr.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<Context docBase="/usr/local/apache-tomcat-6.0.33/webapps/solr" debug="0" crossContext="true" >
   <Environment name="solr/home" type="java.lang.String" value="/usr/local/apache-tomcat-6.0.33/solr" override="true" />
</Context>


5. Copy /sinykk/solr/apache-solr-3.3.0/example/webapps/solr.war into the /usr/local/apache-tomcat-6.0.33/webapps folder and start Tomcat.

6. Copy /sinykk/solr/IKAnalyzer3.2.8.jar into the /usr/local/apache-tomcat-6.0.33/webapps/solr/WEB-INF/lib directory.

7. Change /usr/local/apache-tomcat-6.0.33/solr/conf/schema.xml to the following:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">
 <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

    <!--
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    -->

    <fieldType name="textik" class="solr.TextField" >
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>

      <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

      <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
 </types>

 <fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
 </fields>

 <uniqueKey>id</uniqueKey>

</schema>


Finally, open http://192.168.171.129:8983/solr/admin/analysis.jsp to try out the analyzer.

 

 

 

2.3. Using a MySQL database as the Solr index data source

Pay attention to the format of the configuration files.

Reference: http://digitalpbk.com/apachesolr/apache-solr-mysql-sample-data-config

1. Add the following to solrconfig.xml to enable the data import feature:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
           <str name="config">data-config.xml</str>
       </lst>
</requestHandler>

 

 

 

2. Add a data source file, data-config.xml, with the following content (each field column must match a column returned by the query):

<dataConfig>
    <dataSource type="JdbcDataSource"
   driver="com.mysql.jdbc.Driver"
   url="jdbc:mysql://localhost/test"
   user="root"
   password=""/>
    <document name="content">
        <entity name="node" query="select id,name,title from solrdb">
            <field column="id" name="id" />
            <field column="name" name="name" />
            <field column="title" name="title" />
        </entity>
    </document>
</dataConfig>

 

3. Create schema.xml with the following content:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">
  <types>
     <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

     <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
  </types>

 <fields>
   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="title" type="string" indexed="true" stored="true"/>
   <field name="contents" type="text" indexed="true" stored="true"/>
 </fields>

 <uniqueKey>id</uniqueKey>
 <defaultSearchField>contents</defaultSearchField>
 <solrQueryParser defaultOperator="OR"/>
 <copyField source="title" dest="contents"/>

</schema>


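Important elements in schema.xml

Solr needs the copyField entries to search across several fields at once. The settings below make a default search cover the values of title, name, and contents:
<defaultSearchField>contents</defaultSearchField>
copyField copies the value of one field into another. Copying name into the default search field, for example, means Solr also matches the content of name when it searches. Every source field must be declared under <fields>; the schema above declares only id, title, and contents, so add a name field before using the first copyField line below. A destination that receives more than one value per document, as contents does here, should also be declared with multiValued="true".
<copyField source="name" dest="contents"/>
<copyField source="title" dest="contents"/>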

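4. Build the index

http://192.168.171.129:8983/solr/dataimport?command=full-import

Note: make sure the database connection works (the MySQL JDBC driver JAR must also be on Solr's classpath).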
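The import can also be triggered from a script instead of a browser. The sketch below is one way to do it in Python, assuming the host and port from the URL above; the function names are made up for this example. Besides `full-import`, DataImportHandler accepts `command=status`, which reports whether an import is still running.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port taken from the URL above; change them for your server.
SOLR_BASE = "http://192.168.171.129:8983/solr"


def dataimport_url(command, **params):
    """Return the /dataimport URL for a DataImportHandler command."""
    query = {"command": command}
    query.update(params)
    return f"{SOLR_BASE}/dataimport?{urlencode(query)}"


def run_command(command, timeout=10, **params):
    """Send the command to Solr and return the raw response body."""
    with urlopen(dataimport_url(command, **params), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call run_command() against a live server.
    print(dataimport_url("full-import"))
    print(dataimport_url("status"))
```

Calling `run_command("full-import")` starts the import and returns immediately; polling `run_command("status")` afterwards shows its progress.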

 

 

 

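2.4. Running multiple SOLR indexes side by side (multiple cores)

Reference: http://wiki.apache.org/solr/CoreAdmin

1. Configure the cores in solr.xml (in the Solr home directory):

<solr persistent="true" sharedLib="lib">
 <cores adminPath="/admin/cores">
  <core name="core0" instanceDir="core0" dataDir="D:\solr\home\core0\data"/>
  <core name="core1" instanceDir="core1" dataDir="D:\solr\home\core1\data" />
 </cores>
</solr>

2. Copy the core0 and core1 directories from D:\solr\apache-solr-3.3.0\example\multicore into D:\solr\home; leave the directories and files already in D:\solr\home unchanged.

Note: D:\solr\home is a copy of D:\solr\apache-solr-3.3.0\example\solr.

3. Create the two index data directories:
D:\solr\home\core0\data
D:\solr\home\core1\data

4. Modify one of the cores, for example core1.
Change its solrconfig.xml to the following.
[Note: the lib tags are needed mainly because DataImportHandler otherwise throws an error; this may be an upstream bug.]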

 

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_33</luceneMatchVersion>
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

  <lib dir="D:/solr/apache-solr-3.3.0/contrib/extraction/lib" />
  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-clustering-\d.*\.jar" />
  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
  <lib dir="D:/solr/apache-solr-3.3.0/contrib/clustering/lib/" />
  <lib dir="/total/crap/dir/ignored" />
  <updateHandler class="solr.DirectUpdateHandler2" />

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
  </requestDispatcher>

  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
</config>


Finally, open http://localhost:8080/solr/core1/admin/
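When the number of cores grows, the solr.xml from step 1 can be generated rather than edited by hand. The Python sketch below is a minimal illustration of that idea, assuming the Solr home and Tomcat URL used in this section; the function names are hypothetical and the output would still need to be saved as solr.xml in the Solr home.

```python
import xml.etree.ElementTree as ET

SOLR_HOME = r"D:\solr\home"              # Solr home used in this section
BASE_URL = "http://localhost:8080/solr"  # Tomcat URL used in this section


def build_solr_xml(core_names):
    """Build a solr.xml document with one <core> element per name."""
    solr = ET.Element("solr", persistent="true", sharedLib="lib")
    cores = ET.SubElement(solr, "cores", adminPath="/admin/cores")
    for name in core_names:
        ET.SubElement(
            cores,
            "core",
            name=name,
            instanceDir=name,
            dataDir=f"{SOLR_HOME}\\{name}\\data",
        )
    return ET.tostring(solr, encoding="unicode")


def admin_url(core_name):
    """Each core gets its own admin page under /solr/<core>/admin/."""
    return f"{BASE_URL}/{core_name}/admin/"


if __name__ == "__main__":
    print(build_solr_xml(["core0", "core1"]))
    print(admin_url("core1"))
```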

 

 

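2.5. Fully automatic near-real-time full-text search (incremental indexing)

Every search runs against the index as it was last built, so newly added data cannot be found until the index is updated. Some extra configuration is needed to get near-real-time search.

Approach: set up two data sources and two indexes. Build a main index over data that is rarely or never updated, and an incremental index over newly added documents.

The main change is in the data-config.xml data source: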

 

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/demo" user="root" password=""/>
    <document name="products">
       <entity name="item" pk="id"
          query="SELECT id,title,contents,last_index_time FROM solr_articles"
          deltaImportQuery="SELECT id,title,contents,last_index_time FROM solr_articles
            WHERE id = '${dataimporter.delta.id}'"
          deltaQuery="SELECT id FROM solr_articles
            WHERE last_index_time > '${dataimporter.last_index_time}'">
        </entity>
    </document>
</dataConfig>

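Make sure the related database table exists.

In this example the solr_articles table has a last_index_time (timestamp) column. Update last_index_time whenever a row is inserted or changed, so that the incremental index picks up the change.

If something goes wrong, check Tomcat's log files right away.

Run: http://192.168.171.129:8983/solr/dataimport?command=delta-import

If the result is not what you expect, check the date stored in the dataimport.properties file and run the resulting SQL query by hand to narrow down the problem.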
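The two-step logic behind deltaQuery and deltaImportQuery can be tried out without Solr. The sketch below is a simulation only, not DataImportHandler code: it uses Python's built-in SQLite in place of MySQL, the sample rows are invented, and the `last_index_time` variable stands in for the value Solr keeps in dataimport.properties.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names follow the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE solr_articles ("
    "id INTEGER PRIMARY KEY, title TEXT, contents TEXT, last_index_time TEXT)"
)
conn.executemany(
    "INSERT INTO solr_articles VALUES (?, ?, ?, ?)",
    [
        (1, "old article", "indexed earlier", "2014-09-01 10:00:00"),
        (2, "edited article", "changed after the import", "2014-09-03 12:00:00"),
        (3, "new article", "added after the import", "2014-09-03 12:30:00"),
    ],
)

# Stand-in for the timestamp Solr reads from dataimport.properties.
last_index_time = "2014-09-02 00:00:00"


def delta_import(conn, last_index_time):
    """Mimic the two steps: deltaQuery finds ids, deltaImportQuery loads rows."""
    changed_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM solr_articles WHERE last_index_time > ? ORDER BY id",
            (last_index_time,),
        )
    ]
    rows = []
    for doc_id in changed_ids:
        rows.append(
            conn.execute(
                "SELECT id, title, contents, last_index_time "
                "FROM solr_articles WHERE id = ?",
                (doc_id,),
            ).fetchone()
        )
    return rows


if __name__ == "__main__":
    for row in delta_import(conn, last_index_time):
        print(row)
```

Running the same two queries by hand against the real table, with the timestamp copied from dataimport.properties, is a quick way to see which rows a delta-import would pick up.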

 

 

 

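Once the main index and the incremental index are in place, set up two scheduled jobs (Linux crontab):

An incremental-index job every five minutes: it updates the incremental index and merges it with the main index, so that all data older than five minutes is searchable.

A main-index rebuild every day at 2 a.m.: it also clears the incremental index, which keeps the main index efficient and reduces duplicate data.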

 

 

 

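2.6. Distributed indexing

Solr's distribution is really replication, much like MySQL replication. All index changes are made on the master server, and all queries go to the slave servers. The slaves pull from the master periodically to keep the data consistent.

Reference: http://chenlb.blogjava.net/archive/2008/07/04/212398.html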

 

 

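2.7. Improving result accuracy

To make search results accurate, you can:

1. Build your own word-segmentation dictionary

2. Update the index through document updates whenever data is updated, added, or deleted

3. Use incremental indexing with scheduled updates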
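Item 2 above can be sketched in Python as follows. Solr's XML update handler accepts `<add>`, `<delete>`, and `<commit/>` messages, and adding a document whose uniqueKey already exists replaces the old version. The function names are hypothetical, and the messages would still need to be POSTed to the `/update` handler, which this sketch does not do.

```python
import xml.etree.ElementTree as ET


def build_delete_by_id(doc_ids):
    """Build a <delete> message that removes documents by unique key."""
    delete = ET.Element("delete")
    for doc_id in doc_ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def build_commit():
    """Build the <commit/> message that makes pending changes searchable."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


if __name__ == "__main__":
    print(build_delete_by_id([7, "abc"]))
    print(build_commit())
```

Sending the delete (or a fresh add) right after the database row changes, followed by a commit, keeps the index in step with the data between scheduled imports.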

 

2.8. Configuring word segmentation in SOLR

See the Linux installation section of this document (2.2).

 

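2.9. PHP client for SOLR

Access the index data in SOLR from PHP.

Reference: http://code.google.com/p/solr-php-client/

A simple example: http://code.google.com/p/solr-php-client/wiki/ExampleUsage

Note: this is similar to the SPHINX search engine, which is written in C.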

 

3.   Other references

 

posted @ 2014-09-03 17:59  GisClub