How to Embed Your Own Tokenizer in Solr

The hookup method Solr recommends (which failed for me) is to configure the tokenizer in the schema.xml field configuration file:

Write your own TokenizerFactory, extending Solr's BaseTokenizerFactory, then find the following configuration node and replace the tokenizer's class attribute with your factory:
```xml
<tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
```

The surrounding analyzer configuration in schema.xml:

```xml
<analyzer type="index">
  <!-- <tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/> -->
  <tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
  <!-- in this example, we will only use synonyms at query time
  <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
  -->
  <!-- Case insensitive stop word removal.
       add enablePositionIncrements=true in both the index and query
       analyzers to leave a 'gap' for more accurate phrase queries. -->
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
  <!-- <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory14"/> -->
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```
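Regardless of how the factory gets wired in, the factory's only job is to produce a Tokenizer that performs the actual segmentation. As a toy illustration of that core step (this is a plain character-bigram splitter, the baseline strategy Lucene's CJKAnalyzer uses; it is not the IK algorithm and has no Solr dependency):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    // Toy CJK-style bigram segmentation: every adjacent pair of characters
    // becomes a token. A real TokenizerFactory would wrap logic like this
    // inside a Lucene Tokenizer instead of returning a List.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text.length() == 1) {
            tokens.add(text);          // single character: emit as-is
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("全文检索"));  // prints [全文, 文检, 检索]
    }
}
```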
I turned the IK source code inside out and still could not see any difference between my tokenizer's hookup and theirs: either query-time tokenization failed, or the documents written to the index showed no tokenization at all. Out of frustration I dove into the Solr source code, and after a long battle it finally paid off. Completely solved!

Enough talk; here is the solution.

Note: because this approach modifies part of Solr's own code, the tokenizer configuration in schema.xml is ignored entirely, but all other field settings in schema.xml still apply.

Before the patch, here is how Solr loads schema.xml:
Call stack:

```java
// org.apache.solr.core.SolrCore, line 520
schema = new IndexSchema(config, IndexSchema.DEFAULT_SCHEMA_FILE, null);

// org.apache.solr.schema.IndexSchema, line 103
readSchema(lis);
```
Solr initializes the tokenizers and analyzers through the private method `readSchema(InputStream is)`: it instantiates each type declared in schema.xml and stores the resulting analyzers in `protected final HashMap<String, Analyzer> analyzers`, where the rest of the system looks them up.
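The per-fieldType XPath extraction that readSchema() performs can be reproduced with the JDK alone. A minimal self-contained sketch (the XML fragment and the `demo.*` class names are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SchemaXPathDemo {

    // A made-up fieldType fragment shaped like the ones in schema.xml.
    static final String XML =
          "<fieldType name=\"text\" class=\"solr.TextField\">"
        + "<analyzer type=\"index\"><tokenizer class=\"demo.IndexTok\"/></analyzer>"
        + "<analyzer type=\"query\"><tokenizer class=\"demo.QueryTok\"/></analyzer>"
        + "</fieldType>";

    // Evaluate the same XPath expression readSchema() uses to find the
    // query-time analyzer node, and return its tokenizer's class attribute.
    static String queryTokenizerClass() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Element analyzer = (Element) xpath.evaluate("./analyzer[@type='query']",
                doc.getDocumentElement(), XPathConstants.NODE);
        Element tokenizer = (Element) analyzer.getElementsByTagName("tokenizer").item(0);
        return tokenizer.getAttribute("class");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queryTokenizerClass());  // prints demo.QueryTok
    }
}
```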
The function we modify this time is readSchema() of org.apache.solr.schema.IndexSchema. Because we want to slip our own analyzer in partway through, insert the following code at the point in that function where the fieldType elements are loaded:
```java
try {
    AbstractPluginLoader<FieldType> fieldLoader = new AbstractPluginLoader<FieldType>(
            "[schema.xml] fieldType", true, true) {
        @Override
        protected FieldType create(ResourceLoader loader, String name,
                String className, Node node) throws Exception {
            FieldType ft = (FieldType) loader.newInstance(className);
            ft.setTypeName(name);

            String expression = "./analyzer[@type='query']";
            Node anode = (Node) xpath.evaluate(expression, node, XPathConstants.NODE);
            Analyzer queryAnalyzer = readAnalyzer(anode);

            // An analyzer without a type specified, or with type="index"
            expression = "./analyzer[@type='index']";
            anode = (Node) xpath.evaluate(expression, node, XPathConstants.NODE);
            Analyzer analyzer = readAnalyzer(anode);

            if (queryAnalyzer == null) queryAnalyzer = analyzer;
            if (analyzer == null) analyzer = queryAnalyzer;
            if (analyzer != null) {
                if (ft != null && className.equals("solr.TextField")) {
                    // Force our own analyzer onto every text field.
                    ft.setAnalyzer(AnalyzerManager.getAnalyzer());
                    ft.setQueryAnalyzer(AnalyzerManager.getAnalyzer());
                } else {
                    ft.setAnalyzer(analyzer);
                    ft.setQueryAnalyzer(analyzer);
                }
            }
            return ft;
        }
        // ... the rest of the anonymous loader and the enclosing try/catch
        // continue as in the original readSchema() source
```
Inside the body of

```java
protected FieldType create(ResourceLoader loader, String name,
        String className, Node node) throws Exception {
```

we test className. Since we want to override the solr.TextField type, i.e. replace the tokenizer used for text fields, add the following check:
```java
if (analyzer != null) {
    if (ft != null && className.equals("solr.TextField")) {
        // Route every text field through our own analyzer.
        ft.setAnalyzer(AnalyzerManager.getAnalyzer());
        ft.setQueryAnalyzer(AnalyzerManager.getAnalyzer());
    } else {
        ft.setAnalyzer(analyzer);
        ft.setQueryAnalyzer(analyzer);
    }
}
```
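The patch relies on an `AnalyzerManager` class that this post never shows. My assumption is that it is a static holder that builds the custom analyzer once and hands out the shared instance. A self-contained sketch of that holder pattern, with a placeholder class standing in for the real Lucene `Analyzer` subclass:

```java
// Hypothetical reconstruction of the AnalyzerManager referenced above.
// MyAnalyzer is a placeholder; in the real patch the returned type would
// be a subclass of org.apache.lucene.analysis.Analyzer.
public class AnalyzerManager {

    static class MyAnalyzer {
        final String name = "my-custom-analyzer";  // stand-in for the real analyzer
    }

    private static volatile MyAnalyzer instance;

    // Lazily build the analyzer once; double-checked locking keeps this
    // thread-safe, since Solr may resolve field types concurrently.
    public static MyAnalyzer getAnalyzer() {
        if (instance == null) {
            synchronized (AnalyzerManager.class) {
                if (instance == null) {
                    instance = new MyAnalyzer();
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        // Every caller sees the same shared instance.
        System.out.println(getAnalyzer() == getAnalyzer());  // prints true
    }
}
```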
OK, restart Solr and give it a try. Works, doesn't it?