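Nutch: writing an IndexingFilter plugin
Reference: http://blog.csdn.net/amuseme_lu/article/details/6780244
1 Create a package structure similar to that of urlfilter-regex
How to generate the code path: http://www.cnblogs.com/i80386/archive/2012/09/04/2670670.html
2 Write the IndexingFilter implementation class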
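package org.apache.nutch.indexer.myfield;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {
    public static final Log LOG = LogFactory.getLog(MyIndexingFilter.class);
    private Configuration conf;

    // Declare how the "mt" field is stored and indexed.
    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions("mt", LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }

    private NutchDocument addMyField(NutchDocument doc) {
        System.out.println("银河系");
        String value = "银河系";
        // A fixed value is used here; in practice the target field
        // should be extracted from the HTML.
        doc.add("mt", value);
        return doc;
    }

    // Called for every document that is about to be indexed.
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        addMyField(doc);
        return doc;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration arg0) {
        this.conf = arg0;
    }
}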
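The filter above writes a fixed value, while the comment says the value should really come from the HTML. As a rough sketch of that extraction step only, the standalone class below pulls the content of a `<meta>` tag out of an HTML string with a regular expression. The class name `MetaFieldExtractor`, the method `extractMeta`, and the assumption that the page carries a `<meta name="mt" content="...">` tag are all hypothetical; how the HTML (or a value derived from it) reaches the indexing filter is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, independent of Nutch: extracts the content attribute
// of <meta name="..." content="..."> from an HTML string.
public class MetaFieldExtractor {

    // Assumes the name attribute appears before the content attribute.
    // Returns null when no matching tag is found.
    public static String extractMeta(String html, String name) {
        Pattern p = Pattern.compile(
                "<meta\\s+name=[\"']" + Pattern.quote(name)
                        + "[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"mt\" content=\"galaxy\">"
                + "</head><body></body></html>";
        System.out.println(extractMeta(html, "mt"));
    }
}
```

A regular expression is only good enough for a quick experiment; a real plugin would more likely rely on the parsed document than on pattern matching against raw markup.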
3 Build the jar package (build fat jar)
4 Create plugin.xml
<plugin id="index-myfield" name="my Indexing Filter" version="1.0.0" provider-name="nutch.org">
    <runtime>
        <library name="myfield.jar">
            <export name="*"/>
        </library>
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints"/>
    </requires>
    <extension id="org.apache.nutch.indexer.myfield"
               name="Nutch My Indexing Filter"
               point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="MyIndexingFilter"
                        class="org.apache.nutch.indexer.myfield.MyIndexingFilter"/>
    </extension>
</plugin>
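The `class` attribute of `<implementation>` has to name the filter class with its full package, so a typo there is an easy mistake to make. The sketch below reads that attribute back with the JDK's XML parser so it can be compared against the real class name. The class `PluginXmlCheck` and its method are hypothetical, and the XML is a shortened inline copy of the descriptor rather than the file on disk.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sanity check: read the implementation class out of a plugin.xml string.
public class PluginXmlCheck {

    // Returns the class attribute of the first <implementation> element.
    public static String implementationClass(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element impl = (Element) doc.getElementsByTagName("implementation").item(0);
            return impl.getAttribute("class");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the descriptor, kept inline so the example is self-contained.
        String xml = "<plugin id=\"index-myfield\">"
                + "<extension point=\"org.apache.nutch.indexer.IndexingFilter\">"
                + "<implementation id=\"MyIndexingFilter\" "
                + "class=\"org.apache.nutch.indexer.myfield.MyIndexingFilter\"/>"
                + "</extension></plugin>";
        System.out.println(implementationClass(xml));
    }
}
```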
5 Finally, put the built jar and plugin.xml into the E:\nutch\src\plugin\index-myfield folder
6 Modify conf\nutch-site.xml
<configuration>
    <property>
        <name>searcher.dir</name>
        <value>E:/crawl_2</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor|myfield)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description>Regular expression naming plugin directory names to
        include. Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin. By
        default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins. In order to use HTTPS please enable
        protocol-httpclient, but be aware of possible intermittent problems with the
        underlying commons-httpclient library.
        </description>
    </property>
</configuration>
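Since plugin.includes is a regular expression over plugin directory names, it is easy to check offline whether a given name is covered by it. The sketch below does that with `java.util.regex`, assuming the whole name must match the expression; the class `PluginIncludesCheck` is hypothetical and `index-more` is used only as an example of a name that is not listed.

```java
import java.util.regex.Pattern;

// Hypothetical check: does a plugin directory name match plugin.includes?
public class PluginIncludesCheck {

    // Value copied from the nutch-site.xml above.
    static final String PLUGIN_INCLUDES =
            "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
            + "|index-(basic|anchor|myfield)|scoring-opic"
            + "|urlnormalizer-(pass|regex|basic)";

    // Assumes a full match of the directory name against the expression.
    public static boolean isIncluded(String pluginId) {
        return Pattern.compile(PLUGIN_INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println("index-myfield: " + isIncluded("index-myfield")); // true
        System.out.println("index-more: " + isIncluded("index-more"));       // false
    }
}
```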
7 Start Nutch
8 Search in Solr
9 The field we added can now be found in the search results
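To try the search from step 8 by hand, a query on the new field can be sent to Solr as a URL. The sketch below only builds that URL; the base address `http://localhost:8983/solr`, the `/select` handler, and the class `SolrQueryUrl` are assumptions, not something stated above, and the query only returns hits if the Solr schema knows the `mt` field.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr select URL for a field:value query.
public class SolrQueryUrl {

    public static String build(String baseUrl, String field, String value) {
        try {
            // URL-encode "field:value" so the colon and non-ASCII text are safe in a URL.
            return baseUrl + "/select?q=" + URLEncoder.encode(field + ":" + value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // "\u94f6\u6cb3\u7cfb" is the fixed value 银河系 written by the filter.
        System.out.println(build("http://localhost:8983/solr", "mt", "\u94f6\u6cb3\u7cfb"));
    }
}
```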
Note: if I do not build the jar by hand and put it into the index-myfield folder, but instead only modify nutch-site.xml to add index-(basic|anchor|myfield)