Recent updates from the Nutch 0.8 mailing list

2006-09-05 01:41 cppguy

How do I add business logic to Nutch that uses regular expressions to filter the HTTP stream?

You should write a new plugin, using src/plugin/creativecommons as a template for your own.

You can start from here: http://wiki.apache.org/nutch/
About writing plugins: http://wiki.apache.org/nutch/PluginCentral
About the development environment: http://wiki.apache.org/nutch/RunNutchInEclipse
Cheers
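The regular-expression part of such a plugin can be prototyped on its own before it is wired into Nutch. The following is only a minimal sketch using java.util.regex: the class name, the method name, and the href pattern are hypothetical illustrations, and nothing here touches the Nutch plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies one regular expression
// to already-fetched page content and collects every match.
final class RegexExtractSketch {
    // Assumed example pattern: the value of each href="..." attribute.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String content) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://a.example/\">x</a> <A HREF=\"/b\">y</A>";
        System.out.println(extractLinks(page));
    }
}
```

Inside a real plugin, a method like extractLinks would be called from whichever extension point receives the page content.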

How do I filter during indexing or searching?

I want to separate out specific information from the pages I have crawled. Here is what I did:

crawl/index

I want to create an extra field at indexing time, so I need to develop a plugin that contains the extraction logic and then edit conf/nutch-site.xml to plug it in.

While developing the plugin, I need to implement these interfaces: Configurable <http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html>, IndexingFilter <http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html>, and Pluggable <http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html>.

But it does not work.

You need to write an indexing filter and a query filter.

For the former, I copied the index-more plugin and changed the names, dirs, and build files. The main change is in the filter method:

public Document filter(Document doc, Parse parse, FetcherOutput fo)

In this method you add your own fields, for example a new category field, as follows:

doc.add(new Field("category", "puppies", false, true, false));

For the meaning of the parameters, see the Document.add API.
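To make the shape of the filter method concrete, here is a minimal, self-contained sketch. Field and Document below are stand-in classes, not the real Lucene types, and the filter takes plain page text instead of the Parse and FetcherOutput arguments. The sketch assumes the five-argument Field constructor means (name, value, store, index, tokenize). The keyword rule used to pick a category is purely hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in types that mimic the calls quoted above so the
// example compiles without the Lucene or Nutch jars.
final class CategoryFilterSketch {
    static final class Field {
        final String name, value;
        final boolean store, index, tokenize; // assumed meaning of the three flags
        Field(String name, String value, boolean store, boolean index, boolean tokenize) {
            this.name = name; this.value = value;
            this.store = store; this.index = index; this.tokenize = tokenize;
        }
    }

    static final class Document {
        private final List<Field> fields = new ArrayList<>();
        void add(Field f) { fields.add(f); }
        // Returns the value of the first field with this name, or null.
        String get(String name) {
            for (Field f : fields) {
                if (f.name.equals(name)) return f.value;
            }
            return null;
        }
    }

    // Hypothetical extraction rule: pages mentioning "pupp..." get "puppies".
    static Document filter(Document doc, String pageText) {
        String category = pageText.toLowerCase().contains("pupp") ? "puppies" : "other";
        // Same flags as the call above: not stored, indexed, not tokenized.
        doc.add(new Field("category", category, false, true, false));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = filter(new Document(), "Adopt a puppy today");
        System.out.println(doc.get("category"));
    }
}
```

In the real plugin the same doc.add call would sit inside the filter method copied from index-more, with the category derived from the parsed page.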

Similarly, to create a new query filter for searching, I would copy the query-site plugin. Again, change file names, directories, and build files as needed. The main Java file is very simple: just change the string in the line with "super". Instead of:
super("site");
You would have
super("category");
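The effect of changing that one string can be illustrated with a small stand-alone sketch. FieldQueryFilter here is a simplified stand-in written for this example, not the class the query-site plugin actually extends, and the match method is a hypothetical simplification of how a "field:term" clause is routed to the filter for that field.

```java
// Sketch only: a simplified stand-in base class, not the Nutch one.
final class QueryFilterSketch {
    static class FieldQueryFilter {
        private final String field;
        FieldQueryFilter(String field) { this.field = field; }

        // Returns the term if the clause has the form "<field>:<term>",
        // otherwise null (the clause belongs to some other filter).
        String match(String clause) {
            String prefix = field + ":";
            return clause.startsWith(prefix) ? clause.substring(prefix.length()) : null;
        }
    }

    // The only change compared with a site filter is the string passed to super.
    static final class CategoryQueryFilter extends FieldQueryFilter {
        CategoryQueryFilter() { super("category"); }
    }

    public static void main(String[] args) {
        FieldQueryFilter filter = new CategoryQueryFilter();
        System.out.println(filter.match("category:puppies"));
        System.out.println(filter.match("site:example.org"));
    }
}
```

With this in place, a query clause on the category field is handled by the new filter while clauses on other fields are left alone.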

Finally, don't forget to add the new plugin to nutch-default.xml, including the copy under the WEB-INF/classes directory.
