Nutch Development (Part 1)
Development environment
- Linux, Ubuntu 20.04 LTS
- IDEA
- Nutch 1.18
- Solr 8.11
Please credit the source when reposting! By 鸭梨的药丸哥
1. Importing the Nutch Project into IDEA
To develop with Nutch it is best to download the source code as well. Get the source package from the official site.
Download URL for version 1.18: https://www.apache.org/dyn/closer.lua/nutch/1.18/apache-nutch-1.18-src.tar.gz
I downloaded the source package onto Linux, because many of Nutch's commands need to run on Linux, so for convenience I also develop the Nutch plugins on Linux.
Before building the source, make sure ant is installed. It can be installed as follows:
sudo apt-get update
sudo apt-get install ant
Build Nutch as an Eclipse project:
ant eclipse
Then import it into IDEA as an Eclipse project. There are plenty of guides for this online; just import the Nutch source project as usual and choose the Eclipse project model during the import.
2. The Nutch Source Directory Layout
The directory layout produced by building the source differs slightly from the layout of the downloaded binary package:
build/ # output generated by ant eclipse
conf/ # configuration files
docs/ # API documentation
ivy/ # files for the Ivy dependency manager
lib/ # placeholder folder for Hadoop native libraries (not downloaded automatically; the components in it speed up data (de)compression)
src/ # source code
3. Nutch Crawl Steps
A complete Nutch crawl consists of a number of steps:
- injector -> generator -> fetcher -> parseSegment -> updateCrawlDB -> Invert links -> Index -> DeleteDuplicates -> IndexMerger
1. Build the initial URL set.
2. Run inject to inject the URL set into the crawldb.
3. Run generate to create a fetch list from the crawldb.
4. Run fetch to download the page content.
5. Run parse to parse the downloaded pages.
6. Run updatedb to write the fetched page information back into the crawldb.
7. Repeat steps 3 to 6 until the configured crawl depth is reached, the "generate/fetch/update" cycle (a rough code sketch of one pass is given below).
8. Run invertlinks to update the linkdb from the contents of the segments.
9. Build the index, e.g. index the pages into Solr.
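The following sketch shows one pass through this cycle driven directly from Java, the same thing the IDEA run configurations later in this article do one step at a time. It is only an illustrative sketch, not the official Nutch driver: it assumes local (non-distributed) execution, the class name, paths and crawl depth are made up, and the driver classes used are the Nutch 1.x classes listed in section 4.

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical driver class, not part of Nutch.
public class CrawlCycleSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // Hypothetical local paths; adjust to your own layout.
    String crawldb  = "myNutch/crawldb";
    String segments = "myNutch/segments";
    String linkdb   = "myNutch/linkdb";
    String urls     = "urls";  // directory containing seeds.txt
    int depth = 2;             // hypothetical crawl depth

    // step 2: inject the seed URLs into the crawldb
    ToolRunner.run(conf, new Injector(), new String[] { crawldb, urls });

    for (int i = 0; i < depth; i++) {
      // step 3: generate a fetch list (creates a new timestamped segment)
      ToolRunner.run(conf, new Generator(), new String[] { crawldb, segments, "-topN", "100" });
      String segment = latestSegment(segments);

      // steps 4 to 6: fetch, parse, update the crawldb with the results
      ToolRunner.run(conf, new Fetcher(), new String[] { segment, "-threads", "16" });
      ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
      ToolRunner.run(conf, new CrawlDb(), new String[] { crawldb, segment });
    }

    // step 8: invert links from all segments into the linkdb
    ToolRunner.run(conf, new LinkDb(), new String[] { linkdb, "-dir", segments });
  }

  // Segment directories are named after timestamps, so the lexicographically
  // largest name is the newest segment.
  private static String latestSegment(String segmentsDir) {
    String[] names = new File(segmentsDir).list();
    Arrays.sort(names);
    return segmentsDir + "/" + names[names.length - 1];
  }
}

Indexing (step 9) is left out here because it needs a running Solr instance; sections 6 to 12 below perform exactly these calls one at a time through IDEA run configurations.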
A Nutch architecture diagram drawn by the Nutch authors: it shows an older version of the architecture, from the time before the full-text search functionality was split out of Nutch.
4. The Launcher Classes
The main launcher classes are listed below:
| Operation | Class in Nutch 1.x (i.e. trunk) | Class in Nutch 2.x |
| --- | --- | --- |
| inject | org.apache.nutch.crawl.Injector | org.apache.nutch.crawl.InjectorJob |
| generate | org.apache.nutch.crawl.Generator | org.apache.nutch.crawl.GeneratorJob |
| fetch | org.apache.nutch.fetcher.Fetcher | org.apache.nutch.fetcher.FetcherJob |
| parse | org.apache.nutch.parse.ParseSegment | org.apache.nutch.parse.ParserJob |
| updatedb | org.apache.nutch.crawl.CrawlDb | org.apache.nutch.crawl.DbUpdaterJob |
| invertlinks | org.apache.nutch.crawl.LinkDb | ??? |
5. The Nutch Shell Script
Looking at Nutch's shell script, you can see that in the end it simply calls the corresponding launcher class to do the work.
The excerpt below shows how each COMMAND is mapped to a launcher class and how the remaining command-line arguments are then handed to that class; a command that matches nothing falls through to the final else branch and is treated as a fully qualified class name.
# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
echo "Command $COMMAND is deprecated, please use bin/crawl instead"
exit -1
elif [ "$COMMAND" = "inject" ] ; then
CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "dump" ] ; then
CLASS=org.apache.nutch.tools.FileDumper
elif [ "$COMMAND" = "commoncrawldump" ] ; then
CLASS=org.apache.nutch.tools.CommonCrawlDataDumper
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
shift
elif [ "$COMMAND" = "index" ] ; then
CLASS=org.apache.nutch.indexer.IndexingJob
elif [ "$COMMAND" = "solrdedup" ] ; then
echo "Command $COMMAND is deprecated, please use dedup instead"
exit -1
elif [ "$COMMAND" = "dedup" ] ; then
CLASS=org.apache.nutch.crawl.DeduplicationJob
elif [ "$COMMAND" = "solrclean" ] ; then
CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
shift; shift
elif [ "$COMMAND" = "clean" ] ; then
CLASS=org.apache.nutch.indexer.CleaningJob
elif [ "$COMMAND" = "parsechecker" ] ; then
CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "filterchecker" ] ; then
CLASS=org.apache.nutch.net.URLFilterChecker
elif [ "$COMMAND" = "normalizerchecker" ] ; then
CLASS=org.apache.nutch.net.URLNormalizerChecker
elif [ "$COMMAND" = "domainstats" ] ; then
CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "protocolstats" ] ; then
CLASS=org.apache.nutch.util.ProtocolStatusStatistics
elif [ "$COMMAND" = "crawlcomplete" ] ; then
CLASS=org.apache.nutch.util.CrawlCompletionStats
elif [ "$COMMAND" = "webgraph" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
CLASSPATH="$CLASSPATH:$NUTCH_HOME/test/classes/"
if $local; then
for f in "$NUTCH_HOME"/test/lib/*.jar; do
CLASSPATH="${CLASSPATH}:$f";
done
fi
CLASS=org.junit.runner.JUnitCore
elif [ "$COMMAND" = "startserver" ] ; then
CLASS=org.apache.nutch.service.NutchServer
elif [ "$COMMAND" = "webapp" ] ; then
CLASS=org.apache.nutch.webui.NutchUiServer
elif [ "$COMMAND" = "warc" ] ; then
CLASS=org.apache.nutch.tools.warc.WARCExporter
elif [ "$COMMAND" = "updatehostdb" ] ; then
CLASS=org.apache.nutch.hostdb.UpdateHostDb
elif [ "$COMMAND" = "readhostdb" ] ; then
CLASS=org.apache.nutch.hostdb.ReadHostDb
elif [ "$COMMAND" = "sitemap" ] ; then
CLASS=org.apache.nutch.util.SitemapProcessor
elif [ "$COMMAND" = "showproperties" ] ; then
CLASS=org.apache.nutch.tools.ShowProperties
else
CLASS=$COMMAND
fi
6. Running the Injector
The main function for inject is in the Injector class of the org.apache.nutch.crawl package.
6.1 Configuration
To run inject, first add a plugin.folders property to apache-nutch-1.18/conf/nutch-site.xml to override the default relative path, because the working directory used by the nutch script is not the same as the one used when running directly from the source.
<property>
<name>plugin.folders</name>
<value>/home/liangwy/IdeaProjects/apache-nutch-1.18/src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
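If you want to confirm that this override is actually picked up when running from the IDE, a tiny throwaway class like the following helps (the class name is arbitrary). NutchConfiguration.create() loads nutch-default.xml and nutch-site.xml from the classpath, the same way the real tools do.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Throwaway helper for checking the configuration, not part of Nutch.
public class PluginFoldersCheck {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // Should print the absolute path configured above if conf/ is on the classpath.
    System.out.println("plugin.folders = " + conf.get("plugin.folders"));
  }
}

If it still prints a relative value, the conf/ directory is probably not on the run configuration's classpath.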
6.2 Create a URL list
mkdir urls
touch urls/seeds.txt
vim urls/seeds.txt
# then add the seed URLs for the first crawl round, e.g. https://nutch.apache.org/
6.3 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: Injector
- Main Class: org.apache.nutch.crawl.Injector (the main class in 1.x; check the source for the exact name, in 2.x it is called InjectorJob)
- VM options: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
- Program arguments: /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/User/apache-nutch-1.18/urls (the second argument is the directory containing the seed file seeds.txt)
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
6.4 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch inject /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls
7. A Look at the Injector Main Function
The main function of Injector looks like this:
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
System.exit(res);
}
Injector is launched through ToolRunner. If you open ToolRunner's run function, you can see that what actually ends up being called is Injector's own run function.
Parameters of ToolRunner.run:
- Configuration conf # the Nutch configuration
- Tool tool # the tool to run (e.g. Injector, Generator)
- String[] args # command-line arguments passed on to the tool
public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
if (CallerContext.getCurrent() == null) {
CallerContext ctx = new CallerContext.Builder("CLI").build();
CallerContext.setCurrent(ctx);
}
if (conf == null) {
conf = new Configuration();
}
// parse the generic options into the configuration
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
tool.setConf(conf);
String[] toolArgs = parser.getRemainingArgs();
// the actual work is still done by the tool's own run()
return tool.run(toolArgs);
}
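A small standalone example makes the parsing step concrete (the class name, URL and paths here are made up for illustration). It uses the same kind of arguments the shell script builds, e.g. the "-D solr.server.url=..." that the solrindex command prepends: GenericOptionsParser copies the -D pairs into the Configuration and strips them, so only the remaining arguments ever reach the tool's run().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.util.NutchConfiguration;

// Illustrative only, not part of Nutch.
public class GenericOptionsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // Hypothetical command line: one generic "-D" option plus two tool arguments.
    String[] cli = {
        "-D", "solr.server.url=http://localhost:8983/solr/nutch",
        "myNutch/crawldb", "urls"
    };

    GenericOptionsParser parser = new GenericOptionsParser(conf, cli);

    // The -D pair has been moved into the configuration...
    System.out.println("solr.server.url = " + conf.get("solr.server.url"));

    // ...and only the tool-specific arguments are left over; these are what
    // Injector.run() (or any other tool) would actually receive.
    System.out.println("remaining args = " + String.join(" ", parser.getRemainingArgs()));
  }
}

This is also why the solrindex branch of the shell script can smuggle a configuration property in front of the real arguments without the tool ever seeing it.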
8. Running the Generator
8.1 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: Generator
- Main Class: org.apache.nutch.crawl.Generator
- Program arguments: /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
8.2 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch generate /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
9. Running the Fetcher
9.1 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: Fetcher
- Main Class: org.apache.nutch.fetcher.Fetcher
- Program arguments: /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
9.2 Error analysis
If http.agent.name is not configured, the run fails with the error below; the property can be set in conf/nutch-site.xml.
Fetcher: No agents listed in 'http.agent.name' property.
Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:563)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:431)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:545)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:518)
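The check that produces this error lives in Fetcher.checkConfiguration(). Roughly speaking, it reads http.agent.name from the configuration and refuses to run if the value is empty. The snippet below is a simplified reproduction of that behaviour for illustration, not the actual Nutch source; the class name is arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Simplified reproduction of the agent-name check, not the real Nutch code.
public class AgentNameCheck {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    String agent = conf.get("http.agent.name");
    if (agent == null || agent.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "Fetcher: No agents listed in 'http.agent.name' property.");
    }
    System.out.println("http.agent.name = " + agent);
  }
}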
9.3 Configuring http.agent.name
Add the following configuration to conf/nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43,*</value>
</property>
9.4 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch fetch /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
10. Running ParseSegment
10.1 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: ParseSegment
- Main Class: org.apache.nutch.parse.ParseSegment
- Program arguments: /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
10.2 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch parse /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
11. Running CrawlDb
11.1 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: CrawlDb
- Main Class: org.apache.nutch.crawl.CrawlDb
- Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
11.2 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch updatedb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
12. Running LinkDb
12.1 Create a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:
- Name: LinkDb
- Main Class: org.apache.nutch.crawl.LinkDb
- Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Note: the Program arguments are exactly the arguments you would pass to the nutch script.
12.2 Equivalent nutch command
This is equivalent to running the nutch command from the bin/ directory of the binary distribution:
./nutch invertlinks /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Next chapter
The next chapter will show how to tie all of these steps together.