Nutch 2.x: A Demo of Crawling a First Website
http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/?utm_source=tuicool&utm_medium=referral
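The walkthrough below is based on a build of Nutch 2.2.1 that I compiled and configured myself.
After the build, the bin directory contains two scripts, nutch and crawl. Running either one on the command line without arguments prints its usage: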
$ nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  elasticindex   run the elasticsearch indexer
  solrindex      run the solr indexer on parsed batches
  solrdedup      remove duplicates from solr
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  junit          runs the given JUnit test
 or
  CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
$ crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
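In Nutch 2.x the commands that make up the crawl workflow have been consolidated into the crawl script: a single crawl command runs the whole workflow, so you no longer have to run inject, generate, fetch, parse and so on one after another as in older versions. Beginners can still run the individual commands in order and watch closely how the data changes after each step. The rest of this post walks through that step-by-step process, using my own blog as the site to crawl.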
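As a rough sketch of what running the steps by hand involves, the snippet below only builds and prints the command lines for one round; it does not execute anything. The command forms and flags are the ones used in the steps of this post. The function name step_commands and its defaults are made up for illustration, and <batchId> stands for the batch id that generate prints.

```python
def step_commands(crawl_id, batch_id, seed_dir="urls", top_n=5, threads=10):
    """Return the argv lists for one manual crawl round, in execution order.

    Hypothetical helper: it mirrors the commands shown in this post and
    is not part of Nutch.
    """
    return [
        ["nutch", "inject", seed_dir, "-crawlId", crawl_id],
        ["nutch", "generate", "-topN", str(top_n), "-crawlId", crawl_id],
        ["nutch", "fetch", batch_id, "-crawlId", crawl_id,
         "-threads", str(threads)],
        ["nutch", "parse", batch_id, "-crawlId", crawl_id],
        ["nutch", "updatedb", "-crawlId", crawl_id],
    ]


# Print the round for the crawl id used in this post. The batch id is only
# known after generate has run, so a placeholder is used here.
for argv in step_commands("micmiublog", "<batchId>"):
    print(" ".join(argv))
```

Keeping each command as an argv list means it could later be handed to subprocess.run unchanged, once the real batch id is known.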
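[Preparation]: create the URLs to crawl
- First start HBase (this post runs it in standalone mode)
- mkdir -p urls
- cd urls
- touch seed.txt
- echo 'http://micmiu.com' > seed.txt
- cd .. (the commands below refer to the urls directory by a relative path)
After each of the following steps you can inspect how the data in HBase has changed.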
[Step 1]: inject
$ nutch inject urls -crawlId micmiublog
InjectorJob: starting at 2015-01-12 09:42:46
InjectorJob: Injecting urlDir: urls
2015-01-12 09:42:47.096 java[14509:4735452] Unable to load realm info from SCDynamicStore
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Check the data in HBase:
hbase(main):016:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.1010 seconds
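The row key com.micmiu:http/ is the seed URL with its host name reversed, which keeps pages of the same domain next to each other in the table. As far as I know Nutch 2.x builds these keys in its TableUtil helper. The function below is an independent sketch of the inverse mapping written for this walkthrough, not Nutch's code; it assumes the key layout reversed.host:protocol[:port]/path, and the name unreverse_key is hypothetical.

```python
def unreverse_key(key):
    """Turn a reversed-host row key back into a URL.

    Assumed key layout: reversed.host:protocol[:port]/path
    """
    reversed_host, rest = key.split(":", 1)

    # Everything before the first "/" is "protocol" or "protocol:port".
    slash = rest.find("/")
    if slash == -1:
        head, path = rest, ""
    else:
        head, path = rest[:slash], rest[slash:]
    protocol, _, port = head.partition(":")

    # Reverse the dot-separated host components: com.micmiu -> micmiu.com
    host = ".".join(reversed(reversed_host.split(".")))

    url = f"{protocol}://{host}"
    if port:
        url += f":{port}"
    return url + path


print(unreverse_key("com.micmiu:http/"))
```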
[Step 2]: generate
$ nutch generate -topN 5 -crawlId micmiublog
GeneratorJob: starting at 2015-01-12 09:47:09
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
2015-01-12 09:47:09.822 java[14533:4744993] Unable to load realm info from SCDynamicStore
GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1421027229-1374349927
Check the data in HBase:
hbase(main):018:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0580 seconds
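Comparing two scan dumps by eye gets tedious as the row grows. The sketch below is one possible way to list the columns a step added, assuming the one-cell-per-line layout that the HBase shell prints in this post. The two dumps are abridged copies of the scans after inject and after generate, and the helper name cells is made up for illustration.

```python
import re

# One cell per line, as printed by the HBase shell `scan` command.
CELL = re.compile(r"^(\S+)\s+column=(\S+), timestamp=(\d+), value=(.*)$")


def cells(scan_text):
    """Map (row, column) -> value for every cell line in a scan dump."""
    result = {}
    for line in scan_text.splitlines():
        match = CELL.match(line.strip())
        if match:
            row, column, _timestamp, value = match.groups()
            result[(row, column)] = value
    return result


# Abridged scan output copied from the inject and generate steps.
after_inject = r"""
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
"""
after_generate = r"""
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
"""

before, after = cells(after_inject), cells(after_generate)
added = sorted(key for key in after if key not in before)
for row, column in added:
    print(f"+ {row} {column} = {after[(row, column)]}")
```

For these two dumps the only new columns are f:bid and mk:_gnmrk_, both holding the batch id that generate reported.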
[Step 3]: fetch
Note: the batch id that GeneratorJob printed in the previous step's log is the value to pass as the batchId argument of the commands below.
It can also be read back from HBase:
hbase(main):025:0> get 'micmiublog_webpage','com.micmiu:http/',{COLUMNS => 'f:bid'}
COLUMN CELL
f:bid timestamp=1421027232815, value=1421027229-1374349927
1 row(s) in 0.0060 seconds
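When scripting the steps, the batch id can be picked out of the generate log instead of being copied by hand. This is a minimal sketch that assumes the log line format shown in the generate step; extract_batch_id is a hypothetical helper. The part of the id before the dash looks like a Unix timestamp in seconds, which is an inference from the value rather than something the log states.

```python
import re
from datetime import datetime, timezone


def extract_batch_id(log_text):
    """Return the batch id from GeneratorJob output, or None if absent."""
    match = re.search(r"generated batch id:\s*(\S+)", log_text)
    return match.group(1) if match else None


# Last two lines of the generate log from this post.
log = """GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1421027229-1374349927"""

batch_id = extract_batch_id(log)
print(batch_id)

# Assumption: the prefix is epoch seconds taken when generate started.
seconds = int(batch_id.split("-")[0])
print(datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat())
```

The decoded time is 01:47:09 UTC, which lines up with the 09:47:09 start time in the log if the machine's clock was set to UTC+8.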
Now run the fetch command:
$ nutch fetch 1421027229-1374349927 -crawlId micmiublog -threads 10
FetcherJob: starting
FetcherJob: batchId: 1421027229-1374349927
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
2015-01-12 09:49:37.095 java[14546:4753667] Unable to load realm info from SCDynamicStore
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://micmiu.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Check the data in HBase:
hbase(main):019:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0980 seconds
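Several cells in the scan hold binary values that the HBase shell prints with \xNN escapes mixed with plain characters. The sketch below decodes three of them, assuming they are big-endian 4-byte integers and floats (an assumption based on their length). The status names are my reading of the CrawlStatus constants in Nutch 2.x and should be checked against the version in use; hbase_bytes is a hypothetical helper.

```python
import re
import struct


def hbase_bytes(text):
    """Turn the HBase shell's printable form (\\xNN escapes) back into bytes."""
    return re.sub(
        rb"\\x([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("ascii"),
    )


# Assumed status codes, taken from my reading of CrawlStatus in Nutch 2.x.
STATUS = {1: "unfetched", 2: "fetched", 3: "gone",
          4: "redir_temp", 5: "redir_perm"}

# Values copied from the scan output after fetch.
fetch_interval = struct.unpack(">i", hbase_bytes(r"\x00'\x8D\x00"))[0]  # f:fi
status = struct.unpack(">i", hbase_bytes(r"\x00\x00\x00\x05"))[0]       # f:st
score = struct.unpack(">f", hbase_bytes(r"?\x80\x00\x00"))[0]           # s:s

print(fetch_interval, fetch_interval // 86400)  # seconds, whole days
print(status, STATUS.get(status, "unknown"))
print(score)
```

Under these assumptions f:fi is a 30-day fetch interval, and a permanent-redirect status would fit the Location header and the ol: outlink pointing at http://www.micmiu.com/.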
[Step 4]: parse
$ nutch parse 1421027229-1374349927 -crawlId micmiublog
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1421027229-1374349927
2015-01-12 09:50:03.525 java[14559:4756783] Unable to load realm info from SCDynamicStore
Parsing http://micmiu.com/
http://micmiu.com/ skipped. Content of size 20 was truncated to 0
ParserJob: success
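One possible reading of the "Content of size 20 was truncated to 0" message: the response headers stored in HBase show Content-Encoding: gzip and Content-Length: 20, while f:cnt is empty. Twenty bytes happens to be the size of a gzip stream with an empty payload, which would fit a redirect response that carries no body. The sketch below only demonstrates that size; it does not reproduce Nutch's own check.

```python
import gzip

# Compress an empty body, as a server might for a redirect with no content.
empty_body = gzip.compress(b"")

print(len(empty_body))                   # size on the wire
print(len(gzip.decompress(empty_body)))  # size after decoding
```

If that reading is right, the parser compared the declared length of 20 with the 0 bytes actually stored and skipped the page, which is harmless here because the real content lives at the redirect target.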
Check the data in HBase:
hbase(main):020:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0690 seconds
[Step 5]: updatedb
$ nutch updatedb -crawlId micmiublog
DbUpdaterJob: starting
2015-01-12 09:50:47.662 java[14572:4762452] Unable to load realm info from SCDynamicStore
DbUpdaterJob: done
Check the data in HBase:
hbase(main):021:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu.www:http/ column=f:fi, timestamp=1421027452042, value=\x00'\x8D\x00
com.micmiu.www:http/ column=f:st, timestamp=1421027452042, value=\x00\x00\x00\x01
com.micmiu.www:http/ column=f:ts, timestamp=1421027452042, value=\x00\x00\x01J\xDB\xD6$f
com.micmiu.www:http/ column=mk:dist, timestamp=1421027452042, value=1
com.micmiu.www:http/ column=mtdt:_csh_, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu.www:http/ column=s:s, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027452042, value=\x00\x00\x01KvS\xDF%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
2 row(s) in 0.1140 seconds
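Besides adding the new row for http://www.micmiu.com/, updatedb also rewrote f:ts on the original row. Decoding the value before and after, under the assumption that f:ts is an 8-byte big-endian count of milliseconds since the Unix epoch, suggests that the new value is the previous one plus the fetch interval from f:fi. The helper names below are hypothetical.

```python
import re
from datetime import datetime, timezone


def hbase_long(text):
    """Decode an 8-byte big-endian value printed by the HBase shell."""
    raw = re.sub(
        rb"\\x([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("ascii"),
    )
    return int.from_bytes(raw, "big")


def utc(ms):
    """Format epoch milliseconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


# f:ts of com.micmiu:http/ copied from the scans in this post.
ts_after_fetch = hbase_long(r"\x00\x00\x01J\xDB\xD5\x17%")  # before updatedb
ts_after_update = hbase_long(r"\x00\x00\x01KvS\xDF%")       # after updatedb
fetch_interval_s = 2592000                                  # f:fi, \x00'\x8D\x00

print(utc(ts_after_fetch))
print(utc(ts_after_update))
print((ts_after_update - ts_after_fetch) // 1000 == fetch_interval_s)
```

In other words, after updatedb the field appears to hold the time of the next scheduled fetch, 30 days after this one.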
—————– EOF @Michael Sun —————–
Original article. When reposting, please credit: reposted from micmiu (software development + bits of life) [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/