About Nutch configuration (repost)

Nutch is an open-source search engine that covers crawling, indexing, and searching, though its main focus is crawling. Below is a quick walkthrough of its basic usage.

First, download the latest Nutch release (1.0 at the time of writing), or the source code directly, and unpack it. Then open $NUTCH_HOME/conf/nutch-site.xml (NUTCH_HOME is the directory Nutch was unpacked into). nutch-site.xml is the file you should edit; do not modify nutch-default.xml directly, since that file holds Nutch's default configuration, and anything set in nutch-site.xml overrides the corresponding entry in nutch-default.xml (see the notes on how Nutch loads its configuration files for details). You can of course edit nutch-default.xml, but Nutch officially discourages it. Add the following between <configuration> and </configuration>:

<property>
<name>http.agent.name</name>
<value>spider</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.robots.agents</name>
<value>spider,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

The http.agent.name field is the name of your crawler (earlier versions let you leave it empty, as I recall, but current versions report an error if you do). The http.robots.agents field may also be left unset, but then Nutch keeps printing this warning while fetching:

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

which gets annoying; if the noise doesn't bother you, you can skip it.
Next, open $NUTCH_HOME/conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the domain you want to crawl, for example apache.org (see the sketch below).
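
A rough sketch of the edit, assuming the stock URL-filter line shipped with Nutch 1.0:

# before, in $NUTCH_HOME/conf/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# after, restricting the crawl to apache.org
+^http://([a-z0-9]*\.)*apache.org/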

With the configuration done, you are ready to crawl. Before you start, create a file listing the URLs to crawl, e.g. a file containing http://lucene.apache.org/nutch/, and put that file in a directory named urls. Nutch crawls the URLs found in all files under a directory; it cannot be pointed at a single file (a consequence of its Hadoop-based distributed design). A sketch of setting up the seed directory follows.
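
A minimal shell sketch of setting up the seed directory (the file name seed.txt is just an example; Nutch reads every file under the directory):

mkdir urls
# one URL per line; any file name under urls/ works
echo 'http://lucene.apache.org/nutch/' > urls/seed.txt

With the seeds in place, crawling is a single command: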

$NUTCH_HOME/bin/nutch crawl urls -dir crawl -depth 2
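
For reference, Nutch 1.0's crawl driver also accepts -threads (number of fetcher threads) and -topN (a cap on the URLs generated per round); a variant of the command above using both might look like this:

$NUTCH_HOME/bin/nutch crawl urls -dir crawl -depth 2 -threads 10 -topN 50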

Here urls is the directory of seed URLs, crawl is the output directory (optional; the default is "crawl-" followed by the current date and time), and depth is the crawl depth (default 5). The output looks like this:

ahei@ubuntu3:~/nutch-1.0/bin$ ./nutch crawl urls -dir crawl -depth 2
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170222
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170222
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching http://lucene.apache.org/nutch/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20091126170222]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170233
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170233
Fetcher: threads: 10
QueueFeeder finished: total 38 records.
fetching http://wiki.apache.org/nutch/
fetching http://issues.apache.org/jira/browse/Nutch
fetching http://lucene.apache.org/nutch/tutorial.html
-activeThreads=10, spinWaiting=7, fetchQueues.totalSize=35
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=35
fetching http://lucene.apache.org/nutch/skin/breadcrumbs.js
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=34
Error parsing: http://lucene.apache.org/nutch/skin/breadcrumbs.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/nutch/skin/breadcrumbs.js
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=34
fetching http://lucene.apache.org/nutch/version_control.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=33
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=33
fetching http://wiki.apache.org/nutch/FAQ
fetching http://lucene.apache.org/nutch/apidocs-0.8.x/index.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=31
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=31
fetching http://lucene.apache.org/hadoop/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=30
fetching http://forrest.apache.org/
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
fetching http://lucene.apache.org/nutch/apidocs-0.9/index.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
fetching http://lucene.apache.org/nutch/credits.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
fetching http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
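
The log above is cut off before the later link-inversion and indexing phases. If a full run completes, the output directory of a Nutch 1.0 crawl typically ends up with a layout like this (a sketch, not captured from the run above):

$ ls crawl
crawldb  index  indexes  linkdb  segments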

How do you verify the crawled data? Use this command:

$NUTCH_HOME/bin/nutch org.apache.nutch.searcher.NutchBean apache

This command prints the search results for the query "apache". By default it searches the crawl directory, as this piece of the source shows:

File: $NUTCH_HOME/src/java/org/apache/nutch/searcher/NutchBean.java:87
public NutchBean(Configuration conf, Path dir) throws IOException {
  this.conf = conf;
  this.fs = FileSystem.get(this.conf);
  if (dir == null) {
    // fall back to "crawl" when searcher.dir is not set
    dir = new Path(this.conf.get("searcher.dir", "crawl"));
  }
  // ... (rest of the constructor omitted)

To search a different directory, add the following to nutch-site.xml:

<property>
<name>searcher.dir</name>
<value>other-searcher-dir</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>

The search results look like this:

ahei@ubuntu3:~/nutch-1.0/bin$ ./nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 25
0 20091126170222/http://lucene.apache.org/nutch/
... Lucene. January 2005: Nutch Joins Apache Incubator Nutch is a ... determined that the Apache license is the appropriate
1 20091126170233/http://www.apache.org/
... including Apache XML, Apache Jakarta, Apache Cocoon, Apache Xerces, Apache Ant, and Apache ... Source projects such as NoSQL, Apache ...
2 20091126170233/http://www.apache.org/licenses/
... Copyright © 2009 The Apache Software Foundation, Licensed under the ... Apache License, Version 2.0 . Apache ... Apache and the ...
3 20091126170233/http://forrest.apache.org/
... Welcome to Apache Forrest apache > forrest Welcome Developers Versioned Docs ... Example sites Thanks Related projects Apache Gump Apache ...
4 20091126170233/http://lucene.apache.org/
... the release of Apache Mahout 0.1. Apache Mahout is a subproject ... on top of ...
5 20091126170233/http://wiki.apache.org/nutch/
FrontPage - Nutch Wiki Search: Nutch Wiki Login FrontPage FrontPage RecentChanges FindPage HelpContents Immutable Page Comments Info Attachments More Actions: ...
6 20091126170233/http://lucene.apache.org/nutch/index.html
... Lucene. January 2005: Nutch Joins Apache Incubator Nutch is a ... determined that the Apache license is the appropriate
7 20091126170233/http://wiki.apache.org/nutch/FAQ
... all available at http://lucene.apache.org/nutch/mailing_lists.html . How ...
8 20091126170233/http://lucene.apache.org/nutch/tutorial8.html
... http://([a-z0-9]*\.)*apache.org/ This will include any ... in the domain apache.org . Edit the file ...
9 20091126170233/http://lucene.apache.org/nutch/tutorial.html
... crawl to the apache.org domain, the line ... http://([a-z0-9]*\.)*apache.org/ This will include any



Reposted from: http://1985wanggang.blog.163.com/blog/static/77638332010102315423666/