Installing and Deploying Nutch 1.7 on CentOS 6.4

1. Configure SSH

Consult the relevant documentation for this step.

2. Install the JDK and configure the Java environment

Consult the relevant documentation for this step.

3. Install SVN

[root@master ~]# yum install -y subversion

Check out the Nutch source code with SVN. This creates a release-1.7 directory under the current directory.

[root@master ~]# svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.7/

4. Install Ant and configure the Ant environment

Consult the relevant documentation for this step.
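
Before moving on, it may help to confirm that the tools from steps 2 to 4 are actually on the PATH. The following is only a minimal sketch: the function name missing_tools is made up for illustration and is not part of the JDK, SVN, or Ant.

```shell
#!/bin/sh
# missing_tools CMD...
# Hypothetical helper: prints the name of every given command that
# cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the command cannot be found
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling `missing_tools java javac svn ant` should print nothing once everything is installed; any name it prints still needs to be installed or added to PATH.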

5. Add the 'http.agent.name' property to the ~/release-1.7/conf/nutch-site.xml configuration file

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.3; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

  and set their values appropriately.

  </description>
</property>
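
The property above has to sit inside the root <configuration> element of nutch-site.xml. As a rough sketch of what the complete file could look like, the hypothetical helper below (write_nutch_site is an invented name) writes a minimal file containing only this one property. It overwrites any existing nutch-site.xml in the target directory and assumes the agent name contains no XML special characters such as & or <.

```shell
#!/bin/sh
# write_nutch_site CONF_DIR AGENT_NAME
# Hypothetical helper: writes a minimal nutch-site.xml that contains only
# the http.agent.name property. OVERWRITES an existing file in CONF_DIR.
# Assumption: AGENT_NAME contains no XML special characters (&, <, >).
write_nutch_site() {
  conf_dir=$1
  agent_name=$2
  # The property must not be empty, so refuse an empty agent name
  [ -n "$agent_name" ] || { echo "agent name must not be empty" >&2; return 1; }
  cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
</configuration>
EOF
}
```

For example, `write_nutch_site ~/release-1.7/conf "MyCrawler"` would write the file for a crawler named MyCrawler, where MyCrawler is only a placeholder value.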
 

6. Change into the Nutch directory and run ant to build the Nutch source code

[root@master release-1.7]# ant
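After the Ant build finishes, a runtime directory is created. It contains two subdirectories, deploy and local, which correspond to Nutch's two run modes (deploy for running on a Hadoop cluster, local for running on a single machine).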
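
To check that the build produced the layout described above before continuing, something like the following could be used. It is a sketch only: check_runtime is a hypothetical helper name, and it tests just the directories and the bin/nutch script that the remaining steps rely on.

```shell
#!/bin/sh
# check_runtime NUTCH_HOME
# Hypothetical helper: succeeds only if the build output exists.
check_runtime() {
  nutch_home=$1
  # Each run mode gets its own directory under runtime/
  [ -d "$nutch_home/runtime/local" ] || { echo "missing runtime/local" >&2; return 1; }
  [ -d "$nutch_home/runtime/deploy" ] || { echo "missing runtime/deploy" >&2; return 1; }
  # The following steps run bin/nutch from runtime/local
  [ -x "$nutch_home/runtime/local/bin/nutch" ] || { echo "missing runtime/local/bin/nutch" >&2; return 1; }
  echo "runtime looks complete"
}
```

With the checkout location used in this post, the call would be `check_runtime ~/release-1.7`.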

7. Create a urls directory inside the local directory (runtime/local)

[root@master local]# mkdir urls

8. Create a file named url in the urls directory with the vi editor

[root@master local]# vi urls/url

9. Add the URLs to crawl to the url file, one per line

http://www.leezhen.net/
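
Steps 7 to 9 can also be done without an interactive editor. The sketch below wraps them in a hypothetical helper named write_seeds (an invented name), which creates the directory and writes one URL per line to a file called url inside it.

```shell
#!/bin/sh
# write_seeds DIR URL...
# Hypothetical helper: creates DIR and writes each URL on its own line
# to DIR/url, replacing any previous contents of that file.
write_seeds() {
  seed_dir=$1
  shift
  [ "$#" -gt 0 ] || { echo "need at least one URL" >&2; return 1; }
  mkdir -p "$seed_dir" || return 1
  # printf repeats the format string for every remaining argument
  printf '%s\n' "$@" > "$seed_dir/url"
}
```

Run from the local directory, `write_seeds urls http://www.leezhen.net/` gives the same result as the manual steps above.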

10. Start the crawl

[root@master local]# nohup bin/nutch crawl urls -dir data -depth 3 -threads 100 &
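
Because the crawl runs in the background under nohup, its console output normally ends up in nohup.out in the current directory, and the crawl data is written under the directory given with -dir (data here). The sketch below is one way to peek at progress. crawl_summary is a hypothetical helper name, and it assumes the usual Nutch 1.x layout of crawldb, linkdb, and segments subdirectories, with one subdirectory under segments per fetch round.

```shell
#!/bin/sh
# crawl_summary DATA_DIR
# Hypothetical helper: reports which crawl output directories exist and
# how many segments have been written so far.
crawl_summary() {
  data_dir=$1
  for sub in crawldb linkdb segments; do
    if [ -d "$data_dir/$sub" ]; then
      echo "$sub: present"
    else
      echo "$sub: missing"
    fi
  done
  # Assumption: each fetch round adds one subdirectory under segments/
  count=0
  for seg in "$data_dir"/segments/*/; do
    if [ -d "$seg" ]; then
      count=$((count + 1))
    fi
  done
  echo "segment count: $count"
}
```

Calling `crawl_summary data` from the local directory prints one line per directory plus the segment count; linkdb may only show up once the crawl is close to finishing. For more detail, `tail -f nohup.out` follows the log, and `bin/nutch readdb data/crawldb -stats` prints statistics for the crawl database.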

Reference: http://wiki.apache.org/nutch/NutchTutorial
posted @ 2014-01-02 22:06  LeeZhen