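Installing and Deploying Nutch 1.7 on CentOS 6.4
1. Configure SSH
Consult the relevant documentation.
2. Install the JDK and configure the Java environment
Consult the relevant documentation.
3. Install SVN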
[root@master ~]# yum install -y subversion
Check out the Nutch source code through SVN
[root@master ~]# svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.7/
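This guide relies on three command-line tools: java, svn and ant. As a quick sanity check once they are installed, something along these lines could report which ones are still missing from the PATH. The function name missing_tools is a hypothetical helper introduced here for illustration; it is not part of Nutch or CentOS.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that
# cannot be found on PATH, one per line. Returns non-zero if any is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call for the tools used in this guide:
#   missing_tools java svn ant || echo "install the tools listed above first"
```

No output and a zero exit status mean that all the named commands were found.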
4. Install Ant and configure the Ant environment
Consult the relevant documentation.
5. Add the 'http.agent.name' property to the ~/release-1.7/conf/nutch-site.xml configuration file
<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.3; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
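Since the description above says the value must not be empty, it may be worth confirming the setting before building. The following is a rough sketch that pulls the value out with tr and sed. It is not a real XML parser and assumes the name and value elements appear in that order with only whitespace between them; agent_name is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows
# <name>http.agent.name</name> in the file given as $1.
# Prints an empty line if the value is empty, nothing if the property is absent.
agent_name() {
    # Flatten the file to one line so the pattern can span line breaks.
    tr '\n' ' ' < "$1" |
        sed -n 's|.*<name>http\.agent\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p'
}

# Example call against the file edited in this step:
#   [ -n "$(agent_name ~/release-1.7/conf/nutch-site.xml)" ] || echo "http.agent.name is not set"
```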
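6. Change into the Nutch directory and run ant to compile the Nutch source code
[root@master release-1.7]# ant
When the Ant build finishes it creates a runtime directory containing two subdirectories, deploy and local, one for each of Nutch's two run modes: local runs Nutch on a single machine, and deploy runs it as a job on a Hadoop cluster.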
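To make the directory layout concrete, here is a small sketch that maps a mode name to its runtime directory and checks that the build actually produced a bin/nutch launcher there. The function name nutch_runtime_dir and its exit codes are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the runtime directory for a run mode.
#   $1: Nutch source root (where ant was run), $2: "local" or "deploy"
# Returns 2 for an unknown mode, 1 if bin/nutch is missing or not executable.
nutch_runtime_dir() {
    case "$2" in
        local|deploy) ;;
        *) echo "unknown mode: $2" >&2; return 2 ;;
    esac
    dir="$1/runtime/$2"
    if [ ! -x "$dir/bin/nutch" ]; then
        echo "no executable bin/nutch under $dir (has ant been run?)" >&2
        return 1
    fi
    echo "$dir"
}

# Example call for the single-machine mode used in the following steps:
#   cd "$(nutch_runtime_dir ~/release-1.7 local)"
```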
7. Create a urls directory inside the local directory
[root@master local]# mkdir urls
8. Create a file named url in the urls directory with the vi editor
[root@master local]# vi urls/url
9. Add the URLs to crawl to the url file, one per line
http://www.leezhen.net/
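Steps 7 to 9 can also be done without an interactive editor, which is handy when the setup is scripted. A minimal sketch follows; write_seeds is a hypothetical helper name, and it overwrites any existing urls/url file.

```shell
#!/bin/sh
# Hypothetical helper: create <base>/urls/url with one seed URL per line.
#   $1: the runtime/local directory; remaining arguments: seed URLs
# Returns 2 if no URL is given. Overwrites an existing urls/url file.
write_seeds() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_seeds <base-dir> <url>..." >&2
        return 2
    fi
    base="$1"
    shift
    mkdir -p "$base/urls" || return 1
    printf '%s\n' "$@" > "$base/urls/url"
}

# Example call from inside runtime/local, using the seed from this guide:
#   write_seeds . http://www.leezhen.net/
```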
10. Start the crawl
[root@master local]# nohup bin/nutch crawl urls -dir data -depth 3 -threads 100 &
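In the command above, urls is the seed directory, -dir names the directory the crawl is written to, -depth is the link depth to follow from the seed pages, and -threads is the number of parallel fetch threads. Because the job runs in the background, a small check like the one below may help to see how far it has got. It lists which of the usual output directories (crawldb, linkdb, segments) already exist; crawl_parts is a hypothetical helper name, and linkdb may only appear late in the run.

```shell
#!/bin/sh
# Hypothetical helper: print which of the usual crawl output directories
# exist under the crawl directory given as $1, one name per line.
crawl_parts() {
    for part in crawldb linkdb segments; do
        if [ -d "$1/$part" ]; then
            echo "$part"
        fi
    done
    return 0
}

# Example call for the crawl directory used in this guide:
#   crawl_parts data
```

When its output is not redirected, nohup normally appends it to nohup.out in the current directory, so tail -f nohup.out shows the crawl log as it grows. Once the crawl has finished, bin/nutch readdb data/crawldb -stats prints a summary of the crawl database.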
Reference: http://wiki.apache.org/nutch/NutchTutorial