[Nutch] Single-machine search on Linux with the nutch 1.0 web front end

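Crawling and searching in nutch are essentially two separate parts: a crawl can run as a MapReduce (M/R) job, but a search is not an M/R job. There are two ways to search: (a) put the crawl data (also called the index data) on the local disk and search it there, or (b) search the crawl data directly in HDFS.
This post describes how to use the nutch-1.0 web front end to search local crawl data:
(1) Nutch search does not depend on the hadoop cluster. Copy the crawled data to any machine, install Tomcat on it, run the web front end that ships with nutch, configure it, and search works.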
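(2) Put the data directory produced by the crawl command bin/nutch crawl <urlDir> -dir data -depth 3 -topN 5 (where <urlDir> is the directory holding the seed URL list) in a local directory. For a distributed crawl the data sits in HDFS, so copy it to a local directory with "bin/hadoop dfs -copyToLocal data <local directory>". For example, copy the generated data directory to /home/nutch/nutchinstall/crawltest/. (To be safe, make sure the path contains no spaces; they can cause problems.)
Note:
The data directory is generated by the crawl and contains these subdirectories: crawldb, index, indexes, linkdb, segments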
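Before moving on, it can help to confirm that the copied directory is complete and that its path has no spaces. The following is only a minimal sketch: check_crawl_dir and path_has_no_space are hypothetical helper names, not part of nutch, and the subdirectory list is simply the one given in the note above.

```shell
# Hypothetical helper (not part of nutch): succeed only if the crawl
# directory contains every subdirectory named in the note above.
check_crawl_dir() {
    local dir="$1" missing=0 sub
    for sub in crawldb index indexes linkdb segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical helper: succeed only if the path contains no whitespace.
path_has_no_space() {
    case "$1" in
        *[[:space:]]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Example use: path_has_no_space /home/nutch/nutchinstall/crawltest/data && check_crawl_dir /home/nutch/nutchinstall/crawltest/data && echo "crawl data looks complete"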
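(3) Install Tomcat. Make sure the installation path contains no spaces. This matters: on Windows, a space in the path caused the search to always return 0 results.
(4) Copy the web front end nutch-1.0.war from the Nutch home directory to /usr/program/apache-tomcat-6.0.18/webapps/ (Tomcat is installed in /usr/program/apache-tomcat-6.0.18).
(5) With Tomcat running, open http://localhost:8080/nutch-1.0 in a browser; nutch-1.0.war is unpacked automatically.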
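The copy in step (4) can be wrapped in a small function that fails early if either path is wrong. This is a sketch under assumptions: deploy_nutch_war is a hypothetical name, and the Nutch home directory is passed in as an argument because the post does not state where it is.

```shell
# Hypothetical helper: copy the Nutch web front end into Tomcat's webapps directory.
# $1 = Nutch home directory (must contain nutch-1.0.war)
# $2 = Tomcat home directory (must contain webapps/)
deploy_nutch_war() {
    local war="$1/nutch-1.0.war" webapps="$2/webapps"
    [ -f "$war" ] || { echo "not found: $war" >&2; return 1; }
    [ -d "$webapps" ] || { echo "not found: $webapps" >&2; return 1; }
    cp "$war" "$webapps/"
}
```

Example use, with NUTCH_HOME standing in for wherever nutch is installed: deploy_nutch_war "$NUTCH_HOME" /usr/program/apache-tomcat-6.0.18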
(6) Edit the nutch-site.xml file of the web front end. After changing it you must restart Tomcat (run /usr/program/apache-tomcat-6.0.18/bin/shutdown.sh, then startup.sh).
nutch-site.xml is in the directory /usr/program/apache-tomcat-6.0.18/webapps/nutch-1.0/WEB-INF/classes/.
Configure it as follows:
<property>
<name>http.agent.name</name> <!-- required; otherwise the search returns no results -->
<value>nutch-1.0</value>
<description>HTTP 'User-Agent' request header.</description>
</property>

<property>
<name>http.robots.agents</name>
<value>nutch-1.0,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

<property>
<name>searcher.dir</name>
<value>/home/nutch/nutchinstall/crawltest/data</value> <!-- absolute path to the crawl data directory -->
<description>If the index is in HDFS, the value is the HDFS-relative directory;
if the index is on the local disk, the value is the local directory.
</description>
</property>
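The three properties above can also be written to the file from a script. The sketch below assumes they sit inside the usual <configuration> root element of a nutch-site.xml file; write_nutch_site is a hypothetical helper name, and it overwrites the target file, so back up any existing nutch-site.xml first.

```shell
# Hypothetical helper: write a nutch-site.xml with the three properties shown above.
# $1 = output file, $2 = absolute path to the crawl data directory (searcher.dir)
write_nutch_site() {
    local out="$1" searcher_dir="$2"
    # The post asks for an absolute path, so reject anything else.
    case "$searcher_dir" in
        /*) ;;
        *) echo "searcher.dir must be an absolute path: $searcher_dir" >&2; return 1 ;;
    esac
    cat > "$out" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-1.0</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>nutch-1.0,*</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>${searcher_dir}</value>
  </property>
</configuration>
EOF
}
```

Example use: write_nutch_site /usr/program/apache-tomcat-6.0.18/webapps/nutch-1.0/WEB-INF/classes/nutch-site.xml /home/nutch/nutchinstall/crawltest/data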
(7) Restart Tomcat.
(8) Search for keywords at http://localhost:8080/nutch-1.0.
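Steps (7) and (8) can be sketched from the command line as well. Both function names are hypothetical. The search.jsp page and its query parameter are an assumption about how the nutch-1.0 search form submits requests and are not taken from this post, so confirm them against the deployed web app. search_url does no URL encoding and is only suitable for simple single-word keywords.

```shell
# Hypothetical helper: stop Tomcat, wait briefly, then start it again.
# $1 = Tomcat home directory, $2 = seconds to wait between stop and start (default 5)
restart_tomcat() {
    local home="$1" pause="${2:-5}"
    sh "$home/bin/shutdown.sh" || echo "shutdown.sh failed (Tomcat may not have been running)" >&2
    sleep "$pause"
    sh "$home/bin/startup.sh"
}

# Hypothetical helper: build a search URL for one simple keyword.
# ASSUMPTION: the web app answers searches at search.jsp?query=<keyword>.
# $1 = base URL of the web app, $2 = keyword (no URL encoding is applied)
search_url() {
    printf '%s/search.jsp?query=%s\n' "$1" "$2"
}
```

Example use: restart_tomcat /usr/program/apache-tomcat-6.0.18, then fetch the result page with curl -s "$(search_url http://localhost:8080/nutch-1.0 nutch)".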

posted @ 2010-06-25 10:08 searchDM


