Single-Site Crawl and Search Test with Nutch

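1. Create a urls directory and, inside it, a file named seed.txt. List the sites to crawl in seed.txt, for example www.osu.edu, written as a full URL (http://www.osu.edu/) so that it matches the filter in step 2.
mkdir -p urls
cd urls
touch seed.txt

This creates the text file seed.txt under urls/. Add one URL per line for each site you want Nutch to crawl.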
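As a minimal sketch, step 1 can be done in one go. It assumes the commands run from the Nutch installation directory, and the http:// prefix and trailing slash on the seed are an assumption chosen so the entry matches the filter pattern in step 2:

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls
# Write one seed URL per line; note this overwrites any existing seed.txt.
printf '%s\n' 'http://www.osu.edu/' > urls/seed.txt
# Print the file for a quick visual check.
cat urls/seed.txt
```

Because this never changes into urls/, the later bin/nutch commands can be run from the same directory.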

2. Edit conf/crawl-urlfilter.txt

Replace MY.DOMAIN.NAME with osu.edu.

Before:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

After:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*osu.edu/
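To see which URLs the edited pattern lets through, it can be tried outside Nutch. This is only a rough sketch: it uses grep -E (POSIX extended regex) rather than the Java regex engine Nutch uses, which should behave the same for this particular pattern, and url_accepted is a hypothetical helper name:

```shell
# Pattern copied from the edited filter line, without the leading "+"
# (in the filter file the "+" marks an accept rule; it is not part of the regex).
pattern='^http://([a-z0-9]*\.)*osu.edu/'

# Hypothetical helper: succeeds when the URL matches the pattern.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in 'http://www.osu.edu/' 'http://osu.edu/about' \
           'https://www.osu.edu/' 'http://www.example.com/'; do
  if url_accepted "$url"; then
    echo "match     $url"
  else
    echo "no match  $url"
  fi
done
```

The first two URLs match and the last two do not. The https:// case is worth noting, because the pattern only covers plain http:// URLs.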

3. Start the crawl

bin/nutch crawl urls -dir crawldemo -depth 2
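After the crawl finishes, a quick sanity check is to confirm that the output directory was populated. The sketch below assumes the Nutch 1.x layout in which crawldemo contains crawldb, linkdb and segments subdirectories. check_crawl_dir is a hypothetical helper, not part of Nutch:

```shell
# Hypothetical helper: report whether the expected crawl subdirectories exist.
# Returns 0 if all are present, 1 if any is missing.
check_crawl_dir() {
  dir=$1
  rc=0
  for sub in crawldb linkdb segments; do
    if [ -d "$dir/$sub" ]; then
      echo "ok       $dir/$sub"
    else
      echo "missing  $dir/$sub"
      rc=1
    fi
  done
  return "$rc"
}

# Usage after step 3, from the Nutch directory:
#   check_crawl_dir crawldemo
#   bin/nutch readdb crawldemo/crawldb -stats   # URL counts by status
```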

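4. Configure Tomcat and restart it. Do not forget the restart.

In the deployed web application's nutch-site.xml, set searcher.dir to the crawl output directory:

gsli@ubuntu:~/Downloads/apache-tomcat-7.0.10/webapps/nutch-1.2/WEB-INF/classes$ cat nutch-site.xml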

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/gsli/Downloads/nutch-1.2/crawldemo</value>
    <description></description>
  </property>
</configuration>
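A common cause of empty search results is a searcher.dir value that does not point at the crawl output. As a rough sketch, the value can be pulled out of the file and checked. searcher_dir is a hypothetical helper, and the awk pattern matching assumes the simple one-tag-per-line layout shown above rather than arbitrary XML:

```shell
# Hypothetical helper: print the <value> that follows
# <name>searcher.dir</name> in the given nutch-site.xml.
searcher_dir() {
  awk '
    /<name>searcher\.dir<\/name>/ { found = 1 }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Usage, from WEB-INF/classes of the deployed web application:
#   dir=$(searcher_dir nutch-site.xml)
#   if [ -d "$dir" ]; then echo "found: $dir"; else echo "not found: $dir"; fi
```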

5. Search from the Nutch search page

Searching works only after the configuration in step 4 is complete and Tomcat has been restarted.
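Once Tomcat is back up, the search page can also be queried from the command line. This sketch makes several assumptions: Tomcat's default port 8080, a context path of nutch-1.2 (taken from the webapps/nutch-1.2 directory in step 4), and the search.jsp page with a query parameter from the Nutch 1.x web application. Adjust these if your deployment differs. search_url is a hypothetical helper:

```shell
# Hypothetical helper: build the search URL.
# $1 = host:port, $2 = context path, $3 = query term (assumed to be URL-safe)
search_url() {
  printf 'http://%s/%s/search.jsp?query=%s\n' "$1" "$2" "$3"
}

# Usage (CATALINA_HOME is assumed to point at the Tomcat installation):
#   "$CATALINA_HOME"/bin/shutdown.sh
#   "$CATALINA_HOME"/bin/startup.sh
#   curl -s "$(search_url localhost:8080 nutch-1.2 osu)" | head
```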

posted @ 2013-06-27 19:18 by free_thinker