Single-site crawl and search test with Nutch


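1, Create a urls folder and, inside it, a seed.txt file. List the sites to crawl in seed.txt, for example: http://www.osu.edu/

mkdir -p urls
cd urls
touch seed.txt

seed.txt takes one URL per line, one for each site you want Nutch to crawl. Write each entry as a full URL (http://www.osu.edu/ rather than www.osu.edu) so that it matches the ^http:// filter in step 2. Return to the Nutch directory (cd ..) before running the crawl in step 3.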
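Step 1 can also be done in one go from the Nutch directory. This is a minimal sketch; the full http:// form of the seed URL is an assumption made so that the entry matches the URL filter edited in step 2.

```shell
# Create the seed directory and write one seed URL per line.
mkdir -p urls
printf '%s\n' 'http://www.osu.edu/' > urls/seed.txt

# Show what the crawler will be given as its starting point.
cat urls/seed.txt
```

To crawl more than one site, add further URLs to the printf call, one argument per URL.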

2, Edit conf/crawl-urlfilter.txt

Replace MY.DOMAIN.NAME with osu.edu

Before:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

After:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*osu.edu/
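The substitution in step 2 can be scripted and sanity-checked before crawling. The sketch below works on a scratch copy of the two relevant lines rather than the real conf/crawl-urlfilter.txt, and it tests the pattern with grep -E, which only approximates the Java regular expressions Nutch uses. The scratch directory and the sample URLs are assumptions for illustration.

```shell
# Scratch copy holding the two lines that matter (not the real conf file).
workdir=$(mktemp -d)
filter="$workdir/crawl-urlfilter.txt"
printf '%s\n' '# accept hosts in MY.DOMAIN.NAME' \
              '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$filter"

# Replace the placeholder domain with osu.edu.
sed 's/MY\.DOMAIN\.NAME/osu.edu/g' "$filter" > "$filter.new"
mv "$filter.new" "$filter"

# Pull out the accept rule and drop its leading '+'.
pattern=$(grep '^+' "$filter" | sed 's/^+//')

# Try a few sample URLs against the rule.
for url in http://www.osu.edu/ http://osu.edu/about http://www.example.com/; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

With this rule, both osu.edu URLs are accepted and the example.com URL is rejected. A seed written without the http:// prefix would not match either.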

3, Start the crawl

bin/nutch crawl urls -dir crawldemo -depth 2
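Before pointing the search webapp at the crawl output, it can help to confirm that the crawl actually produced something. The helper below is a hypothetical sketch (check_crawl_dir is not part of Nutch); the subdirectory names crawldb, linkdb, segments and index are an assumption about what the Nutch 1.2 crawl command writes under the -dir directory.

```shell
# Report which of the expected crawl subdirectories exist under a directory.
# Returns 0 if all are present, 1 otherwise.
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments index; do
    if [ -d "$dir/$sub" ]; then
      echo "ok       $dir/$sub"
    else
      echo "missing  $dir/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Check the output directory used in step 3.
check_crawl_dir crawldemo || echo "crawldemo looks incomplete; re-check steps 1 to 3"
```

If the crawl finished but some directories are missing, the usual suspects are a seed URL that the filter rejects or a crawl run from the wrong working directory.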

4, Configure Tomcat and restart it. Do not forget the restart.

gsli@ubuntu:~/Downloads/apache-tomcat-7.0.10/webapps/nutch-1.2/WEB-INF/classes$ cat nutch-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>searcher.dir</name>

<value>/home/gsli/Downloads/nutch-1.2/crawldemo</value>

</property>

</configuration>
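A quick way to catch a wrong searcher.dir before restarting Tomcat is to read the value back out of the file and check that the directory exists. The sketch below runs against a scratch copy of the configuration shown above; the sed-based extraction is a simplification that assumes the name and value sit on their own lines, and is not a general XML parser.

```shell
# Scratch copy of the configuration (in practice, point site_xml at the real file).
site_xml=$(mktemp)
cat > "$site_xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/gsli/Downloads/nutch-1.2/crawldemo</value>
</property>
</configuration>
EOF

# Within the searcher.dir property, print the text between <value> and </value>.
searcher_dir=$(sed -n '/<name>searcher.dir<\/name>/,/<\/property>/s:.*<value>\(.*\)</value>.*:\1:p' "$site_xml")
echo "searcher.dir = $searcher_dir"

# Warn if the configured crawl directory is not there.
[ -d "$searcher_dir" ] || echo "warning: $searcher_dir does not exist on this machine"
```

After the value checks out, restart Tomcat from its installation directory, for example with bin/shutdown.sh followed by bin/startup.sh.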

5, Search from the Nutch search page

Searching only works once the configuration in step 4 is complete and Tomcat has been restarted.
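The search page can also be exercised from the command line. This sketch makes several assumptions: Tomcat listens on localhost port 8080, the webapp is deployed under the nutch-1.2 context path seen in step 4, and the search page is search.jsp with a query parameter. The query term osu is only an example.

```shell
# Assumed location of the Nutch search page; adjust host, port and context path.
base_url="http://localhost:8080/nutch-1.2"
query="osu"
search_url="$base_url/search.jsp?query=$query"
echo "search URL: $search_url"

# Fetch the result page if curl is available and Tomcat is reachable.
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$search_url" | head -n 20 || true
  echo "request finished (empty output means Tomcat did not answer)"
else
  echo "curl not installed; open the URL in a browser instead"
fi
```

If the page loads but returns no hits for a term that clearly appears on the crawled site, re-check searcher.dir and make sure Tomcat was restarted after the change.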

posted @ 2013-06-27 19:18 free_thinker