Categorized Search with Nutch
Environment:
Ubuntu 11.10
Tomcat 6.0.35
Nutch 1.2
My approach to categorized search is to build a separate crawl database for each group of URLs. For a vertical search engine for the electric power industry, for example, the content can be split into news, products and talent (recruitment). That means three crawl databases, each with its own list of seed URLs, plus URL filter rules configured so that each crawl only keeps the sites you want.
Below I explain the implementation step by step.
First you need a seed URL list for each category; you can search Baidu per category and compile three lists from the results yourself.
Here are the three lists I put together.
News (file name: newsURL)
http://www.cpnn.com.cn/
http://news.bjx.com.cn/
http://www.chinapower.com.cn/news/
Products (file name: productURL)
http://www.powerproduct.com/
http://www.epapi.com/
Talent (file name: talentURl)
http://www.cphr.com.cn/
http://www.ephr.com.cn/
http://www.myepjob.com/
http://www.epjob88.com/
http://hr.bjx.com.cn/
http://www.epjob.com.cn/
http://ep.baidajob.com/
Since this is only for testing, I did not collect very many addresses.
For vertical search you can no longer crawl with the one-shot command nutch crawl urls -dir crawl -depth <depth> -topN <topN> -threads <threads>: that command is meant for intranet (enterprise) search and cannot crawl incrementally. Instead I use an incremental crawl script that others have already written, available at http://wiki.apache.org/nutch/Crawl.
Because three crawl databases are needed, the script has to be modified a bit. My crawl databases live in /crawldb/news, /crawldb/product and /crawldb/talent, and the three seed URL files go into the matching directories as /crawldb/news/newsURL, /crawldb/product/productURL and /crawldb/talent/talentURl. A small sketch for preparing this layout comes first, followed by my modified crawl script. To use the script, the NUTCH_HOME and CATALINA_HOME environment variables must be set.
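Here is a minimal sketch of preparing that layout; it assumes the three lists above were saved as newsURL, productURL and talentURl in the current directory, and the target paths are simply the ones I chose, so adjust them freely.

# Create one crawl directory per category.
mkdir -p /crawldb/news /crawldb/product /crawldb/talent
# Put the seed URL lists into the matching directories.
cp newsURL /crawldb/news/newsURL
cp productURL /crawldb/product/productURL
cp talentURl /crawldb/talent/talentURl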
#!/bin/bash
############################# Power-news incremental crawl #############################
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
echo "----- Starting power-news incremental crawl -----"
cd /crawldb/news
depth=5
threads=100
adddays=5
topN=5000 # Comment this statement if you don't want to set topN value
#Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"
#Parse arguments
if[ "$1" == "safe" ]
then
safe=yes
fi
if[ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echorunbot: $0 could not find environment variable NUTCH_HOME
echorunbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echorunbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if[ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10
echorunbot: $0 could not find environment variable NUTCH_HOME
echorunbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echorunbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi
if[ -n "$topN" ]
then
topN="-topN$topN"
else
topN=""
fi
steps=8
echo"----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutchinject crawl/crawldb urls
echo"----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0;i < $depth; i++))
do
echo"--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutchgenerate crawl/crawldb crawl/segments $topN \
-adddays$adddays
if[ $? -ne 0 ]
then
echo"runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls-d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutchfetch $segment -threads $threads
if[ $? -ne 0 ]
then
echo"runbot: fetch $segment at depth `expr $i + 1` failed."
echo"runbot: Deleting segment $segment."
rm$RMARGS $segment
continue
fi
$NUTCH_HOME/bin/nutchupdatedb crawl/crawldb $segment
done
echo"----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutchmergesegs crawl/MERGEDsegments crawl/segments/*
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/segments
else
rm$RMARGS crawl/BACKUPsegments
mv$MVARGS crawl/segments crawl/BACKUPsegments
fi
mv$MVARGS crawl/MERGEDsegments crawl/segments
echo"----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutchinvertlinks crawl/linkdb crawl/segments/*
echo"----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutchindex crawl/NEWindexes crawl/crawldb crawl/linkdb \
crawl/segments/*
echo"----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutchdedup crawl/NEWindexes
echo"----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutchmerge crawl/NEWindex crawl/NEWindexes
echo"----- Loading New Index (Step 8 of $steps) -----"
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/NEWindexes
rm$RMARGS crawl/index
else
rm$RMARGS crawl/BACKUPindexes
rm$RMARGS crawl/BACKUPindex
mv$MVARGS crawl/NEWindexes crawl/BACKUPindexes
mv$MVARGS crawl/index crawl/BACKUPindex
fi
mv$MVARGS crawl/NEWindex crawl/index
echo"runbot: FINISHED: -----电力新闻增量抓取完毕!-----"
echo""
############################# Power-products incremental crawl #############################
echo "----- Starting power-products incremental crawl -----"
cd /crawldb/product
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb productURL
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
-adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth `expr $i + 1` failed."
echo "runbot: Deleting segment $segment."
rm $RMARGS $segment
continue
fi
$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done
echo"----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutchmergesegs crawl/MERGEDsegments crawl/segments/*
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/segments
else
rm$RMARGS crawl/BACKUPsegments
mv$MVARGS crawl/segments crawl/BACKUPsegments
fi
mv$MVARGS crawl/MERGEDsegments crawl/segments
echo"----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutchinvertlinks crawl/linkdb crawl/segments/*
echo"----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutchindex crawl/NEWindexes crawl/crawldb crawl/linkdb \
crawl/segments/*
echo"----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutchdedup crawl/NEWindexes
echo"----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutchmerge crawl/NEWindex crawl/NEWindexes
echo"----- Loading New Index (Step 8 of $steps) -----"
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/NEWindexes
rm$RMARGS crawl/index
else
rm$RMARGS crawl/BACKUPindexes
rm$RMARGS crawl/BACKUPindex
mv$MVARGS crawl/NEWindexes crawl/BACKUPindexes
mv$MVARGS crawl/index crawl/BACKUPindex
fi
mv$MVARGS crawl/NEWindex crawl/index
echo"runbot: FINISHED:-----电力产品增量抓取完毕!-----"
echo""
############################# Power-talent incremental crawl #############################
echo "----- Starting power-talent incremental crawl -----"
cd /crawldb/talent
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb talentURl
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
-adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth `expr $i + 1` failed."
echo "runbot: Deleting segment $segment."
rm $RMARGS $segment
continue
fi
$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done
echo"----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutchmergesegs crawl/MERGEDsegments crawl/segments/*
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/segments
else
rm$RMARGS crawl/BACKUPsegments
mv$MVARGS crawl/segments crawl/BACKUPsegments
fi
mv$MVARGS crawl/MERGEDsegments crawl/segments
echo"----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutchinvertlinks crawl/linkdb crawl/segments/*
echo"----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutchindex crawl/NEWindexes crawl/crawldb crawl/linkdb \
crawl/segments/*
echo"----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutchdedup crawl/NEWindexes
echo"----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutchmerge crawl/NEWindex crawl/NEWindexes
echo"----- Loading New Index (Step 8 of $steps) -----"
${CATALINA_HOME}/bin/shutdown.sh
if[ "$safe" != "yes" ]
then
rm$RMARGS crawl/NEWindexes
rm$RMARGS crawl/index
else
rm$RMARGS crawl/BACKUPindexes
rm$RMARGS crawl/BACKUPindex
mv$MVARGS crawl/NEWindexes crawl/BACKUPindexes
mv$MVARGS crawl/index crawl/BACKUPindex
fi
mv$MVARGS crawl/NEWindex crawl/index
${CATALINA_HOME}/bin/startup.sh
echo"runbot: FINISHED:-----电力人才增量抓取完毕!-----"
echo""
Copy the code above to your Linux machine and make it executable with chmod 755.
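For example (the path /crawldb/runbot.sh is only an assumption about where you saved the script; substitute your own, and remember that cron does not read your shell profile, so NUTCH_HOME and CATALINA_HOME must be set in the script or the crontab):

# Make the crawl script executable and run it once by hand to verify it works.
chmod 755 /crawldb/runbot.sh
/crawldb/runbot.sh
# For regular incremental re-crawls, a cron entry can run it unattended,
# e.g. every night at 02:00 (added with `crontab -e`):
# 0 2 * * * /crawldb/runbot.sh >> /crawldb/runbot.log 2>&1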
At this point the script still cannot crawl any pages; you first have to configure URL filter rules in $NUTCH_HOME/conf/regex-urlfilter.txt.
My configuration is as follows:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# accept URLs containing certain characters as probable queries, etc.
# (the default filter uses '-' here; '+' keeps dynamic, query-style pages crawlable)
+[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-.*\.js
# accept only URLs from the following domains; anything not matched below is ignored
+^http://([a-z0-9]*\.)*cpnn.com.cn/
+^http://([a-z0-9]*\.)*cphr.com.cn/
+^http://([a-z0-9]*\.)*powerproduct.com/
+^http://([a-z0-9]*\.)*bjx.com.cn/
+^http://([a-z0-9]*\.)*renhe.cn/
+^http://([a-z0-9]*\.)*chinapower.com.cn/
+^http://([a-z0-9]*\.)*ephr.com.cn/
+^http://([a-z0-9]*\.)*epapi.com/
+^http://([a-z0-9]*\.)*myepjob.com/
+^http://([a-z0-9]*\.)*epjob88.com/
+^http://([a-z0-9]*\.)*xindianli.com/
+^http://([a-z0-9]*\.)*epjob.com.cn/
+^http://([a-z0-9]*\.)*baidajob.com/
+^http://([a-z0-9]*\.)*01hr.com/
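Before crawling, you can roughly sanity-check these rules against a seed list with a small helper that emulates the filter's first-match logic. This is only an approximation (Nutch evaluates Java regular expressions, while the sketch below uses grep -E, which differs in corner cases such as back-references), and check_urlfilter.sh is just a hypothetical name:

#!/bin/bash
# check_urlfilter.sh <regex-urlfilter.txt> <url-list-file>
# For each URL, the first '+' or '-' pattern that matches decides the verdict;
# if nothing matches, the URL is ignored ('-'), mirroring the filter's semantics.
FILTER="$1"
URLS="$2"
while read -r url; do
verdict="-"   # ignored by default
while read -r rule; do
case "$rule" in
\#*|"") continue ;;   # skip comments and blank lines
esac
sign="${rule:0:1}"
pattern="${rule:1}"
if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
verdict="$sign"
break   # first matching rule wins
fi
done < "$FILTER"
echo "$verdict $url"
done < "$URLS"

Running it as bash check_urlfilter.sh $NUTCH_HOME/conf/regex-urlfilter.txt /crawldb/news/newsURL should print a '+' in front of every seed URL that the filter will let through.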
Next, configure $NUTCH_HOME/conf/nutch-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>just a test</value>
<description>Test</description>
</property>
</configuration>
If all of the steps above succeeded, you can now crawl with the script. Pay attention to where your crawl data is stored, and change the corresponding paths in the script to fit your own directory layout.
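After a run finishes, one simple way to confirm that pages actually made it into each crawl database is Nutch's readdb command, for example:

# Print URL counts and fetch-status statistics for each crawl database.
$NUTCH_HOME/bin/nutch readdb /crawldb/news/crawl/crawldb -stats
$NUTCH_HOME/bin/nutch readdb /crawldb/product/crawl/crawldb -stats
$NUTCH_HOME/bin/nutch readdb /crawldb/talent/crawl/crawldb -stats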
Once crawling is finished, the next step is to set up the search environment.
Copy the war file from the nutch directory into Tomcat's webapps directory and let it unpack itself. Delete what is already in the ROOT directory, copy the contents of the freshly unpacked directory into it, and modify WEB-INF/classes/nutch-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/crawldb/news/crawl</value>
</property>
<property>
<name>http.agent.name</name>
<value>tangmiSpider</value>
<description>My Search Engine</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
The value of searcher.dir is the directory where your crawl data is stored, so change it to match your setup. Then create two more directories under webapps, talent and product, copy the contents of the unpacked directory into each of them, and edit their WEB-INF/classes/nutch-site.xml so that searcher.dir is set to /crawldb/talent/crawl and /crawldb/product/crawl respectively. At this point categorized search is ready: to search a given category, just open the corresponding URL. A rough sketch of these deployment steps is shown below.
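The sketch assumes the web archive is $NUTCH_HOME/nutch-1.2.war and unpacks it by hand rather than waiting for Tomcat's auto-deploy; adjust names and paths to your environment:

# Unpack the Nutch web application to a temporary directory.
unzip -q $NUTCH_HOME/nutch-1.2.war -d /tmp/nutch-web
# ROOT serves the news index in this setup.
rm -rf $CATALINA_HOME/webapps/ROOT/*
cp -r /tmp/nutch-web/* $CATALINA_HOME/webapps/ROOT/
# Two more copies for the other categories.
mkdir -p $CATALINA_HOME/webapps/talent $CATALINA_HOME/webapps/product
cp -r /tmp/nutch-web/* $CATALINA_HOME/webapps/talent/
cp -r /tmp/nutch-web/* $CATALINA_HOME/webapps/product/
# Point each webapp's WEB-INF/classes/nutch-site.xml (searcher.dir) at
# /crawldb/news/crawl, /crawldb/talent/crawl and /crawldb/product/crawl,
# then restart Tomcat so the changes take effect.
$CATALINA_HOME/bin/shutdown.sh
$CATALINA_HOME/bin/startup.sh

With Tomcat on its default port, the three indexes should then be searchable at http://localhost:8080/, http://localhost:8080/talent/ and http://localhost:8080/product/.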
My results page: