Hadoop-0.20.2 + Nutch-1.2 + Tomcat-7: Distributed Search Configuration

As Nutch has evolved, its modules have grown increasingly independent. I have worked through installs of 2.1 and 1.6 without ever getting the complete feature set running. Today I am installing Nutch 1.2, which should be the last stable release that still ships a WAR file.

 

1.      Preparation

Download apache-nutch-1.2-bin.zip, apache-tomcat-7.0.39.tar.gz, and hadoop-0.20.2.tar.gz.

Extract the downloaded hadoop-0.20.2.tar.gz into /opt.

Extract the downloaded apache-nutch-1.2-bin.zip into /opt.

Extract the downloaded apache-tomcat-7.0.39.tar.gz into /opt.
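
As shell commands (a sketch; assumes the three archives sit in the current directory):

tar -xzf hadoop-0.20.2.tar.gz -C /opt
unzip -q apache-nutch-1.2-bin.zip -d /opt    # should unpack to /opt/nutch-1.2
tar -xzf apache-tomcat-7.0.39.tar.gz -C /opt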

2.      Configuring hadoop-0.20.2

(1)   Edit conf/hadoop-env.sh and append at the end:

export JAVA_HOME=/opt/java-7-sun

export HADOOP_HEAPSIZE=1000

export HADOOP_CLASSPATH=.:/opt/nutch-1.2/lib:/opt/hadoop-0.20.2

export NUTCH_HOME=/opt/nutch-1.2/lib

(2)   Edit /etc/profile and add:

#Hadoop

export HADOOP_HOME=/opt/hadoop-0.20.2

export PATH=$PATH:$HADOOP_HOME/bin
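
Then reload the profile so the new variables take effect in the current shell:

source /etc/profile
echo $HADOOP_HOME    # should print /opt/hadoop-0.20.2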

(3)   Edit conf/core-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://m2:9000</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/opt/hadoop-0.20.2/tempdata/var</value>

</property>

<property>

<name>hadoop.native.lib</name>

<value>true</value>

<description>Should native hadoop libraries, if present, be used.</description>

</property>

</configuration>

(4)   Edit conf/hdfs-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/opt/hadoop-0.20.2/tempdata/name1,/opt/hadoop-0.20.2/tempdata/name2</value> <!-- NameNode metadata directories -->

<description>  </description>

</property>

<property>

<name>dfs.data.dir</name>

<value>/opt/hadoop-0.20.2/tempdata/data1,/opt/hadoop-0.20.2/tempdata/data2</value>

<description> </description>

</property>

<property>

  <name>dfs.replication</name>

  <value>2</value>

</property>

</configuration>

(5)   Edit conf/mapred-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

  <name>mapred.job.tracker</name>

  <value>m2:9001</value>

</property>

<property>

  <name>mapred.local.dir</name>

<value>/opt/hadoop-0.20.2/tempdata/var</value>

</property>

<property>

  <name>mapred.output.compression.type</name>

  <value>BLOCK</value>

  <description>If the job outputs are to compressed as SequenceFiles, how should

   they be compressed? Should be one of NONE, RECORD or BLOCK.

  </description>

</property>

<property>

  <name>mapred.output.compress</name>

  <value>true</value>

  <description>Should the job outputs be compressed?

  </description>

</property>

<property>

  <name>mapred.compress.map.output</name>

  <value>true</value>

</property>

</configuration>

(6)   Write the conf/masters and conf/slaves files, then start Hadoop (the hadoop command lives in the bin directory under the Hadoop home; a sketch of the cluster files follows the commands):

a)  hadoop namenode -format

b)  start-all.sh
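
For reference, a sketch of the two cluster files plus a quick health check (assuming m2 is the master and s6/s7 are the slaves, matching the hostnames used later in this post):

echo m2 > /opt/hadoop-0.20.2/conf/masters
printf 's6\ns7\n' > /opt/hadoop-0.20.2/conf/slaves
# After start-all.sh, verify the daemons actually came up:
jps                                              # master: NameNode, SecondaryNameNode, JobTracker
/opt/hadoop-0.20.2/bin/hadoop dfsadmin -report   # lists the live DataNodes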

(7)   Check on Hadoop from the web UIs:

a)  http://localhost:50070

b)  http://localhost:50030

3.      Configuring nutch-1.2

(1)   Create nutch-1.2/urls/url.txt with the following seed URLs (a shell sketch follows the list):

http://www.163.com/

http://www.tianya.cn/

http://www.renren.com/

http://www.iteye.com/
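
A shell sketch for creating the seed file (assumes Nutch was unpacked to /opt/nutch-1.2):

mkdir -p /opt/nutch-1.2/urls
cat > /opt/nutch-1.2/urls/url.txt <<'EOF'
http://www.163.com/
http://www.tianya.cn/
http://www.renren.com/
http://www.iteye.com/
EOF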

 

(2)   Edit conf/crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://([a-z0-9]*\.)*163.com/

+^http://([a-z0-9]*\.)*tianya.cn/

+^http://([a-z0-9]*\.)*renren.com/

+^http://([a-z0-9]*\.)*iteye.com/

 

(3)   Configure conf/nutch-site.xml:

<configuration>

<property>

<name>http.agent.name</name>

<value>aniu-search</value>

<description>aniu.160</description>

</property>

<property>

<name>http.agent.description</name>

<value></value>

<description></description>

</property>

<property>

<name>http.agent.url</name>

<value></value>

<description></description>

</property>

<property>

<name>http.agent.email</name>

<value></value>

<description></description>

</property>

<property>

<name>searcher.dir</name>

<value>search-dir</value> <!-- this directory lives under hdfs://m2:9000/user/root -->

</property>

</configuration>

 

(4)     Copy the files in hadoop/conf over the ones in nutch/conf.
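
In shell form (paths assumed; -f overwrites Nutch's existing copies):

cp -f /opt/hadoop-0.20.2/conf/* /opt/nutch-1.2/conf/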

(5)     Create the file search-dir/search-servers.txt with the following content:

m2 9999

s6 9999

s7 9999

4.      Configuring Tomcat

(1)   Copy nutch-1.2.war from the nutch-1.2 directory into Tomcat's webapps directory, then open http://localhost:8080/nutch-1.2/ in a browser; the Nutch search page should appear. (Do run this test; it unpacks the WAR and creates the directories the later steps rely on.)
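
The deployment as commands (paths assumed):

cp /opt/nutch-1.2/nutch-1.2.war /opt/apache-tomcat-7.0.39/webapps/
/opt/apache-tomcat-7.0.39/bin/catalina.sh start    # Tomcat unpacks the WAR into webapps/nutch-1.2/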

(2)   To fix garbled Chinese characters, edit tomcat/conf/server.xml, find the HTTP connector, and add the two encoding attributes:

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

(3)   Copy the files in nutch/conf over the ones in /opt/apache-tomcat-7.0.39/webapps/nutch-1.2/WEB-INF/classes.
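
In shell form (paths assumed):

cp -f /opt/nutch-1.2/conf/* /opt/apache-tomcat-7.0.39/webapps/nutch-1.2/WEB-INF/classes/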

5.      Testing distributed crawling and distributed search

(1)   Upload the urls folder to hdfs://m2:9000/user/root.
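
One way to do the upload, run as root from /opt/nutch-1.2 so the relative path resolves to /user/root/urls (assuming the default HDFS home directory):

/opt/hadoop-0.20.2/bin/hadoop fs -put urls urls
/opt/hadoop-0.20.2/bin/hadoop fs -ls /user/root/urls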

(2)   Run the crawl command: bin/nutch crawl urls -dir crawl -depth 5 -threads 10 -topN 100. If a crawl directory containing five subdirectories appears under hdfs://m2:9000/user/root, the crawl succeeded.
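
A quick check of the output (the five subdirectories a successful Nutch 1.2 crawl produces):

/opt/hadoop-0.20.2/bin/hadoop fs -ls /user/root/crawl
# expect: crawldb  index  indexes  linkdb  segments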

(3)   Start the search server on each node:

[root@m2 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl

[root@s6 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl

[root@s7 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl
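
Each server process stays in the foreground; in practice I background it like this (my own variant, not part of the original steps):

nohup bin/nutch server 9999 hdfs://m2:9000/user/root/crawl > /tmp/nutch-server.log 2>&1 &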

(4)   Upload search-dir/search-servers.txt to HDFS, as hdfs://m2:9000/user/root/search-dir/search-servers.txt.
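
As commands, with a sanity check (paths as above):

/opt/hadoop-0.20.2/bin/hadoop fs -put search-dir search-dir
/opt/hadoop-0.20.2/bin/hadoop fs -cat search-dir/search-servers.txt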

(5)   Restart Tomcat:

/opt/apache-tomcat-7.0.39/bin/catalina.sh stop

/opt/apache-tomcat-7.0.39/bin/catalina.sh start

(6)   Test in a browser at http://10.1.50.160:8080/nutch-1.2/search.jsp; the search succeeds.

6.      FAQ

(1)     The crawl command fails with java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.SnappyCodec was not found.

Fix: remove the Snappy-related entries from hadoop-0.20.2/conf/mapred-site.xml.
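
A quick way to locate the offending entries before deleting them (a sketch; the property names are whatever your configs actually contain):

grep -n -i snappy /opt/hadoop-0.20.2/conf/*.xml /opt/nutch-1.2/conf/*.xml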

(2)   Running the crawl command bin/nutch crawl hdfs://m2:9000/user/root/nutch-1.2/urls -dir hdfs://m2:9000/user/root/nutch-1.2/crawl -depth 5 -threads 10 -topN 100 fails with:

java.lang.RuntimeException: Error in configuring object

at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)

at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)

at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)

After asking 大脸 about this, I got the error resolved. The fix is to

restore the relevant entry in conf/nutch-default.xml to its default:

<property>

<name>plugin.folders</name>

<value>plugins</value>

<description>Directories where nutch plugins are located. Each

element may be a relative or absolute path. If absolute, it is used

as is. If relative, it is searched for on the classpath.</description>

</property>

(3)     Error:

Error: org.apache.nutch.scoring.ScoringFilters.injectedScore(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;)V

Fix:

Looking at the TaskTracker log on s6, I found one such entry; I looked up "fatal" and, sure enough, it means fatal... but no concrete cause was given.

Digging further, I opened m2:50030 in the browser and tracked down the task I had just run (restarting Hadoop along the way), where I found the error:

2013-04-13 07:45:38,046 ERROR org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.NoSuchMethodError: org.apache.nutch.scoring.ScoringFilters.injectedScore(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;)V

at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:141)

at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:59)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

at org.apache.hadoop.mapred.Child.main(Child.java:170)

From searching online: a NoSuchMethodError usually means two jars providing the same classes, at mismatched versions, are both on the classpath. Reading bin/nutch, I noticed it also loads plugins from nutch-1.2/build, which conflicts with nutch-1.2/plugins. Renaming the build folder solved the problem.
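
The rename itself (the new name is arbitrary):

mv /opt/nutch-1.2/build /opt/nutch-1.2/build.renamed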

(4)     Distributed search cannot find the content on HDFS.

I finally spotted the relevant WARN: a file could not be found.

Fix: upload that folder to HDFS.

7.      Main references:

1.    nutch入门学习.pdf

2.    http://hi.baidu.com/erliang20088

3.    http://lendfating.blog.163.com/blog/static/1820743672012111311532359/

4.    http://blog.csdn.net/witsmakemen/article/details/8256369





posted on 2013-04-15 18:28 by aniuer