Nutch 0.8 Configuration

1. In \cygwin\nutch-0.7.2\conf, replace nutch-site.xml with nutch-default.xml

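2. Set the regex matching rules in crawl-urlfilter.txt
# skip URLs containing certain characters as probable queries, etc.
-[!] (the question mark and other characters that may appear in URLs are removed from the skip list)

Change the regular expression to +^http://(\.*)*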
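
To make the effect of step 2 concrete, here is a minimal Python sketch of how the rules in crawl-urlfilter.txt are evaluated, going by the comments in that file: rules are tried top to bottom, the first pattern found in the URL decides (`+` includes, `-` excludes), and a URL that matches nothing is ignored. The function names are made up for illustration, `-[?*!@=]` is assumed to be the stock skip rule, and the accept rule is simplified to `+^http://` rather than the exact expression above.

```python
import re

def parse_rules(lines):
    """Turn crawl-urlfilter.txt style lines into (include, regex) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match means ignore."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False

# Assumed stock skip rule versus the edited one from step 2.
stock_rules = parse_rules(["-[?*!@=]", "+^http://"])
edited_rules = parse_rules(["-[!]", "+^http://"])

query_url = "http://example.com/list.jsp?page=2"
print(accepts(stock_rules, query_url))   # query URL is skipped
print(accepts(edited_rules, query_url))  # query URL is now accepted
```

With the shortened skip rule, URLs carrying a query string reach the `+` rule instead of being dropped, which is why dynamic pages start getting crawled.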
3.       In Nutch 0.7.2, urls.txt does not need … added
4.       Chinese-character problem in Tomcat: modify the following section of server.xml,
<!-- add the last line (URIEncoding and useBodyEncodingForURI) -->
<Connector port="8080" maxHttpHeaderSize="8192"
         maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
         enableLookups="false" redirectPort="8443" acceptCount="100"
         connectionTimeout="20000"
         disableUploadTimeout="true"
         URIEncoding="UTF-8" useBodyEncodingForURI="true" />
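
Why the URIEncoding attribute matters can be shown with a short Python sketch. It assumes the browser percent-encodes the search keyword as UTF-8 bytes and that Tomcat, without the attribute, decodes the URI as ISO-8859-1 (the default in Tomcat versions of that period). The keyword and variable names are only illustrative.

```python
from urllib.parse import quote, unquote

keyword = "中文"
# What the browser typically sends: the UTF-8 bytes, percent-encoded.
encoded = quote(keyword)

# Decoding with the matching charset restores the keyword.
as_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 yields garbage characters,
# which is what a search page sees when URIEncoding is not set.
as_latin1 = unquote(encoded, encoding="latin-1")

print(encoded)
print(as_utf8 == keyword)
print(as_latin1 == keyword)
```

The bytes on the wire are the same in both cases; only the charset used to decode them differs, so the fix belongs on the server side.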

5. Add the following entry to regex-normalize.xml
<regex>
<pattern>(\?|\&|\&amp *([a-zA-Z0-9]*\.)*</pattern>
<substitution>$1$3</substitution>
</regex>
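
As a rough illustration of what a pattern/substitution pair with `$1$3` does, the Python sketch below keeps capture groups 1 and 3 and drops whatever the rest of the pattern matched. The pattern used here is a hypothetical one that strips a `sid=` session parameter; it is not the pattern from the entry above, and Python writes the group references as `\1\3` instead of `$1$3`.

```python
import re

# Hypothetical rule: group 1 is the separator before the parameter,
# group 2 the separator after it, group 3 the rest of the query string.
PATTERN = re.compile(r"(\?|&)sid=[0-9a-f]+(&|$)(.*)")
SUBSTITUTION = r"\1\3"  # same idea as $1$3 in regex-normalize.xml

def normalize(url):
    """Apply the single illustrative rule to a URL."""
    return PATTERN.sub(SUBSTITUTION, url)

print(normalize("http://example.com/p?sid=abc123&x=1"))
print(normalize("http://example.com/p?x=1"))
```

Normalizing away session parameters like this keeps the crawler from treating the same page as many different URLs.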

6.       Modify nutch-site.xml
<property>
<name>http.content.limit</name>
<value>9065536</value> <!-- if pages are not fetched correctly, set a larger limit here -->
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
</description>
</property>
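
The rule in the description can be sketched in a few lines of Python: a nonnegative limit cuts the downloaded content at that many bytes, and a negative limit leaves it whole. The function name is made up for illustration, and 65536 is used only as a smaller limit for comparison.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes when limit >= 0, otherwise keep it all."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content

page = b"x" * 200_000  # stand-in for a 200 KB page

print(len(apply_content_limit(page, 65536)))    # cut at the smaller limit
print(len(apply_content_limit(page, 9065536)))  # fits under the larger limit
print(len(apply_content_limit(page, -1)))       # negative limit: no truncation
```

A page cut off in the middle loses its trailing text and links, which is why raising the limit helps when large pages are not crawled correctly.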

Copy the contents of nutch-site.xml into nutch-default.xml so that the two files stay identical.
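
If the copy is done often, it can be scripted. The following is a small sketch, assuming both files sit in the same conf directory; the function name is hypothetical, and the byte-for-byte check at the end is just a convenience to confirm the files really match.

```python
import filecmp
import shutil
from pathlib import Path

def sync_default_with_site(conf_dir):
    """Copy nutch-site.xml over nutch-default.xml and report whether they now match."""
    conf = Path(conf_dir)
    site = conf / "nutch-site.xml"
    default = conf / "nutch-default.xml"
    shutil.copyfile(site, default)
    # shallow=False compares file contents, not just size and timestamps.
    return filecmp.cmp(site, default, shallow=False)
```

Calling `sync_default_with_site(r"\cygwin\nutch-0.7.2\conf")` would overwrite nutch-default.xml in that directory, so keeping a backup of the original file first is a sensible precaution.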

posted on 2006-10-29 15:36 祝奕