Nutch HTTP content truncation problem

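Problem:

The list page was expected to yield 355 + 6 links, but only 220 were actually extracted. The cause is that Nutch limits the length of content downloaded over HTTP: with the limit at 65536 bytes, anything past that point is truncated, so links further down the page are never seen.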

Solution:
Raise this property to 10 times its current value.

vim conf/nutch-default.xml
Change the http.content.limit property from 65536 to 655360:
<property>
  <name>http.content.limit</name>
  <value>655360</value>  <!-- raised from 65536; some HTML pages really are quite large -->
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
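
Editing nutch-default.xml works, but Nutch also reads conf/nutch-site.xml, and values set there override the defaults. The same change could therefore be kept in that file instead, which leaves the shipped defaults untouched. A minimal sketch, assuming a stock conf/ directory and that nutch-site.xml holds no other overrides yet:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- 10 x the 65536-byte value used before; a negative value disables truncation -->
    <value>655360</value>
  </property>
</configuration>
```

Either way, the page has to be fetched again before the larger limit takes effect.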



//div[@class='com_page']/ul/li/span/a/@href
extract 355 outlinks
//div[@class='page_link']/a/@href
extract 6 outlinks
found 361 outlinks in http://www.ly.com/news/scenery.html

After the change, extraction works as expected: all 355 + 6 = 361 outlinks are found.

posted on 2014-09-01 12:44 雨渐渐
