nutch - Post category - 雨渐渐

failed with: java.lang.NullPointerException

Summary: failed with: java.lang.NullPointerException. The following must be set in nutch's configuration file 'conf/nutch-site.xml', otherwise the error above is reported. crawl-urlfilter.txt also needs entries that match the domains listed in urls/url.txt. ht...

posted @ 2014-09-15 10:38 by 雨渐渐

java.io.IOException: Cannot run program "bash": error=12, Cannot allocate memory

Summary: java.io.IOException: Cannot run program "bash": error=12, Cannot allocate memory. Exception reported when running nutch on a cloud server. Solution: http://daimajishu.iteye.com/blog/959213 Recently, while testing on a single machine, H...

posted @ 2014-09-15 10:23 by 雨渐渐

NUTCH Exception in thread "Thread-12751" java.lang.OutOfMemoryError: PermGen space

Summary: Reposted from: http://greemranqq.iteye.com/blog/1705867 Reposted from: http://www.cnblogs.com/xwdreamer/archive/2011/11/21/2296930.html Modify the bin/nutch script and add #!/bin/bash# # The Nu...

posted @ 2014-09-12 09:38 by 雨渐渐

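nutch: how the producer queue size is controlled (threadcount * 50)

Summary: If topN is set to 10 million, those 10 million entries are not all loaded into the QueueFeeder (memory). Instead, the QueueFeeder is refilled continuously by iterating over the file system (hdfs). By default the queue holds threadcount * 50 entries. The role of this class is to read files from the file system and fill the queue. /** * This class fee...

posted @ 2014-09-06 01:37 by 雨渐渐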
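The refill behaviour described in the entry above can be sketched roughly as follows. This is a minimal illustration, not Nutch's QueueFeeder: the class `BoundedFeederSketch`, its methods, and the in-memory iterator standing in for the fetch list on hdfs are all hypothetical. Only the `threadcount * 50` capacity comes from the post.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;
import java.util.stream.IntStream;

/** Hypothetical stand-in for a feeder; it is not Nutch's QueueFeeder. */
class BoundedFeederSketch {
    private final Iterator<String> source;          // stands in for the fetch list on disk
    private final Queue<String> queue = new ArrayDeque<>();
    private final int capacity;                     // threadCount * 50, as stated in the post

    BoundedFeederSketch(Iterator<String> source, int threadCount) {
        this.source = source;
        this.capacity = threadCount * 50;
    }

    /** Tops the queue up to capacity and returns how many entries were added. */
    int refill() {
        int added = 0;
        while (queue.size() < capacity && source.hasNext()) {
            queue.add(source.next());
            added++;
        }
        return added;
    }

    /** Hands one entry to a fetcher thread, or null if the queue is empty. */
    String poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        // 10,000 made-up URLs; only capacity-many of them are in memory at once.
        Iterator<String> urls = IntStream.range(0, 10_000)
                .mapToObj(i -> "http://example.com/" + i)
                .iterator();
        BoundedFeederSketch feeder = new BoundedFeederSketch(urls, 10);
        feeder.refill();
        System.out.println("queued after first refill: " + feeder.size());
        feeder.poll();
        System.out.println("added by next refill: " + feeder.refill());
    }
}
```

The point of the design is that memory use is bounded by the number of fetcher threads rather than by topN.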

Exception: http://www.ly.com/news/visa.html: java.io.IOException: unzipBestEffort returned null

Summary: nutch runtime exception: http://www.ly.com/news/visa.html: java.io.IOException: unzipBestEffort returned null Reference: http://www.tuicool.com/articles/faUB73 This page uses a segmented...

posted @ 2014-09-04 19:34 by 雨渐渐

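nutch http file truncation problem

Summary: Problem: the list page was expected to yield 355+6 links, but only 220 were actually extracted. The cause is that nutch limits the length of content downloaded over http. Solution: increase this property tenfold. vim conf/nutch-default.xml and edit the http.content.limit property, changing it from 65536 to 655360 ht...

posted @ 2014-09-01 12:44 by 雨渐渐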
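To make the effect of http.content.limit in the entry above concrete, here is a small sketch of byte-level truncation. `ContentLimitSketch` and `applyLimit` are hypothetical names, not Nutch code; the sketch assumes that a negative limit means "do not truncate", which matches the property's documented behaviour as far as I know.

```java
import java.util.Arrays;

/** Hypothetical illustration of a download size limit; not Nutch's protocol code. */
class ContentLimitSketch {

    /**
     * Returns the content cut to at most `limit` bytes.
     * Assumption: a negative limit disables truncation.
     */
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    public static void main(String[] args) {
        byte[] page = new byte[200_000]; // a made-up 200,000-byte list page

        // With the old value the tail of the page, and every link in it, is lost.
        System.out.println(applyLimit(page, 65536).length);

        // With the tenfold value the whole page survives.
        System.out.println(applyLimit(page, 655360).length);
    }
}
```

Links that sit past the cut-off never reach the parser, which would explain extracting 220 links where more were expected.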

nutch index

Summary: nutch development environment setup: importing nutch-1.3 into eclipse; importing nutch-1.7 into eclipse. nutch deployment: deploying nutch-1.3 on linux; compiling nutch-1.7; deployment changes between nutch-1.2 and nutch-1.3; nutch-2.2.1 hadoop-1.2.1 hbase-0.92.1 cluster...

posted @ 2014-08-28 17:00 by 雨渐渐

Summary: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209) at org.apache.nutch.crawl.Crawl.run(Crawl.java:143) at org.apache.hadoop.util....

posted @ 2014-03-24 09:10 by 雨渐渐

nutch-2.2.1 hadoop-1.2.1 hbase-0.92.1 cluster deployment

Summary: Reference sites: http://blog.csdn.net/weijonathan/article/details/10178919 a complete deployment walkthrough, differing only in versions http://m.blog.csdn.net/blog/WeiJonathan/9251597 杨尚川's blog (nutch distributed...

posted @ 2014-02-08 14:02 by 雨渐渐

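nutch: a large list of websites

Summary: Download URL: http://rdf.dmoz.org/rdf/content.rdf.u8.gz DMOZ is a well-known open directory (Open Directory Project). It is called an open directory because, unlike ordinary directory sites that rely on in-house staff for editing, DMOZ is instead handled by volunteers from around the world who jointly...

posted @ 2014-01-16 16:43 by 雨渐渐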

nutch fetcher.server.delay

Summary: 1 Configuration factor: fetcher.server.delay 0.0 The number of seconds the fetcher will delay between successive requests to the same server. 2 Robots protocol factor: FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID); fiq.crawlDelay = rules.getCrawlDelay(); if (LOG.isDebugEnabled()) {...

posted @ 2013-11-25 16:34 by 雨渐渐
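The two factors in the entry above interact roughly as sketched here. This is a simplified model, not the Fetcher's actual code: `CrawlDelaySketch` and `effectiveDelayMs` are hypothetical names. The sketch assumes the robots crawl delay is expressed in milliseconds and overrides the configured value only when it is positive.

```java
/** Hypothetical model of how a per-queue delay might be chosen; not Nutch code. */
class CrawlDelaySketch {

    /**
     * @param serverDelaySeconds value of fetcher.server.delay, in seconds
     * @param robotsCrawlDelayMs crawl delay taken from robots.txt, in milliseconds;
     *                           assumed to be 0 or negative when the site sets none
     * @return delay between successive requests to the same server, in milliseconds
     */
    static long effectiveDelayMs(double serverDelaySeconds, long robotsCrawlDelayMs) {
        if (robotsCrawlDelayMs > 0) {
            return robotsCrawlDelayMs;            // the site's own rule wins
        }
        return (long) (serverDelaySeconds * 1000); // fall back to the configuration
    }

    public static void main(String[] args) {
        // Configured delay of 0.0 s, but the site asks for 10 s in robots.txt.
        System.out.println(effectiveDelayMs(0.0, 10_000));
        // Configured delay of 0.0 s and no robots rule.
        System.out.println(effectiveDelayMs(0.0, 0));
    }
}
```

Under this model, setting fetcher.server.delay to 0.0 does not remove the delay for sites whose robots.txt declares a crawl delay.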

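nutch Fetcher phase in detail

Summary: job.setSpeculativeExecution(false); During the page-fetching phase the same task must not run more than once, otherwise pages would be fetched twice. Speculative execution exists to make full use of idle resources and speed up map and reduce: several copies of a map or reduce task run at the same time, the first to finish wins, and the others are killed.

posted @ 2013-11-25 11:42 by 雨渐渐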

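nutch getOutLinks: handling of outlinks

Summary: Reposted from: http://blog.csdn.net/witsmakemen/article/details/8067530 Tracing showed that the Fetcher parses links from fetched pages correctly and obtains every link in the page. They are then written out in the output() function through the FetcherOutputFormat class (contained in ParseResu...

posted @ 2013-11-18 15:59 by 雨渐渐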

could only be replicated to 0 nodes, instead of 1

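Summary: Over the weekend the machine room lost power, after which hadoop raised the error in the title. The fix was to turn off the firewall on all nodes. The relevant commands: check firewall status: /etc/init.d/iptables status; stop the firewall temporarily: /etc/init.d/iptables stop; prevent the firewall from starting at boot: /sbin/chkconfig --leve...

posted @ 2013-11-15 14:52 by 雨渐渐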

SegmentReader batch dump

Summary: /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "

posted @ 2013-10-29 09:39 by 雨渐渐

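nutch custom development

Summary: /* Depth control */ Depth control: nutch does a depth traversal of the wide web, whereas we need vertical crawling (that is, crawling only one section). For example, with 20 index pages in total, if each page links only to the next page the depth is 20; if the pager lists 1 2 3 4 5 ... 20, a depth of 2 is enough. The depth is not known in advance, which amounts to one more parameter and is awkward to manage. Solution: set the depth to infinity and rely on segments, not depth, to end the crawl. /* Batch dump */ Purpose: the -dump command provided by the org.apache.nutch.segment.SegmentReader class reads page information from a single segment only. To implement batch dump, the code was changed so that the input path becomes \crawl\segments and it traverses the ... under segments

posted @ 2013-10-08 10:58 by 雨渐渐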
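The depth figures in the entry above (20 versus 2) can be reproduced with a breadth-first traversal over a toy link graph. The sketch below is only an illustration: `CrawlDepthSketch`, `roundsNeeded`, `nextOnly` and `fullPager` are hypothetical names, pages are modelled as integers, and one "round" stands for one generate/fetch cycle with the start page fetched in round 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of crawl depth; not Nutch code. */
class CrawlDepthSketch {

    /** Counts the rounds needed until every page reachable from start is fetched. */
    static int roundsNeeded(Map<Integer, List<Integer>> links, int start) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> frontier = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        int rounds = 0;
        while (!frontier.isEmpty()) {
            rounds++; // fetch everything currently in the frontier
            List<Integer> next = new ArrayList<>();
            for (Integer page : frontier) {
                for (Integer target : links.getOrDefault(page, Collections.<Integer>emptyList())) {
                    if (seen.add(target)) { // true only for pages not seen before
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        return rounds;
    }

    /** Index pages where page i links only to page i + 1. */
    static Map<Integer, List<Integer>> nextOnly(int pages) {
        Map<Integer, List<Integer>> links = new HashMap<>();
        for (int i = 1; i < pages; i++) {
            links.put(i, Collections.singletonList(i + 1));
        }
        return links;
    }

    /** Index pages where every page carries a pager linking to all pages. */
    static Map<Integer, List<Integer>> fullPager(int pages) {
        List<Integer> all = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            all.add(i);
        }
        Map<Integer, List<Integer>> links = new HashMap<>();
        for (int i = 1; i <= pages; i++) {
            links.put(i, all);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(roundsNeeded(nextOnly(20), 1));  // "next page" links only
        System.out.println(roundsNeeded(fullPager(20), 1)); // full 1..20 pager
    }
}
```

Because the required number of rounds depends on how the site links its index pages, a fixed depth is hard to choose up front, which is the motivation the post gives for ending the crawl by other means.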

nutch crawl efficiency problem

Summary: http://hi.baidu.com/jacklin/item/a8fbccf479f6a1d042c36a7c One more article: http://blog.csdn.net/laigood/article/details/6233561 fetcher.threads.per.host fetcher.thr...

posted @ 2013-09-23 15:23 by 雨渐渐

Workflow of the Fetcher class

Summary: Fetcher class workflow: FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME)); job.setInputFormat(InputFormat.class); ----------------Part 1------------------------ job.setMapRunnerClass(Fetcher.class); The Fetcher class implements the MapRunnable interface; its main job is to start the producer and the consumers. Fetcher extends Configured implements T

posted @ 2013-09-23 12:16 by 雨渐渐
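The "start the producer, then start the consumers" pattern mentioned in the entry above can be sketched with plain Java threads. Nothing here is taken from the Fetcher source: `ProducerConsumerSketch`, `run`, the `DONE` marker and the example URLs are all hypothetical, and the "fetch" is reduced to incrementing a counter.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical producer/consumer skeleton; not Nutch's Fetcher. */
class ProducerConsumerSketch {
    private static final String DONE = "__done__"; // marker telling a consumer to stop

    /** Feeds `items` URLs to `consumers` worker threads and returns how many were handled. */
    static int run(final int items, final int consumers) {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(consumers * 50);
        final AtomicInteger fetched = new AtomicInteger();

        // Producer: fills the bounded queue, blocking whenever it is full.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.put("http://example.com/" + i);
                }
                for (int c = 0; c < consumers; c++) {
                    queue.put(DONE); // one stop marker per consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: each takes URLs until it sees the stop marker.
        Thread[] workers = new Thread[consumers];
        for (int c = 0; c < consumers; c++) {
            workers[c] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();
                        if (DONE.equals(url)) {
                            break;
                        }
                        fetched.incrementAndGet(); // placeholder for the real fetch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        producer.start();
        for (Thread worker : workers) {
            worker.start();
        }
        try {
            producer.join();
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for threads", e);
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        System.out.println("handled: " + run(1000, 4));
    }
}
```

The bounded queue is what couples the two sides: the producer can only run ahead of the consumers by the queue's capacity.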

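nutch 1.7: how to compile and release after modifying the code, plus a guide to cluster crawling

Summary: Since nutch 1.3, the distributed executables have been separated from the single-machine executables. This continues from the previous post, importing nutch 1.7 into eclipse. Problem addressed in this post: once downloaded, nutch can crawl after simple configuration, but sometimes we need to modify the nutch source code (for example, to ignore the robots protocol, or to save the page encoding). How do we then compile it into an executable?...

posted @ 2013-09-18 16:52 by 雨渐渐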

nutch 1.7: importing into eclipse

Summary: Recommended development environment: ubuntu + eclipse (windows + cygwin + eclipse is not recommended). Step 1: download. Download the src and bin archives from http://archive.apache.org/dist/nutch/ wget 'http://archive.apache.org/di...

posted @ 2013-09-16 12:59 by 雨渐渐
