Workflow of the Fetcher class
The fetch job reads its input from the segment's generate directory:

    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
    job.setInputFormat(InputFormat.class);

---------------- Part 1 ------------------------

    job.setMapRunnerClass(Fetcher.class);

The Fetcher class implements the MapRunnable<Text, CrawlDatum, Text, NutchWritable> interface. Its main job is to start the producer and the consumers.

    Fetcher extends Configured implements Tool, MapRunnable<Text, CrawlDatum, Text, NutchWritable>

    public void run(RecordReader<Text, CrawlDatum> input,
                    OutputCollector<Text, NutchWritable> output,
                    Reporter reporter) {
      // Start the producer: it feeds input records into the fetch queues.
      feeder = new QueueFeeder(input, fetchQueues, threadCount * queueDepthMuliplier);
      feeder.start();

      // Start the consumers.
      for (int i = 0; i < threadCount; i++) { // spawn threads
        new FetcherThread(getConf()).start();
      }
    }

The FetcherThread class downloads the pages and writes the results to multiple outputs, as shown in Part 2.

---------------- Part 2: multiple outputs ------------------------

    FileOutputFormat.setOutputPath(job, segment);
    job.setOutputFormat(FetcherOutputFormat.class); /** Splits FetcherOutput entries into multiple map files. */

    output.collect(key, new NutchWritable(datum));
    output.collect(key, new NutchWritable(content));
    output.collect(url, new NutchWritable(new ParseImpl(new ParseText(parse.getText()), parseData, parse.isCanonical())));
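The producer/consumer arrangement from Part 1 can be illustrated with a small standalone sketch. This is not Nutch code: FetcherSketch, Feeder, Worker, and the "fetched:" prefix are hypothetical names made up for illustration, a single bounded BlockingQueue stands in for Nutch's fetch queues, and the "download" is simulated by string concatenation. The only thing carried over from the original is the idea of one feeder thread filling a queue sized threadCount * queueDepthMultiplier while threadCount worker threads drain it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FetcherSketch {
    // Sentinel object: the feeder enqueues one per worker to signal end of input.
    // Compared by identity, so it can never collide with a real URL string.
    private static final String POISON = new String("<end-of-input>");

    /** Producer (hypothetical stand-in for a queue feeder). */
    static class Feeder extends Thread {
        private final Iterator<String> input;
        private final BlockingQueue<String> queue;
        private final int workers;

        Feeder(Iterator<String> input, BlockingQueue<String> queue, int workers) {
            this.input = input;
            this.queue = queue;
            this.workers = workers;
        }

        @Override
        public void run() {
            try {
                while (input.hasNext()) {
                    queue.put(input.next()); // blocks while the queue is full
                }
                for (int i = 0; i < workers; i++) {
                    queue.put(POISON);       // one stop signal per worker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Consumer (hypothetical stand-in for a fetcher thread). */
    static class Worker extends Thread {
        private final BlockingQueue<String> queue;
        private final List<String> output;

        Worker(BlockingQueue<String> queue, List<String> output) {
            this.queue = queue;
            this.output = output;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String url = queue.take();    // blocks while the queue is empty
                    if (url == POISON) {
                        return;                   // end of input reached
                    }
                    output.add("fetched:" + url); // simulated download
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Starts one feeder and threadCount workers, then waits for all of them. */
    static List<String> run(List<String> urls, int threadCount, int queueDepthMultiplier)
            throws InterruptedException {
        BlockingQueue<String> queue =
                new LinkedBlockingQueue<>(threadCount * queueDepthMultiplier);
        List<String> output = Collections.synchronizedList(new ArrayList<>());

        Feeder feeder = new Feeder(urls.iterator(), queue, threadCount);
        feeder.start();

        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) { // spawn threads
            Worker w = new Worker(queue, output);
            w.start();
            workers.add(w);
        }

        feeder.join();
        for (Worker w : workers) {
            w.join();
        }
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = run(Arrays.asList("http://a/", "http://b/", "http://c/"), 2, 50);
        System.out.println(out.size() + " records fetched");
    }
}
```

The order of entries in the returned list depends on thread scheduling, so callers should not rely on it. The bounded queue is the design point worth noting: it keeps the feeder from reading the whole input into memory ahead of the workers.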