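Starting with this post, we analyze the source code of the processors in Heritrix 3.1.0. In Heritrix, every pending CrawlURI (curi) object has to pass through a series of processors before its handling is complete.
Because there are many processors, the system organizes them in two ways: through the inheritance hierarchy of the processor classes themselves, and by grouping processors with related functions into the same processor chain.
Heritrix 3.1.0 is logically organized into two major processor chains, FetchChain and DispositionChain; CandidateChain logically belongs to DispositionChain.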
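Let's first look at the UML diagram of the processor chains and processors.
The above is the static class diagram. A ProcessorChain holds a collection of Processor objects and implements the Iterable<E> interface.
I said above that, at runtime, the system is logically organized into two major processor chains (FetchChain and DispositionChain, with CandidateChain logically belonging to DispositionChain).
The processor lists below show what I mean.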
FetchChain (org.archive.modules.FetchChain) contains the following processors (the handling of seed URLs differs slightly and is analyzed in a later post):
org.archive.crawler.prefetch.Preselector
org.archive.crawler.prefetch.PreconditionEnforcer
org.archive.modules.fetcher.FetchDNS
org.archive.modules.fetcher.FetchHTTP
org.archive.modules.extractor.ExtractorHTTP
org.archive.modules.extractor.ExtractorHTML
org.archive.modules.extractor.ExtractorCSS
org.archive.modules.extractor.ExtractorJS
org.archive.modules.extractor.ExtractorSWF
DispositionChain (org.archive.modules.DispositionChain) contains the following processors:
org.archive.modules.writer.MyWriterProcessor
org.archive.crawler.postprocessor.CandidatesProcessor
org.archive.crawler.postprocessor.DispositionProcessor
At runtime, the CandidatesProcessor drives CandidateChain (org.archive.modules.CandidateChain), which contains the following processors:
org.archive.crawler.prefetch.CandidateScoper
org.archive.crawler.prefetch.FrontierPreparer
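If we switch to a dynamic diagram of the objects actually running in the system, we can see an imperfect combination of the Composite and Iterator patterns. Why imperfect?
Because ProcessorChain and Processor do not implement a common interface. Both do have a process method, but with different signatures, whereas in a true Composite the branch nodes and leaf nodes expose the same operation.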
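For contrast, here is a minimal sketch of what a "perfect" Composite might look like, where chain and processor share one interface. The UriHandler interface and the Leaf and Chain classes are hypothetical names invented for this illustration; Heritrix 3.1.0 has no such types, and the StringBuilder trace merely stands in for a CrawlURI.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class CompositeSketch {
    /** Hypothetical shared interface; not part of Heritrix. */
    interface UriHandler {
        void handle(StringBuilder trace);
    }

    /** Leaf node: stands in for a single processor. */
    static final class Leaf implements UriHandler {
        private final String name;
        Leaf(String name) { this.name = name; }
        @Override public void handle(StringBuilder trace) {
            trace.append(name).append(' ');
        }
    }

    /** Branch node: stands in for a chain and is itself a UriHandler. */
    static final class Chain implements UriHandler, Iterable<UriHandler> {
        private final List<UriHandler> children = new ArrayList<>();
        Chain add(UriHandler child) { children.add(child); return this; }
        @Override public Iterator<UriHandler> iterator() { return children.iterator(); }
        @Override public void handle(StringBuilder trace) {
            // Same operation as a leaf; it simply delegates to each child in order.
            for (UriHandler child : this) {
                child.handle(trace);
            }
        }
    }

    /** Nests a candidate chain inside a disposition chain and records the visiting order. */
    static String run() {
        Chain candidate = new Chain()
                .add(new Leaf("CandidateScoper"))
                .add(new Leaf("FrontierPreparer"));
        Chain disposition = new Chain()
                .add(new Leaf("Writer"))
                .add(candidate)
                .add(new Leaf("DispositionProcessor"));
        StringBuilder trace = new StringBuilder();
        disposition.handle(trace);
        return trace.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because a Chain is also a UriHandler in this sketch, one chain can be added directly as a child of another. In Heritrix the nesting goes through a dedicated processor (CandidatesProcessor) instead.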
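Let's first get familiar with the methods of ProcessorChain.
The class implements Iterable<Processor>, and its iterator() method iterates over the processor collection it holds. The relevant methods are: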
    KeyedProperties kp = new KeyedProperties();

    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    public int size() {
        return getProcessors().size();
    }

    public Iterator<Processor> iterator() {
        return getProcessors().iterator();
    }

    @SuppressWarnings("unchecked")
    public List<Processor> getProcessors() {
        return (List<Processor>) kp.get("processors");
    }

    public void setProcessors(List<Processor> processors) {
        kp.put("processors", processors);
    }
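The effect of implementing Iterable is that a chain can be used directly in a for-each loop. The following is a minimal sketch of that idea only: a plain HashMap stands in for KeyedProperties and processor names (strings) stand in for Processor objects, both of which are simplifications assumed for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class IterableChainSketch implements Iterable<String> {
    // Plain HashMap standing in for KeyedProperties (an assumption of this sketch).
    private final Map<String, Object> kp = new HashMap<>();

    @SuppressWarnings("unchecked")
    List<String> getProcessors() {
        return (List<String>) kp.get("processors");
    }

    void setProcessors(List<String> processors) {
        kp.put("processors", processors);
    }

    int size() {
        return getProcessors().size();
    }

    @Override
    public Iterator<String> iterator() {
        return getProcessors().iterator();
    }

    /** Walks the chain with for-each, which works because the chain is Iterable. */
    static String joinNames(IterableChainSketch chain) {
        StringBuilder sb = new StringBuilder();
        for (String name : chain) {
            if (sb.length() > 0) {
                sb.append(" -> ");
            }
            sb.append(name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        IterableChainSketch chain = new IterableChainSketch();
        chain.setProcessors(Arrays.asList("CandidateScoper", "FrontierPreparer"));
        System.out.println(chain.size() + ": " + joinNames(chain));
    }
}
```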
The most important method is void process(CrawlURI curi, ChainStatusReceiver thread), which iterates over the processors and calls each one's process method on the CrawlURI object:
    public void process(CrawlURI curi, ChainStatusReceiver thread) throws InterruptedException {
        assert KeyedProperties.overridesActiveFrom(curi);
        String skipToProc = null;

        ploop: for (Processor curProc : this) {
            if (skipToProc != null && !curProc.getBeanName().equals(skipToProc)) {
                continue;
            } else {
                skipToProc = null;
            }
            if (thread != null) {
                thread.atProcessor(curProc);
            }
            ArchiveUtils.continueCheck();
            ProcessResult pr = curProc.process(curi);
            switch (pr.getProcessStatus()) {
                case PROCEED:
                    continue;
                case FINISH:
                    break ploop;
                case JUMP:
                    skipToProc = pr.getJumpTarget();
                    continue;
            }
        }
    }
The ChainStatusReceiver interface is implemented by ToeThread, which receives the void atProcessor(Processor proc) callback.
ProcessResult holds the outcome of processing and wraps an enum with three values:
    public enum ProcessStatus {
        /**
         * The URI was processed normally, and no special action needs to
         * be taken by the framework.
         */
        PROCEED,

        /**
         * The Processor believes that the ProcessorURI is invalid, or
         * otherwise incapable of further processing at this time. The
         * chain should skip subsequent processors, returning the URI.
         */
        FINISH,

        /**
         * The Processor has specified the next processor for the URI. The
         * chain should skip forward to that processor instead of the regularly
         * scheduled next processor.
         */
        JUMP,
    }
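To see how the three statuses steer the loop, here is a simplified, self-contained sketch of the same control flow. Status, Result and Step are stand-ins invented for this example (they are not the Heritrix classes), each step returns a canned result instead of doing real work, and the step names "a" to "d" are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChainLoopSketch {
    enum Status { PROCEED, FINISH, JUMP }

    /** Simplified stand-in for ProcessResult: a status plus an optional jump target. */
    static final class Result {
        final Status status;
        final String jumpTarget;
        private Result(Status status, String jumpTarget) {
            this.status = status;
            this.jumpTarget = jumpTarget;
        }
        static Result proceed() { return new Result(Status.PROCEED, null); }
        static Result finish() { return new Result(Status.FINISH, null); }
        static Result jump(String target) { return new Result(Status.JUMP, target); }
    }

    /** Simplified stand-in for Processor: a bean name and a canned result. */
    static final class Step {
        final String name;
        final Result result;
        Step(String name, Result result) {
            this.name = name;
            this.result = result;
        }
    }

    /** Mirrors the loop in ProcessorChain.process and returns the names of the steps that ran. */
    static List<String> run(List<Step> chain) {
        List<String> visited = new ArrayList<>();
        String skipTo = null;
        for (Step step : chain) {
            if (skipTo != null && !step.name.equals(skipTo)) {
                continue; // still skipping forward to the jump target
            }
            skipTo = null;
            visited.add(step.name);
            Result r = step.result;
            if (r.status == Status.FINISH) {
                break; // abandon the rest of the chain
            }
            if (r.status == Status.JUMP) {
                skipTo = r.jumpTarget;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Step> chain = Arrays.asList(
                new Step("a", Result.proceed()),
                new Step("b", Result.jump("d")),
                new Step("c", Result.proceed()),
                new Step("d", Result.proceed()));
        System.out.println(run(chain)); // prints [a, b, d]
    }
}
```

One consequence of this loop shape is that a jump can only go forward: if the target name matches no later step, every remaining step is skipped and the chain simply ends.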
ProcessorChain has three subclasses: FetchChain, DispositionChain, and CandidateChain.
None of them overrides any method; Heritrix 3.1.0 presumably defines them only to group the processor collections logically by chain.
---------------------------------------------------------------------------
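This Heritrix 3.1.0 source code analysis series is my own original work.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036879.html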