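If we approach Heritrix 3.1.0 only through its static logical structure, we tend to miss how its objects interact; if we analyze only the dynamic structure of its objects, we lose sight of the system's logical outline.
Source analysis therefore needs to combine the static and dynamic views, which makes both the logic and the interactions easier to follow. This article takes that approach.
The question here is what kind of bean management Spring brings to Heritrix 3.1.0. The previous article gave a first look at the Spring container's configuration file.
First, let us see how the Spring container loads the configuration file and how it is initialized. When we run the build operation on a crawl job,
the validateConfiguration() method of the CrawlJob object is called: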
    /**
     * Does the assembled ApplicationContext self-validate? Any failures
     * are reported as WARNING log events in the job log.
     *
     * TODO: make these severe?
     */
    public synchronized void validateConfiguration() {
        instantiateContainer();
        if(ac==null) {
            // fatal errors already encountered and reported
            return;
        }
        ac.validate();
        HashMap<String,Errors> allErrors = ac.getAllErrors();
        for(String name : allErrors.keySet()) {
            for(Object err : allErrors.get(name).getAllErrors()) {
                LOGGER.log(Level.WARNING,err.toString());
            }
        }
    }
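It first loads the Spring configuration file and initializes the Spring container (instantiateContainer()), then validates the container (ac.validate()).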
    /**
     * Can the configuration yield an assembled ApplicationContext?
     */
    public synchronized void instantiateContainer() {
        checkXML();
        if(ac==null) {
            try {
                ac = new PathSharingContext(new String[] {"file:"+primaryConfig.getAbsolutePath()},false,null);
                ac.addApplicationListener(this);
                ac.refresh();
                getCrawlController(); // trigger NoSuchBeanDefinitionException if no CC
                getJobLogger().log(Level.INFO,"Job instantiated");
            } catch (BeansException be) {
                // Calling doTeardown() and therefore ac.close() here sometimes
                // triggers an IllegalStateException and logs stack trace from
                // within spring, even if ac.isActive(). So, just null it.
                ac = null;
                beansException(be);
            }
        }
    }
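The method above loads the configuration file and registers the CrawlJob object itself as an application listener on the container.
The Spring container in Heritrix 3.1.0 is a PathSharingContext, the system's own wrapper. PathSharingContext extends Spring's FileSystemXmlApplicationContext, and the configuration file locations are passed in through its constructor: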
    public PathSharingContext(String[] configLocations, boolean refresh, ApplicationContext parent) throws BeansException {
        super(configLocations, refresh, parent);
    }
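Note that instantiateContainer() passes refresh=false to this constructor, registers the listener, and only then calls refresh(). The following is a minimal, dependency-free sketch of why that ordering can matter. MiniContext, Listener, and the config path are hypothetical stand-ins invented for illustration; they are not Spring or Heritrix classes, and the sketch makes no claim about what Spring itself does in eager mode.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredRefreshSketch {

    /** Hypothetical stand-in for an application listener. */
    interface Listener {
        void onEvent(String event);
    }

    /** Hypothetical stand-in for a context whose refresh can be deferred. */
    static class MiniContext {
        private final String[] configLocations;
        private final List<Listener> listeners = new ArrayList<>();

        MiniContext(String[] configLocations, boolean refresh) {
            this.configLocations = configLocations.clone();
            if (refresh) {
                refresh(); // eager mode: nobody has had a chance to subscribe yet
            }
        }

        void addListener(Listener listener) {
            listeners.add(listener);
        }

        void refresh() {
            // A real container would parse configLocations and build its beans here.
            for (Listener listener : listeners) {
                listener.onEvent("refreshed");
            }
        }
    }

    /** Builds a context in the given mode and returns the events a listener saw. */
    static List<String> eventsSeen(boolean eagerRefresh) {
        List<String> seen = new ArrayList<>();
        // Hypothetical path, for illustration only.
        MiniContext ctx = new MiniContext(new String[] {"file:/path/to/job-config.xml"}, eagerRefresh);
        ctx.addListener(seen::add);
        if (!eagerRefresh) {
            ctx.refresh(); // deferred mode: same order as instantiateContainer()
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("deferred: " + eventsSeen(false));
        System.out.println("eager:    " + eventsSeen(true));
    }
}
```

In this sketch the listener only observes the refresh when the refresh is deferred until after registration, which is the order CrawlJob uses.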
When we run the launch operation on a crawl job, the launch() method of the CrawlJob object is called:
    /**
     * Launch a crawl into 'running' status, assembling if necessary.
     *
     * (Note the crawl may have been configured to start in a 'paused'
     * state.)
     */
    public synchronized void launch() {
        if (isProfile()) {
            throw new IllegalArgumentException("Can't launch profile" + this);
        }
        if(isRunning()) {
            getJobLogger().log(Level.SEVERE,"Can't relaunch running job");
            return;
        } else {
            CrawlController cc = getCrawlController();
            if(cc!=null && cc.hasStarted()) {
                getJobLogger().log(Level.SEVERE,"Can't relaunch previously-launched assembled job");
                return;
            }
        }
        validateConfiguration();
        if(!hasValidApplicationContext()) {
            getJobLogger().log(Level.SEVERE,"Can't launch problem configuration");
            return;
        }
        //final String job = changeState(j, ACTIVE);
        // this temporary thread ensures all crawl-created threads
        // land in the AlertThreadGroup, to assist crawl-wide
        // logging/alerting
        alertThreadGroup = new AlertThreadGroup(getShortName());
        alertThreadGroup.addLogger(getJobLogger());
        Thread launcher = new Thread(alertThreadGroup, getShortName()+" launchthread") {
            public void run() {
                CrawlController cc = getCrawlController();
                startContext();
                if(cc!=null) {
                    cc.requestCrawlStart();
                }
            }
        };
        getJobLogger().log(Level.INFO,"Job launched");
        scanJobLog();
        launcher.start();
        // look busy (and give startContext/crawlStart a chance)
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            // do nothing
        }
    }
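The comment in launch() about the temporary thread relies on a standard Java rule: a thread created without an explicit ThreadGroup joins the group of the thread that creates it (when no security manager intervenes). The sketch below demonstrates only that rule with a plain java.lang.ThreadGroup; the class name ThreadGroupSketch and the group name "myJob" are made up for illustration, and AlertThreadGroup's logging behavior is not modeled.

```java
public class ThreadGroupSketch {

    /**
     * Starts a launcher thread inside a named group and returns the group
     * name of a thread that the launcher creates without naming a group.
     */
    static String groupOfChildThread(String groupName) {
        ThreadGroup group = new ThreadGroup(groupName);
        final String[] childGroup = new String[1];
        Thread launcher = new Thread(group, groupName + " launchthread") {
            @Override
            public void run() {
                // No group given: the child joins the creating thread's group.
                Thread child = new Thread(() -> { });
                childGroup[0] = child.getThreadGroup().getName();
            }
        };
        launcher.start();
        try {
            launcher.join(); // join() also makes the launcher's write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return childGroup[0];
    }

    public static void main(String[] args) {
        System.out.println(groupOfChildThread("myJob")); // prints "myJob"
    }
}
```

Because everything the crawl starts is created from the launcher thread or its descendants, those threads end up in the same group without each caller having to pass the group explicitly.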
The key call here is startContext(), made from the launcher thread's run() method:
    /**
     * Start the context, catching and reporting any BeansExceptions.
     */
    protected synchronized void startContext() {
        try {
            ac.start();
            // job log file covering just this launch
            getJobLogger().removeHandler(currentLaunchJobLogHandler);
            File f = new File(ac.getCurrentLaunchDir(), "job.log");
            currentLaunchJobLogHandler = new FileHandler(f.getAbsolutePath(), true);
            currentLaunchJobLogHandler.setFormatter(new JobLogFormatter());
            getJobLogger().addHandler(currentLaunchJobLogHandler);
        } catch (BeansException be) {
            doTeardown();
            beansException(be);
        } catch (Exception e) {
            LOGGER.log(Level.SEVERE,e.getClass().getSimpleName()+": "+e.getMessage(),e);
            try {
                doTeardown();
            } catch (Exception e2) {
                e2.printStackTrace(System.err);
            }
        }
    }
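As an aside on the "job log file covering just this launch" step in startContext(): the pattern is to detach the previous launch's handler and attach a fresh FileHandler that points into the new launch directory. The sketch below reproduces that pattern with java.util.logging only. The class LaunchLogSketch, the logger name, and the temp directories are hypothetical; SimpleFormatter stands in for Heritrix's JobLogFormatter, and closing the old handler is an addition of the sketch that the original method does not show.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class LaunchLogSketch {

    private final Logger jobLogger;
    private FileHandler currentLaunchHandler; // null until the first launch

    LaunchLogSketch(String loggerName) {
        jobLogger = Logger.getLogger(loggerName);
        jobLogger.setUseParentHandlers(false); // keep the demo off the console
    }

    /** Points the job logger at job.log inside the given launch directory. */
    void startLaunch(File launchDir) throws IOException {
        // removeHandler returns silently for null, so the first launch needs no special case
        jobLogger.removeHandler(currentLaunchHandler);
        if (currentLaunchHandler != null) {
            currentLaunchHandler.close(); // sketch-only addition: release the previous file
        }
        File f = new File(launchDir, "job.log");
        currentLaunchHandler = new FileHandler(f.getAbsolutePath(), true);
        currentLaunchHandler.setFormatter(new SimpleFormatter());
        jobLogger.addHandler(currentLaunchHandler);
    }

    void log(String message) {
        jobLogger.log(Level.INFO, message);
    }

    void close() {
        if (currentLaunchHandler != null) {
            currentLaunchHandler.close();
        }
    }

    /** Runs two launches and returns the text of each launch's job.log. */
    static String[] demo() {
        try {
            File first = Files.createTempDirectory("launch1").toFile();
            File second = Files.createTempDirectory("launch2").toFile();
            LaunchLogSketch job = new LaunchLogSketch("sketch.job");
            job.startLaunch(first);
            job.log("first launch");
            job.startLaunch(second);
            job.log("second launch");
            job.close();
            return new String[] {
                new String(Files.readAllBytes(new File(first, "job.log").toPath())),
                new String(Files.readAllBytes(new File(second, "job.log").toPath()))
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String text : demo()) {
            System.out.println(text);
        }
    }
}
```

Each launch directory ends up with a job.log that contains only the messages logged during that launch.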
startContext() calls the start() method of the PathSharingContext object:
    @Override
    public void start() {
        initLaunchDir();
        super.start();
    }
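That call runs the start() method of every bean in the Spring container that implements the Lifecycle interface.
The Lifecycle interface declares the following methods, which define a bean component's life cycle: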
    public interface Lifecycle {

        /**
         * Start this component.
         * Should not throw an exception if the component is already running.
         * <p>In the case of a container, this will propagate the start signal
         * to all components that apply.
         */
        void start();

        /**
         * Stop this component.
         * Should not throw an exception if the component isn't started yet.
         * <p>In the case of a container, this will propagate the stop signal
         * to all components that apply.
         */
        void stop();

        /**
         * Check whether this component is currently running.
         * <p>In the case of a container, this will return <code>true</code>
         * only if <i>all</i> components that apply are currently running.
         * @return whether the component is currently running
         */
        boolean isRunning();

    }
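The contract in these Javadoc comments can be illustrated with a small, dependency-free sketch. The Lifecycle interface is redeclared locally with the same three methods so that the example runs without Spring; RecordingBean, MiniContainer, and the bean names are hypothetical, and the container here simply walks its beans in registration order, which is a simplification and not Spring's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LifecycleSketch {

    /** Same three methods as Spring's Lifecycle, redeclared to avoid dependencies. */
    interface Lifecycle {
        void start();
        void stop();
        boolean isRunning();
    }

    /** Hypothetical bean that records its start and stop calls in a shared journal. */
    static class RecordingBean implements Lifecycle {
        private final String name;
        private final List<String> journal;
        private boolean running;

        RecordingBean(String name, List<String> journal) {
            this.name = name;
            this.journal = journal;
        }

        public void start() {
            if (running) return; // starting an already running component is harmless
            running = true;
            journal.add("start:" + name);
        }

        public void stop() {
            if (!running) return; // stopping a component that never started is harmless
            running = false;
            journal.add("stop:" + name);
        }

        public boolean isRunning() {
            return running;
        }
    }

    /** Hypothetical container: itself a Lifecycle that forwards signals to its Lifecycle beans. */
    static class MiniContainer implements Lifecycle {
        private final Map<String, Object> beans = new LinkedHashMap<>();

        void register(String name, Object bean) {
            beans.put(name, bean);
        }

        public void start() {
            for (Object bean : beans.values()) {
                if (bean instanceof Lifecycle) ((Lifecycle) bean).start();
            }
        }

        public void stop() {
            for (Object bean : beans.values()) {
                if (bean instanceof Lifecycle) ((Lifecycle) bean).stop();
            }
        }

        public boolean isRunning() {
            // true only if all Lifecycle beans are running
            for (Object bean : beans.values()) {
                if (bean instanceof Lifecycle && !((Lifecycle) bean).isRunning()) return false;
            }
            return true;
        }
    }

    /** Registers two Lifecycle beans and one plain object, then starts twice and stops once. */
    static List<String> demo() {
        List<String> journal = new ArrayList<>();
        MiniContainer container = new MiniContainer();
        container.register("frontier", new RecordingBean("frontier", journal));
        container.register("plainSetting", "not a Lifecycle bean");
        container.register("crawlController", new RecordingBean("crawlController", journal));
        container.start();
        container.start(); // second start adds nothing to the journal
        container.stop();
        return journal;
    }

    public static void main(String[] args) {
        System.out.println(demo());
        // prints [start:frontier, start:crawlController, stop:frontier, stop:crawlController]
    }
}
```

The plain string bean is skipped entirely, and the repeated start() leaves no extra entries, which matches the "should not throw if already running" wording of the interface.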
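From this we can see that Heritrix 3.1.0 relies on the Spring container to manage its beans' life cycle in one place (chiefly their startup state).
To see which beans have their start() method called, I printed them out: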
name:scope
name:loggerModule||org.archive.crawler.reporting.CrawlerLoggerModule
name:scope||org.archive.modules.deciderules.DecideRuleSequence
name:candidateScoper
name:candidateScoper||org.archive.crawler.prefetch.CandidateScoper
name:preparer
name:preparer||org.archive.crawler.prefetch.FrontierPreparer
name:candidateProcessors
name:candidateProcessors||org.archive.modules.CandidateChain
name:preselector
name:preselector||org.archive.crawler.prefetch.MyPreselector
name:preconditions
name:bdb||org.archive.bdb.BdbModule
name:serverCache||org.archive.modules.net.BdbServerCache
name:preconditions||org.archive.crawler.prefetch.PreconditionEnforcer
name:fetchDns
name:fetchDns||org.archive.modules.fetcher.FetchDNS
name:fetchHttp
name:cookieStorage||org.archive.modules.fetcher.BdbCookieStorage
name:fetchHttp||org.archive.modules.fetcher.FetchHTTP
name:extractorHttp
name:statisticsTracker||org.archive.crawler.reporting.StatisticsTracker
name:extractorHtml||org.archive.modules.extractor.ExtractorHTML
name:extractorCss||org.archive.modules.extractor.ExtractorCSS
name:extractorJs||org.archive.modules.extractor.ExtractorJS
name:extractorSwf||org.archive.modules.extractor.ExtractorSWF
name:fetchProcessors||org.archive.modules.FetchChain
name:warcWriter||org.archive.modules.writer.MyWriterProcessor
name:candidates||org.archive.crawler.postprocessor.CandidatesProcessor
name:disposition||org.archive.crawler.postprocessor.DispositionProcessor
name:dispositionProcessors||org.archive.modules.DispositionChain
name:crawlController||org.archive.crawler.framework.CrawlController
name:uriUniqFilter||org.archive.crawler.util.BdbUriUniqFilter
name:frontier||org.archive.crawler.frontier.BdbFrontier
name:actionDirectory
name:actionDirectory||org.archive.crawler.framework.ActionDirectory
name:checkpointService
name:checkpointService||org.archive.crawler.framework.CheckpointService
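One plausible reading of the interleaving in this output (for example, loggerModule appearing between the two scope lines, and bdb and serverCache appearing before preconditions' second line) is that a bean's dependencies are started before the bean itself. The sketch below shows that ordering under stated assumptions: the dependency edges are inferred from the log order and are not taken from the Heritrix configuration, the "||started" suffix replaces the class name, and the code is not Spring's implementation and does not reproduce the full log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DependencyFirstStartSketch {

    /** Starts beans in declaration order, always starting a bean's dependencies first. */
    static List<String> startAll(Map<String, List<String>> dependsOn) {
        List<String> log = new ArrayList<>();
        Set<String> started = new LinkedHashSet<>();
        for (String name : dependsOn.keySet()) {
            if (!started.contains(name)) {
                log.add("name:" + name); // bean picked up by the outer loop
                start(name, dependsOn, started, log);
            }
        }
        return log;
    }

    private static void start(String name, Map<String, List<String>> dependsOn,
                              Set<String> started, List<String> log) {
        if (!started.add(name)) {
            return; // already handled; marking before recursing also stops cycles
        }
        for (String dep : dependsOn.getOrDefault(name, Collections.<String>emptyList())) {
            start(dep, dependsOn, started, log);
        }
        log.add("name:" + name + "||started"); // the bean's own start comes last
    }

    /** Dependency edges below are assumptions inferred from the printed order. */
    static List<String> demo() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("scope", Arrays.asList("loggerModule"));
        deps.put("loggerModule", Collections.<String>emptyList());
        deps.put("preconditions", Arrays.asList("serverCache"));
        deps.put("serverCache", Arrays.asList("bdb"));
        deps.put("bdb", Collections.<String>emptyList());
        return startAll(deps);
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

With these assumed edges the sketch prints name:scope, name:loggerModule||started, name:scope||started, then name:preconditions, name:bdb||started, name:serverCache||started, name:preconditions||started, which has the same shape as the corresponding lines above.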
---------------------------------------------------------------------------
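This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025410.html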