
This article continues our analysis of the Heritrix 3.1.0 source code, turning to the processors.

As usual, we start by glancing at the class UML diagram.

All processors inherit from the abstract parent class Processor, whose most important method is the following:

    /**
     * Processes the given URI.  First checks {@link #ENABLED} and
     * {@link #DECIDE_RULES}.  If ENABLED is false, then nothing happens.
     * If the DECIDE_RULES indicate REJECT, then the 
     * {@link #innerRejectProcess(ProcessorURI)} method is invoked, and
     * the process method returns.
     * 
     * <p>Next, the {@link #shouldProcess(ProcessorURI)} method is 
     * consulted to see if this Processor knows how to handle the given
     * URI.  If it returns false, then nothing further occurs.
     * 
     * <p>FIXME: Should innerRejectProcess be called when ENABLED is false,
     * or when shouldProcess returns false?  The previous Processor 
     * implementation didn't handle it that way.
     * 
     * <p>Otherwise, the URI is considered valid.  This processor's count
     * of handled URIs is incremented, and the 
     * {@link #innerProcess(ProcessorURI)} method is invoked to actually
     * perform the process.
     * 
     * @param uri  The URI to process
     * @throws  InterruptedException   if the thread is interrupted
     */
    public ProcessResult process(CrawlURI uri) 
    throws InterruptedException {
        if (!getEnabled()) {
            return ProcessResult.PROCEED;
        }
        
        if (getShouldProcessRule().decisionFor(uri) == DecideResult.REJECT) {
            innerRejectProcess(uri);
            return ProcessResult.PROCEED;
        }
        
        if (shouldProcess(uri)) {
            uriCount.incrementAndGet();
            return innerProcessResult(uri);
        } else {
            return ProcessResult.PROCEED;
        }
    }

process() first checks whether the processor is enabled and whether the decide rules accept the URI; it then consults shouldProcess(CrawlURI uri), an abstract method implemented by the subclasses, so each concrete processor decides for itself whether the current CrawlURI uri should pass through it.

If the URI passes those checks, process() goes on to call ProcessResult innerProcessResult(CrawlURI uri), which some subclasses override:

    protected ProcessResult innerProcessResult(CrawlURI uri)
    throws InterruptedException {
        innerProcess(uri);
        return ProcessResult.PROCEED;
    }

That method in turn calls void innerProcess(CrawlURI uri), an abstract method implemented by the subclasses (a minimal subclass sketch follows the declaration below):

    /**
     * Actually performs the process.  By the time this method is invoked,
     * it is known that the given URI passes the {@link #ENABLED}, the 
     * {@link #DECIDE_RULES} and the {@link #shouldProcess(ProcessorURI)}
     * tests.  
     * 
     * @param uri    the URI to process
     * @throws InterruptedException   if the thread is interrupted
     */
    protected abstract void innerProcess(CrawlURI uri) 
    throws InterruptedException;
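
To make the template-method flow concrete, below is a minimal sketch of a hypothetical subclass. ExampleLoggingProcessor is not part of Heritrix; it only illustrates which methods a concrete processor supplies (shouldProcess and innerProcess) and which parts the parent Processor drives (process and innerProcessResult).

import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

/**
 * Hypothetical example, not part of Heritrix: the inherited process() runs the
 * enabled/decide-rules checks and then calls the two methods supplied here.
 */
public class ExampleLoggingProcessor extends Processor {

    /** Only handle http/https URIs; everything else passes straight through. */
    @Override
    protected boolean shouldProcess(CrawlURI uri) {
        String scheme = uri.getUURI().getScheme();
        return "http".equals(scheme) || "https".equals(scheme);
    }

    /** Invoked via innerProcessResult() once all the checks have passed. */
    @Override
    protected void innerProcess(CrawlURI uri) throws InterruptedException {
        System.out.println("processing " + uri + " (status " + uri.getFetchStatus() + ")");
    }
}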

The subclasses of Processor fall logically into several broad categories; at runtime these belong to different processor chains, and each category has its own place in the class inheritance hierarchy.

In this article and the ones that follow I can only pick out some of these Processor subclasses for analysis.

The CandidatesProcessor: this processor holds a CandidateChain candidateChain member and invokes the processors of that chain.

A CrawlURI cURI object that passes through this processor is ultimately handed to BdbFrontier's schedule(CrawlURI cURI) method and thereby added to the BDB database.

    /**
     * Candidate chain
     */
    protected CandidateChain candidateChain;
    public CandidateChain getCandidateChain() {
        return this.candidateChain;
    }
    @Autowired
    public void setCandidateChain(CandidateChain candidateChain) {
        this.candidateChain = candidateChain;
    }
    
    /**
     * The frontier to use.
     */
    protected Frontier frontier;
    public Frontier getFrontier() {
        return this.frontier;
    }
    @Autowired
    public void setFrontier(Frontier frontier) {
        this.frontier = frontier;
    }

The processor method actually invoked is the following:

    /* (non-Javadoc)
     * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
     */
    @Override
    protected void innerProcess(final CrawlURI curi) throws InterruptedException {
        // Handle any prerequisites when S_DEFERRED for prereqs
        if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
            CrawlURI prereq = curi.getPrerequisiteUri();
            prereq.setFullVia(curi); 
            sheetOverlaysManager.applyOverlaysTo(prereq);
            try {
                KeyedProperties.clearOverridesFrom(curi); 
                KeyedProperties.loadOverridesFrom(prereq);

                getCandidateChain().process(prereq, null);
                
                if(prereq.getFetchStatus()>=0) {
                    
                    frontier.schedule(prereq);
                } else {
                    curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
                }
            } finally {
                KeyedProperties.clearOverridesFrom(prereq); 
                KeyedProperties.loadOverridesFrom(curi);
            }
            return;
        }

        // Don't consider candidate links of error pages
        if (curi.getFetchStatus() < 200 || curi.getFetchStatus() >= 400) {
            curi.getOutLinks().clear();
            return;
        }

        for (Link wref: curi.getOutLinks()) {
            CrawlURI candidate;
            try {
                candidate = curi.createCrawlURI(curi.getBaseURI(),wref);
                // at least for duration of candidatechain, offer
                // access to full CrawlURI of via
                candidate.setFullVia(curi); 
            } catch (URIException e) {
                loggerModule.logUriError(e, curi.getUURI(), 
                        wref.getDestination().toString());
                continue;
            }
            sheetOverlaysManager.applyOverlaysTo(candidate);
            try {
                KeyedProperties.clearOverridesFrom(curi); 
                KeyedProperties.loadOverridesFrom(candidate);
                
                if(getSeedsRedirectNewSeeds() && curi.isSeed() 
                        && wref.getHopType() == Hop.REFER
                        && candidate.getHopCount() < SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS) {
                    candidate.setSeed(true);                     
                }
                getCandidateChain().process(candidate, null); 
                if(candidate.getFetchStatus()>=0) {
                    if(checkForSeedPromotion(candidate)) {
                        /*
                         * We want to guarantee crawling of seed version of
                         * CrawlURI even if same url has already been enqueued,
                         * see https://webarchive.jira.com/browse/HER-1891
                         */
                        candidate.setForceFetch(true);                        
                        getSeeds().addSeed(candidate);
                    } else {                        
                        frontier.schedule(candidate);
                    }
                    curi.getOutCandidates().add(candidate);
                }
                
            } finally {
                KeyedProperties.clearOverridesFrom(candidate); 
                KeyedProperties.loadOverridesFrom(curi);
            }
        }
        curi.getOutLinks().clear();
    }
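
One detail worth noting in the loop above: the candidate chain reports its verdict through the candidate's fetch status, and only candidates whose status is still non-negative after getCandidateChain().process(...) are scheduled into the frontier. Presumably a processor in the chain (such as CandidateScoper) vetoes a URI by giving it a negative status; the following hypothetical sketch, which is not the actual CandidateScoper source, shows what such a vetoing processor looks like against the Processor API quoted earlier.

import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

/**
 * Hypothetical sketch, not the real CandidateScoper: it only illustrates the
 * contract visible in innerProcess() above, i.e. a candidate whose fetch
 * status has gone negative is never passed to frontier.schedule().
 */
public class HypotheticalVetoingScoper extends Processor {

    /** Stand-in constant; Heritrix defines its own out-of-scope status codes. */
    private static final int HYPOTHETICAL_OUT_OF_SCOPE = -4444;

    @Override
    protected boolean shouldProcess(CrawlURI uri) {
        return true; // inspect every candidate
    }

    @Override
    protected void innerProcess(CrawlURI uri) {
        if (!isWantedByThisCrawl(uri)) {
            // a negative status makes CandidatesProcessor skip scheduling this URI
            uri.setFetchStatus(HYPOTHETICAL_OUT_OF_SCOPE);
        }
    }

    private boolean isWantedByThisCrawl(CrawlURI uri) {
        // placeholder for real scope rules
        return uri.toString().startsWith("http://");
    }
}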

Looking at the crawl job configuration file crawler-beans.cxml, the processors that make up the CandidateChain candidateChain are configured as follows:

<!-- CANDIDATE CHAIN --> 
 <!-- first, processors are declared as top-level named beans -->
 <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
 </bean>
 <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
  <!-- <property name="preferenceDepthHops" value="-1" /> -->
  <!-- <property name="preferenceEmbedHops" value="1" /> -->
  <!-- <property name="canonicalizationPolicy"> 
        <ref bean="canonicalizationPolicy" />
       </property> -->
  <!-- <property name="queueAssignmentPolicy"> 
        <ref bean="queueAssignmentPolicy" />
       </property> -->
  <!-- <property name="uriPrecedencePolicy"> 
        <ref bean="uriPrecedencePolicy" />
       </property> -->
  <!-- <property name="costAssignmentPolicy"> 
        <ref bean="costAssignmentPolicy" />
       </property> -->
 </bean>
 <!-- now, processors are assembled into ordered CandidateChain bean -->
 <bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
   <list>
    <!-- apply scoping rules to each individual candidate URI... -->
    <ref bean="candidateScoper"/>
    <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->
    <ref bean="preparer"/>
   </list>
  </property>
 </bean>
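
The chain itself is essentially an ordered list of Processor beans, here candidateScoper followed by preparer. Conceptually, pushing one URI through the chain amounts to the loop sketched below; this is not the actual CandidateChain/ProcessorChain source (whose process() also takes a second argument, passed as null by CandidatesProcessor above), only an illustration of the ordered execution and of how a non-PROCEED ProcessResult can cut the chain short.

import java.util.List;

import org.archive.modules.CrawlURI;
import org.archive.modules.ProcessResult;
import org.archive.modules.Processor;

/** Illustrative only: a conceptual view of running one URI through a processor chain. */
public class ChainSketch {

    static void runThroughChain(List<Processor> processors, CrawlURI uri)
            throws InterruptedException {
        for (Processor p : processors) {
            ProcessResult result = p.process(uri);
            if (result != ProcessResult.PROCEED) {
                break; // e.g. FINISH: do not run the remaining processors for this URI
            }
        }
    }
}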

---------------------------------------------------------------------------

This Heritrix 3.1.0 source-code analysis series is my own original work.

Please credit the source when reposting: 博客园 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036954.html
