
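Next we look at the BdbFrontier class's void finished(CrawlURI curi) method, which wraps up the processing of a CrawlURI object.

The method is defined in AbstractFrontier, the parent of BdbFrontier's parent class: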

org.archive.crawler.frontier.BdbFrontier

      org.archive.crawler.frontier.AbstractFrontier

/**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     *  (non-Javadoc)
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    public void finished(CrawlURI curi) {
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processFinish(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi); 
        }
    }
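
The try/finally shape of finished() can be illustrated with a small standalone sketch. It assumes the per-URI overrides are held per thread and must be cleared even when processing throws. OverrideScopeSketch, its DEFAULTS map and the "politenessDelayMs" key are hypothetical stand-ins, not Heritrix's KeyedProperties API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-thread settings overlay (not Heritrix code).
class OverrideScopeSketch {
    private static final ThreadLocal<Map<String, String>> OVERRIDES = new ThreadLocal<>();
    private static final Map<String, String> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put("politenessDelayMs", "5000"); // made-up global default
    }

    static void loadOverrides(Map<String, String> perUriSettings) {
        OVERRIDES.set(perUriSettings);
    }

    static void clearOverrides() {
        OVERRIDES.remove();
    }

    // Look up a setting: the active override wins, otherwise the default.
    static String get(String key) {
        Map<String, String> active = OVERRIDES.get();
        if (active != null && active.containsKey(key)) {
            return active.get(key);
        }
        return DEFAULTS.get(key);
    }

    // Same shape as finished(): load, do the work, always clear.
    static String finished(Map<String, String> perUriSettings, boolean fail) {
        try {
            loadOverrides(perUriSettings);
            if (fail) {
                throw new IllegalStateException("simulated processing failure");
            }
            return get("politenessDelayMs");
        } finally {
            clearOverrides(); // runs on both the normal and the exceptional path
        }
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("politenessDelayMs", "250");
        System.out.println("inside scope: " + finished(settings, false));
        try {
            finished(settings, true);
        } catch (IllegalStateException e) {
            System.out.println("processing failed: " + e.getMessage());
        }
        System.out.println("after scope: " + get("politenessDelayMs"));
    }
}
```

Because the cleanup sits in the finally block, a failure inside the scope cannot leak one URI's settings into the processing of the next URI on the same thread.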

finished() in turn calls BdbFrontier's void processFinish(CrawlURI curi) method, which is defined in WorkQueueFrontier, BdbFrontier's direct parent class:

org.archive.crawler.frontier.BdbFrontier

                org.archive.crawler.frontier.WorkQueueFrontier

/**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as 
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case. 
     *  
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        
        long now = System.currentTimeMillis();

        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);
        
        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());
        
        assert (wq.peek(this) == curi) : "unexpected peek " + wq;

        int holderCost = curi.getHolderCost();

        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if(curi.getFetchStatus()!=S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            wq.update(this, curi); // rewrite any changes
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        wq.dequeue(this,curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());
        log(curi);

        
        if (curi.isSuccess()) {
            // codes deemed 'success' 
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
            doJournalFinishedSuccess(curi);
           
        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued, 
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);
            
        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }        
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);
            
        }

        wq.expend(holderCost); // successes & failures charge cost to queue
        
        long delay_ms = curi.getPolitenessDelay();
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        wq.makeDirty();
        
        if(curi.getRescheduleTime()>0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling(); 
            futureUris.put(curi.getRescheduleTime(),curi);
            futureUriCount.incrementAndGet(); 
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
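
The branching in processFinish() can be condensed into a small model, given here as a rough sketch rather than the actual implementation. Outcome, QueueState and finish() are hypothetical names, the boolean flags replace the real status checks (needsReenqueuing, isSuccess, isDisregarded), and snoozing, journaling and event publishing are left out.

```java
// Illustrative model only: these are not Heritrix classes.
class DispositionSketch {
    enum Outcome { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    static final class QueueState {
        long expended; // total cost charged to the queue so far
        int errors;    // failures noted against the queue
        int count;     // URIs still held by the queue

        QueueState(int count) {
            this.count = count;
        }
    }

    static Outcome finish(QueueState q, boolean needsRetry, boolean deferred,
                          boolean success, boolean disregarded, int holderCost) {
        if (needsRetry) {
            // The URI stays at the head of its queue; every retry except a
            // deferral is charged to the queue.
            if (!deferred) {
                q.expended += holderCost;
            }
            return Outcome.DEFERRED_FOR_RETRY;
        }

        // No retry: the URI leaves the queue for good.
        q.count--;
        Outcome outcome;
        if (success) {
            outcome = Outcome.SUCCEEDED;
        } else if (disregarded) {
            outcome = Outcome.DISREGARDED;
            holderCost = 0; // disregarded URIs are not charged
        } else {
            outcome = Outcome.FAILED;
            q.errors++; // simplification: the real code also charges a penalty amount
        }
        q.expended += holderCost; // successes and failures are charged
        return outcome;
    }

    public static void main(String[] args) {
        QueueState q = new QueueState(3);
        System.out.println(finish(q, true, false, false, false, 1));
        System.out.println(finish(q, false, false, true, false, 1));
        System.out.println(finish(q, false, false, false, true, 1));
        System.out.println(finish(q, false, false, false, false, 1));
        System.out.println("expended=" + q.expended + " errors=" + q.errors + " count=" + q.count);
    }
}
```

The point of the model is the asymmetry: a retry returns early and leaves the queue's count alone, while the other three outcomes always dequeue the URI first and differ only in what they charge.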

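processFinish() first reads the holder property of the CrawlURI curi. The holder is the BdbWorkQueue object that matches the curi's class key; this touches on work-queue scheduling in Heritrix 3.1.0, which a later post will cover.

If the URI is not going to be retried, processFinish() then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method: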

org.archive.crawler.frontier.BdbWorkQueue

      org.archive.crawler.frontier.WorkQueue

/**
     * Remove the peekItem from the queue and adjusts the count.
     * 
     * @param frontier  Work queues manager.
     */
    protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
        try {
            deleteItem(frontier, peekItem);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }

org.archive.crawler.frontier.BdbWorkQueue

protected void deleteItem(final WorkQueueFrontier frontier,
            final CrawlURI peekItem) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
             queues.delete(peekItem);
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

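Finally, deleteItem() calls the BdbMultipleWorkQueues object's void delete(CrawlURI item) method. Earlier posts in this series already covered that method, so it is not repeated here.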
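
The dequeue chain traced above (peek, delete from the backing store, clear the peeked item, decrement the count) can be sketched in isolation, including the two layers of exception wrapping. This is a simplified model under stated assumptions: Store, StoreException and Queue are hypothetical, an in-memory deque replaces the BDB database, and the identity check at the top of dequeue() is an addition for illustration.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

class DequeueSketch {
    // Hypothetical checked exception standing in for a storage-layer failure.
    static class StoreException extends Exception {
        StoreException(String message) {
            super(message);
        }
    }

    // Hypothetical in-memory backing store.
    static class Store {
        final Deque<String> items = new ArrayDeque<>();
        boolean broken = false;

        void delete(String item) throws StoreException {
            if (broken) {
                throw new StoreException("store unavailable");
            }
            items.remove(item);
        }
    }

    static class Queue {
        private final Store store;
        private String peekItem; // head item currently handed out, if any
        private long count;

        Queue(Store store) {
            this.store = store;
            this.count = store.items.size();
        }

        synchronized String peek() {
            if (peekItem == null) {
                peekItem = store.items.peekFirst();
            }
            return peekItem;
        }

        // Mirrors deleteItem(): storage failure becomes IOException.
        private void deleteItem(String item) throws IOException {
            try {
                store.delete(item);
            } catch (StoreException e) {
                throw new IOException(e);
            }
        }

        // Mirrors dequeue(): IOException becomes an unchecked exception.
        synchronized void dequeue(String expected) {
            if (peekItem == null || !peekItem.equals(expected)) {
                throw new IllegalStateException("unexpected item " + expected);
            }
            try {
                deleteItem(peekItem);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            peekItem = null; // the next peek() reads a fresh head
            count--;
        }

        synchronized long getCount() {
            return count;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.items.add("http://example.com/a");
        store.items.add("http://example.com/b");
        Queue queue = new Queue(store);
        String head = queue.peek();
        queue.dequeue(head);
        System.out.println("removed " + head + ", remaining " + queue.getCount());
    }
}
```

In this sketch a failed delete leaves both the peeked item and the count untouched, because the exception is thrown before either is updated; the same ordering is visible in the dequeue() method quoted earlier.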

---------------------------------------------------------------------------

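This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html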

posted on 2013-04-20 07:16 by 刺猬的温驯