Next we analyze the methods related to the BdbFrontier object's void finished(CrawlURI cURI) method, starting with processFinish(CrawlURI curi):
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * TODO: make as many decisions about what happens to the CrawlURI
 * (success, failure, retry) and queue (retire, snooze, ready) as
 * possible elsewhere, such as in DispositionProcessor. Then, break
 * this into simple branches or focused methods for each case.
 *
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
protected void processFinish(CrawlURI curi) {
    // assert Thread.currentThread() == managerThread;
    long now = System.currentTimeMillis();

    // count this fetch attempt
    curi.incrementFetchAttempts();
    logNonfatalErrors(curi);

    WorkQueue wq = (WorkQueue) curi.getHolder();
    // always refresh budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());

    assert (wq.peek(this) == curi) : "unexpected peek " + wq;

    int holderCost = curi.getHolderCost();

    // does the URI need to be retried?
    if (needsReenqueuing(curi)) {
        // codes/errors which don't consume the URI, leaving it atop queue
        if (curi.getFetchStatus() != S_DEFERRED) {
            wq.expend(holderCost); // all retries but DEFERRED cost
        }
        // retry delay, converted from seconds to milliseconds
        long delay_ms = retryDelayFor(curi) * 1000;
        curi.processingCleanup(); // lose state that shouldn't burden retry
        wq.unpeek(curi);
        // write the curi back to WorkQueue wq
        wq.update(this, curi); // rewrite any changes
        // send the queue to its next state
        handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
        doJournalReenqueued(curi);
        wq.makeDirty();
        return; // no further dequeueing, logging, rescheduling to occur
    }

    // Curi will definitely be disposed of without retry, so remove from queue
    // remove this CrawlURI curi from WorkQueue wq
    wq.dequeue(this, curi);
    decrementQueuedCount(1);
    largestQueues.update(wq.getClassKey(), wq.getCount());
    log(curi);

    if (curi.isSuccess()) {
        // codes deemed 'success'
        incrementSucceededFetchCount();
        totalProcessedBytes.addAndGet(curi.getRecordedSize());
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, SUCCEEDED));
        doJournalFinishedSuccess(curi);
    } else if (isDisregarded(curi)) {
        // codes meaning 'undo' (even though URI was enqueued,
        // we now want to disregard it from normal success/failure tallies)
        // (eg robots-excluded, operator-changed-scope, etc)
        incrementDisregardedUriCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DISREGARDED));
        holderCost = 0; // no charge for disregarded URIs
        // TODO: consider reinstating forget-URI capability, so URI could be
        // re-enqueued if discovered again
        doJournalDisregarded(curi);
    } else {
        // codes meaning 'failure'
        incrementFailedFetchCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, FAILED));
        // if exception, also send to crawlErrors
        if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
            Object[] array = { curi };
            loggerModule.getRuntimeErrors().log(Level.WARNING,
                    curi.getUURI().toString(), array);
        }
        // charge queue any extra error penalty
        wq.noteError(getErrorPenaltyAmount());
        doJournalFinishedFailure(curi);
    }

    wq.expend(holderCost); // successes & failures charge cost to queue

    // politeness delay
    long delay_ms = curi.getPolitenessDelay();
    //long delay_ms = 0;
    // send the queue to its next state
    handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
    wq.makeDirty();

    if (curi.getRescheduleTime() > 0) {
        // marked up for forced-revisit at a set time
        curi.processingCleanup();
        curi.resetForRescheduling();
        futureUris.put(curi.getRescheduleTime(), curi);
        futureUriCount.incrementAndGet();
    } else {
        curi.stripToMinimal();
        curi.processingCleanup();
    }
}
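To make the budget handling in processFinish easier to follow, here is a minimal sketch of which outcomes pass a cost to wq.expend(...) and which also trigger wq.noteError(...). The class FinishChargeSketch, the Outcome enum, and both method names are hypothetical and exist only for illustration; they are not part of Heritrix.

```java
// Simplified model of the charging rules visible in processFinish.
// Nothing here is Heritrix code; all names are illustrative assumptions.
final class FinishChargeSketch {
    enum Outcome { RETRY_DEFERRED, RETRY_OTHER, SUCCEEDED, DISREGARDED, FAILED }

    /** Amount handed to wq.expend(...) for one finished URI. */
    static int expendAmount(Outcome outcome, int holderCost) {
        switch (outcome) {
            case RETRY_DEFERRED:
                return 0;          // "all retries but DEFERRED cost"
            case DISREGARDED:
                return 0;          // holderCost is reset to 0 before expend
            default:
                return holderCost; // other retries, successes and failures
        }
    }

    /** Whether the extra error penalty (wq.noteError) is applied. */
    static boolean chargesErrorPenalty(Outcome outcome) {
        return outcome == Outcome.FAILED;
    }
}
```

For example, a failed URI with a holder cost of 1 is charged 1 through expend and additionally receives the error penalty, while a disregarded URI is charged nothing.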
The method first checks whether the CrawlURI curi object needs to be put back into the queue. The check is implemented as follows:
/**
 * Checks if a recently processed CrawlURI that did not finish successfully
 * needs to be reenqueued (and thus possibly, processed again after some
 * time elapses)
 *
 * @param curi
 *            The CrawlURI to check
 * @return True if we need to retry.
 */
protected boolean needsReenqueuing(CrawlURI curi) {
    // has the maximum number of retries been exceeded? (default is 30)
    if (overMaxRetries(curi)) {
        return false;
    }
    // decide based on the fetch status
    switch (curi.getFetchStatus()) {
    case HttpStatus.SC_UNAUTHORIZED:
        // We can get here though usually a positive status code is
        // a success. We get here if there is rfc2617 credential data
        // loaded and we're supposed to go around again. See if any
        // rfc2617 credential present and if there, assume it got
        // loaded in FetchHTTP on expectation that we're to go around
        // again. If no rfc2617 loaded, we should not be here.
        boolean loaded = curi.hasRfc2617Credential();
        if (!loaded && logger.isLoggable(Level.FINE)) {
            logger.fine("Have 401 but no creds loaded " + curi);
        }
        return loaded;
    case S_DEFERRED:
    case S_CONNECT_FAILED:
    case S_CONNECT_LOST:
    case S_DOMAIN_UNRESOLVABLE:
        // these are all worth a retry
        // TODO: consider if any others (S_TIMEOUT in some cases?) deserve
        // retry
        return true;
    case S_UNATTEMPTED:
        if (curi.includesRetireDirective()) {
            return true;
        }
        // otherwise, fall-through: no status is an error without queue-directive
    default:
        return false;
    }
}
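The decision table in needsReenqueuing can be restated as a small standalone sketch. The ReenqueueSketch class, its Status enum, and the parameter list are hypothetical stand-ins for the CrawlURI state; the comparison used for the retry limit is also an assumption, since overMaxRetries is not shown in this article.

```java
// Standalone restatement of the needsReenqueuing decision; not Heritrix code.
final class ReenqueueSketch {
    // Hypothetical stand-in for the fetch status constants used in the switch.
    enum Status { UNAUTHORIZED, DEFERRED, CONNECT_FAILED, CONNECT_LOST,
                  DOMAIN_UNRESOLVABLE, UNATTEMPTED, OTHER }

    static boolean needsReenqueuing(Status status, int fetchAttempts, int maxRetries,
                                    boolean hasCredential, boolean retireDirective) {
        // Assumption: the limit counts as exceeded once attempts reach maxRetries.
        if (fetchAttempts >= maxRetries) {
            return false;
        }
        switch (status) {
            case UNAUTHORIZED:
                return hasCredential;   // retry a 401 only if credentials were loaded
            case DEFERRED:
            case CONNECT_FAILED:
            case CONNECT_LOST:
            case DOMAIN_UNRESOLVABLE:
                return true;            // transient conditions worth a retry
            case UNATTEMPTED:
                return retireDirective; // only when a retire directive is present
            default:
                return false;           // everything else is final
        }
    }
}
```

Under these assumptions, a connection failure on the first attempt is retried, while the same failure on the 30th attempt (with a limit of 30) is not.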
The long retryDelayFor(CrawlURI curi) method determines how long WorkQueue wq is delayed before the retry:
/**
 * Return a suitable value to wait before retrying the given URI.
 *
 * @param curi
 *            CrawlURI to be retried
 * @return millisecond delay before retry
 */
protected long retryDelayFor(CrawlURI curi) {
    int status = curi.getFetchStatus();
    return (status == S_CONNECT_FAILED || status == S_CONNECT_LOST ||
            status == S_DOMAIN_UNRESOLVABLE) ? getRetryDelaySeconds() : 0;
            // no delay for most
}
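getRetryDelaySeconds() defaults to 900 seconds (15 minutes). Although the Javadoc says "millisecond delay", the value returned is in seconds; processFinish multiplies it by 1000 to obtain delay_ms.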
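The unit conversion is easy to miss, so here is a minimal sketch of the delay calculation as processFinish effectively performs it. RetryDelaySketch and its Status enum are hypothetical names; the 900-second value in the usage note is the default mentioned above.

```java
// Sketch of the retry delay as seen by processFinish; not Heritrix code.
final class RetryDelaySketch {
    // Hypothetical stand-in for the fetch status constants.
    enum Status { DEFERRED, CONNECT_FAILED, CONNECT_LOST, DOMAIN_UNRESOLVABLE, OTHER }

    /** Mirrors retryDelayFor: a delay in seconds, or 0 for most statuses. */
    static long retryDelaySeconds(Status status, long configuredSeconds) {
        switch (status) {
            case CONNECT_FAILED:
            case CONNECT_LOST:
            case DOMAIN_UNRESOLVABLE:
                return configuredSeconds;
            default:
                return 0L;
        }
    }

    /** Mirrors "retryDelayFor(curi) * 1000" in processFinish. */
    static long retryDelayMs(Status status, long configuredSeconds) {
        return retryDelaySeconds(status, configuredSeconds) * 1000L;
    }
}
```

With the default of 900 seconds, RetryDelaySketch.retryDelayMs(Status.CONNECT_FAILED, 900) yields 900000 ms, whereas a DEFERRED URI gets a delay of 0 and its queue is not snoozed on that account.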
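After that, the CrawlURI curi object is written back to WorkQueue wq, and finally the queue that WorkQueue wq belongs to is reset: the queue is retired, snoozed, or passed to reenqueueQueue(wq) for further processing.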
/**
 * Resets which queue collection WorkQueue wq belongs to.
 * Send an active queue to its next state, based on the supplied
 * parameters.
 *
 * @param wq
 * @param forceRetire
 * @param now
 * @param delay_ms
 */
protected void handleQueue(WorkQueue wq, boolean forceRetire, long now, long delay_ms) {
    inProcessQueues.remove(wq);
    if (forceRetire) {
        retireQueue(wq);
    } else if (delay_ms > 0) {
        snoozeQueue(wq, now, delay_ms);
    } else {
        // Enqueue the given queue to either readyClassQueues or inactiveQueues, as appropriate
        reenqueueQueue(wq);
    }
}
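The three-way branch in handleQueue amounts to a small state selection, which can be sketched on its own. QueueStateSketch and its State enum are hypothetical names used only to illustrate the order of the checks.

```java
// Sketch of the state selection performed by handleQueue; not Heritrix code.
final class QueueStateSketch {
    enum State { RETIRED, SNOOZED, REENQUEUED }

    static State nextState(boolean forceRetire, long delayMs) {
        if (forceRetire) {
            return State.RETIRED;    // a retire directive wins over any delay
        }
        if (delayMs > 0) {
            return State.SNOOZED;    // wait out the retry or politeness delay
        }
        return State.REENQUEUED;     // ready or inactive, decided by reenqueueQueue
    }
}
```

Note the precedence: a queue with a retire directive is retired even when delay_ms is positive, and only a queue with no directive and no delay reaches reenqueueQueue(wq).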
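Returning to the rest of processFinish: when no retry is needed, wq.dequeue(this, curi) removes the CrawlURI curi object from WorkQueue wq.
Finally, the queue that WorkQueue wq belongs to is reset again, this time using the politeness delay:
long delay_ms = curi.getPolitenessDelay();
handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
The handleQueue method is listed above.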
---------------------------------------------------------------------------
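This Heritrix 3.1.0 source code analysis series is my own original work.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/21/3033520.html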