A gentleman who learns broadly and examines himself daily will be clear in understanding and blameless in conduct

The previous article analyzed how Heritrix 3.1.0 adds CrawlURI curi objects. So how are the seed CrawlURI curi objects loaded when the system initializes?

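Recall from the earlier articles that when we issue the launch command for a crawl job, what actually gets called is the void requestCrawlStart() method of the CrawlController object: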

/** 
     * Operator requested crawl begin
     */
    public void requestCrawlStart() {
        hasStarted = true; 
        sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);
        
        if(recoveryCheckpoint==null) {
            // only announce (trigger scheduling of) seeds
            // when doing a cold (non-recovery) start
            getSeeds().announceSeeds();
        }
        
        setupToePool();

        // A proper exit will change this value.
        this.sExit = CrawlStatus.FINISHED_ABNORMAL;
        
        if (getPauseAtStart()) {
            // frontier is already paused unless started, so just 
            // 'complete'/ack pause
            completePause();
        } else {
            getFrontier().run();
        }
    }

The call continues into getSeeds().announceSeeds(). The object actually returned by getSeeds() here is a TextSeedModule (injected by Spring), and its void announceSeeds() method is invoked next:

/**
     * Announce all seeds from configured source to SeedListeners 
     * (including nonseed lines mixed in). 
     * @see org.archive.modules.seeds.SeedModule#announceSeeds()
     */
    public void announceSeeds() {
        if(getBlockAwaitingSeedLines()>-1) {
            final CountDownLatch latch = new CountDownLatch(getBlockAwaitingSeedLines());
            new Thread(){
                @Override
                public void run() {
                    announceSeeds(latch); 
                    while(latch.getCount()>0) {
                        latch.countDown();
                    }
                }
            }.start();
            try {
                latch.await();
            } catch (InterruptedException e) {
                // do nothing
            } 
        } else {
            announceSeeds(null); 
        }
    }
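The blocking behaviour of announceSeeds() can be sketched in isolation. The following is a minimal sketch, not Heritrix code: the class LatchedAnnounceDemo, its methods, and the per-call counter are hypothetical names introduced for illustration, and plain strings stand in for seed lines. It assumes the same idea as the method above: a helper thread counts down once per line, and drains the latch at the end so the caller is released even when there are fewer lines than the threshold.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it only mimics the latch pattern, not Heritrix itself.
class LatchedAnnounceDemo {

    // Stand-in for per-line seed handling: one countDown per processed line.
    static void announce(List<String> lines, CountDownLatch latchOrNull, AtomicInteger counter) {
        for (String line : lines) {
            counter.incrementAndGet();
            if (latchOrNull != null) {
                latchOrNull.countDown();
            }
        }
    }

    // Returns once blockAwaitingLines lines are handled (or all lines, if fewer exist).
    // A value of -1 processes everything synchronously on the calling thread.
    static AtomicInteger announceBlocking(List<String> lines, int blockAwaitingLines) {
        AtomicInteger counter = new AtomicInteger();
        if (blockAwaitingLines > -1) {
            CountDownLatch latch = new CountDownLatch(blockAwaitingLines);
            new Thread(() -> {
                announce(lines, latch, counter);
                // Fewer lines than the threshold: drain the latch to release the caller.
                while (latch.getCount() > 0) {
                    latch.countDown();
                }
            }).start();
            try {
                latch.await();
            } catch (InterruptedException e) {
                // Unlike the excerpt above, this sketch restores the interrupt flag.
                Thread.currentThread().interrupt();
            }
        } else {
            announce(lines, null, counter);
        }
        return counter;
    }

    public static void main(String[] args) {
        AtomicInteger handled = announceBlocking(Arrays.asList("a.example", "b.example"), 5);
        System.out.println("lines handled before returning: " + handled.get());
    }
}
```

With a threshold smaller than the number of lines, the caller may return while the helper thread is still working through the remaining lines, which is the point of the pattern: startup can proceed without waiting for a very large seed list.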

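In announceSeeds(), the if branch creates a CountDownLatch sized to getBlockAwaitingSeedLines(), so the caller blocks only until that many seed lines have been processed; the else branch passes null and reads the seeds on the calling thread. Both branches continue into void announceSeeds(CountDownLatch latchOrNull):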

protected void announceSeeds(CountDownLatch latchOrNull) {
        BufferedReader reader = new BufferedReader(textSource.obtainReader());       
        try {
            announceSeedsFromReader(reader,latchOrNull);    
        } finally {
            IOUtils.closeQuietly(reader);
        }
    }

This first obtains a Reader (a StringReader) from the ReadSource textSource (an org.archive.spring.ConfigString), then calls void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull):

/**
     * Announce all seeds (and nonseed possible-directive lines) from
     * the given Reader
     * @param reader source of seed/directive lines
     * @param latchOrNull if non-null, sent countDown after each line, allowing 
     * another thread to proceed after a configurable number of lines processed
     */
    protected void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull) {
        String s;
        Iterator<String> iter = 
            new RegexLineIterator(
                    new LineReadingIterator(reader),
                    RegexLineIterator.COMMENT_LINE,
                    RegexLineIterator.NONWHITESPACE_ENTRY_TRAILING_COMMENT,
                    RegexLineIterator.ENTRY);

        int count = 0; 
        while (iter.hasNext()) {
            s = (String) iter.next();
            if(Character.isLetterOrDigit(s.charAt(0))) {
                // consider a likely URI
                seedLine(s);
                count++;
                if(count%20000==0) {
                    System.runFinalization();
                }
            } else {
                // report just in case it's a useful directive
                nonseedLine(s);
            }
            if(latchOrNull!=null) {
                latchOrNull.countDown(); 
            }
        }
        publishConcludedSeedBatch(); 
    }
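The seed versus non-seed decision in the loop above can be tried out on its own. This is a simplified sketch under stated assumptions: SeedLineClassifierDemo is a hypothetical class, and instead of RegexLineIterator it uses a much cruder filter (skip blank lines and lines whose first non-blank character is '#'; trailing comments are not handled). Only the Character.isLetterOrDigit check on the first character is taken from the method above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the filtering is an assumed simplification,
// not the behaviour of Heritrix's RegexLineIterator.
class SeedLineClassifierDemo {
    final List<String> seeds = new ArrayList<>();
    final List<String> nonSeeds = new ArrayList<>();

    void read(String text) {
        // try-with-resources closes the reader, like the finally block in announceSeeds.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (entry.isEmpty() || entry.charAt(0) == '#') {
                    continue; // blank line or whole-line comment
                }
                if (Character.isLetterOrDigit(entry.charAt(0))) {
                    seeds.add(entry);    // likely a URI
                } else {
                    nonSeeds.add(entry); // possibly a directive
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SeedLineClassifierDemo demo = new SeedLineClassifierDemo();
        demo.read("# my seeds\nexample.com\n\n+http://extra.example/\n");
        System.out.println("seeds: " + demo.seeds);
        System.out.println("non-seed lines: " + demo.nonSeeds);
    }
}
```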

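The loop iterates over the URL strings; each line that begins with a letter or digit is treated as a likely URI and passed to void seedLine(String uri):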

/**
     * Handle a read line that is probably a seed.
     * 
     * @param uri String seed-containing line
     */
    protected void seedLine(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) { // Rfc2396 s3.1 scheme,
                                                     // minus '.'
            // Does not begin with scheme, so try http://
            uri = "http://" + uri;
        }
        try {
            UURI uuri = UURIFactory.getInstance(uri);
            CrawlURI curi = new CrawlURI(uuri);
            curi.setSeed(true);
            curi.setSchedulingDirective(SchedulingConstants.MEDIUM);
            if (getSourceTagSeeds()) {
                curi.setSourceTag(curi.toString());
            }
            publishAddedSeed(curi);
        } catch (URIException e) {
            // try as nonseed line as fallback
            nonseedLine(uri);
        }
    }
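The scheme check at the top of seedLine is easy to misread, so here is a small sketch that applies the same regular expression to a few inputs. SeedSchemeDemo and withDefaultScheme are hypothetical names; only the pattern and the "http://" prefix come from the method above.

```java
// Hypothetical demo class reusing the regular expression from seedLine.
class SeedSchemeDemo {

    // Prefixes "http://" when the line does not start with something scheme-like.
    static String withDefaultScheme(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) {
            return "http://" + uri;
        }
        return uri;
    }

    public static void main(String[] args) {
        // No scheme, so the default is added.
        System.out.println(withDefaultScheme("example.com/index.html"));
        // Already has a scheme, so it is left alone.
        System.out.println(withDefaultScheme("https://example.com/"));
        // The '.' in the host stops the scheme match, so the default is added.
        System.out.println(withDefaultScheme("example.com:8080/path"));
        // Edge case: "localhost" looks like a scheme to this pattern, so nothing is added.
        System.out.println(withDefaultScheme("localhost:8080"));
    }
}
```

The last case suggests that a host:port seed without dots in the host is best written with an explicit http:// prefix in the seed list.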

Finally, the void publishAddedSeed(CrawlURI curi) method of the parent class SeedModule is called (the observer pattern):

protected void publishAddedSeed(CrawlURI curi) {
        for (SeedListener l: seedListeners) {
            l.addedSeed(curi);
        }
    }

The BdbFrontier class implements the SeedListener interface indirectly (through the void addedSeed(CrawlURI puri) method of the abstract class AbstractFrontier):

/**
     * When notified of a seed via the SeedListener interface, 
     * schedule it.
     * 
     * @see org.archive.modules.seeds.SeedListener#addedSeed(org.archive.modules.CrawlURI)
     */
    public void addedSeed(CrawlURI puri) {
        schedule(puri);
    }
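The publish and listen relationship between the seed module and the frontier can be reduced to a few lines. The following is a minimal sketch of the observer pattern only: DemoSeedListener, DemoSeedModule, DemoFrontier and ObserverDemo are hypothetical stand-ins, seeds are plain strings rather than CrawlURI objects, and schedule() here just appends to an in-memory queue, which says nothing about how BdbFrontier actually stores URIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical listener interface, playing the role of SeedListener.
interface DemoSeedListener {
    void addedSeed(String seed);
}

// Hypothetical publisher, playing the role of the seed module.
class DemoSeedModule {
    private final List<DemoSeedListener> listeners = new ArrayList<>();

    void addListener(DemoSeedListener listener) {
        listeners.add(listener);
    }

    // Notifies every registered listener of the new seed.
    void publishAddedSeed(String seed) {
        for (DemoSeedListener l : listeners) {
            l.addedSeed(seed);
        }
    }
}

// Hypothetical subscriber, playing the role of the frontier.
class DemoFrontier implements DemoSeedListener {
    final Queue<String> scheduled = new ArrayDeque<>();

    @Override
    public void addedSeed(String seed) {
        schedule(seed);
    }

    // Assumed placeholder: a real frontier does far more than enqueue.
    void schedule(String seed) {
        scheduled.add(seed);
    }
}

class ObserverDemo {
    public static void main(String[] args) {
        DemoSeedModule seeds = new DemoSeedModule();
        DemoFrontier frontier = new DemoFrontier();
        seeds.addListener(frontier);
        seeds.publishAddedSeed("http://example.com/");
        System.out.println("scheduled: " + frontier.scheduled);
    }
}
```

The design choice worth noting is that the seed module never references the frontier directly; it only knows about listeners, so other components can subscribe to seeds without changes to the module.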

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Article link: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031924.html

posted on 2013-04-20 06:38 by 刺猬的温驯