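As for the lifecycle of a CrawlURI object within the processor chains, I think it logically begins at the FrontierPreparer processor and then continues through the subsequent processors. (Strictly speaking, a CrawlURI's lifecycle already starts to take shape when the extractor processors handle its parent CrawlURI, so the lifecycles of a parent CrawlURI and its child CrawlURI objects interleave. I described the processor flow in earlier posts.)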

Only after FrontierPreparer has processed a CrawlURI does the object reach the BdbFrontier object's schedule method, which adds it to a BdbWorkQueue work queue.

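This processor mainly initializes a CrawlURI's configuration: it sets the scheduling directive, canonicalizes the URL, generates the classKey, sets the holderCost, and applies the precedence policy, preparing the object for scheduling by BdbFrontier.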
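The order of these five steps can be sketched roughly as follows. This is only an illustration, not Heritrix code: UriSketch and PreparerSketch are hypothetical stand-ins, the value of NORMAL is an assumption, and each step uses a trivial placeholder rule where the real processor delegates to its configured policy.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical stand-in for CrawlURI; holds only the fields this sketch sets. */
final class UriSketch {
    final String url;
    int schedulingDirective;
    String canonical;
    String classKey;
    int holderCost;
    int precedence;

    UriSketch(String url) {
        this.url = url;
    }
}

final class PreparerSketch {
    static final int NORMAL = 3; // assumed value, for illustration only

    /** Runs the five preparation steps in the order the text lists them. */
    static void prepare(UriSketch u) {
        // 1. Scheduling directive (placeholder: everything is NORMAL)
        u.schedulingDirective = NORMAL;
        // 2. Canonical form of the URL (placeholder: lowercase only)
        u.canonical = u.url.toLowerCase(Locale.ROOT);
        // 3. classKey, which later selects the work queue (placeholder: the host)
        u.classKey = URI.create(u.url).getHost();
        // 4. Holder cost (placeholder: every URI costs 1)
        u.holderCost = 1;
        // 5. Precedence (placeholder: copy the cost)
        u.precedence = u.holderCost;
    }
}
```

After prepare returns, every field the frontier needs for queueing has a value, which is the point of running this processor before scheduling.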

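In Heritrix 3.1.0 Source Code Analysis (20), where I analyzed the processors attached to the CandidateChain candidateChain, I already mentioned the FrontierPreparer processor, but that post did not analyze what the processor does, so let us review it now.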

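The first step sets the CrawlURI's scheduling directive. It is derived from the object's pathFromSeed property (the hop path from the seed to the current CrawlURI, in which each link type has its own character code).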

    /**
     * Calculate the coarse, original 'schedulingDirective' prioritization
     * for the given CrawlURI
     * 
     * @param curi
     * @return
     */
    protected int getSchedulingDirective(CrawlURI curi) {
        if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
            char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
            if(lastHop == 'R') {
                // refer
                return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
            } 
        }
        if (getPreferenceDepthHops() == 0) {
            return HIGH;
            // this implies seed redirects are treated as path
            // length 1, which I believe is standard.
            // curi.getPathFromSeed() can never be null here, because
            // we're processing a link extracted from curi
        } else if (getPreferenceDepthHops() > 0 && 
            curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
            return HIGH;
        } else {
            // optionally preferencing embeds up to MEDIUM
            int prefHops = getPreferenceEmbedHops(); 
            if (prefHops > 0) {
                int embedHops = curi.getTransHops();
                if (embedHops > 0 && embedHops <= prefHops
                        && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                    // number of embed hops falls within the preferenced range, and
                    // uri is not already MEDIUM -- so promote it
                    return MEDIUM;
                }
            }
            // Everything else stays as previously assigned
            // (probably NORMAL, at least for now)
            return curi.getSchedulingDirective();
        }
    }
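To make the branches easier to follow, here is a standalone sketch of the same decision with every input passed in as a plain parameter. SchedulingSketch is a hypothetical class, and the numeric values of HIGH, MEDIUM and NORMAL are assumptions chosen for illustration; only the branching mirrors the method above.

```java
final class SchedulingSketch {
    // Assumed values; the sketch only needs them to be distinct.
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    /**
     * @param pathFromSeed        hop path from the seed, e.g. "LLE" (non-null here)
     * @param preferenceDepthHops value of getPreferenceDepthHops()
     * @param preferenceEmbedHops value of getPreferenceEmbedHops()
     * @param transHops           value of curi.getTransHops()
     * @param current             directive already stored on the URI
     */
    static int directive(String pathFromSeed, int preferenceDepthHops,
            int preferenceEmbedHops, int transHops, int current) {
        // A URI reached through a redirect ('R' as the last hop) is promoted first.
        if (!pathFromSeed.isEmpty()
                && pathFromSeed.charAt(pathFromSeed.length() - 1) == 'R') {
            return preferenceDepthHops >= 0 ? HIGH : MEDIUM;
        }
        // Depth preference of 0 promotes everything.
        if (preferenceDepthHops == 0) {
            return HIGH;
        }
        // Otherwise only URIs within the preferred depth are promoted.
        if (preferenceDepthHops > 0
                && pathFromSeed.length() + 1 <= preferenceDepthHops) {
            return HIGH;
        }
        // Embeds within the preferred range move from NORMAL up to MEDIUM.
        if (preferenceEmbedHops > 0 && transHops > 0
                && transHops <= preferenceEmbedHops && current == NORMAL) {
            return MEDIUM;
        }
        // Everything else keeps its previous directive.
        return current;
    }
}
```

For example, with a depth preference of -1 and an embed preference of 1, a path of "LE" with one trans hop goes from NORMAL to MEDIUM, while "LEE" with two trans hops stays NORMAL.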

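UriCanonicalizationPolicy, which I will call the URL canonicalization policy class, is abstract. It declares an abstract method for canonicalizing a URL, which concrete subclasses implement.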

/**
 * URI Canonicalization Policy
 * 
 * @contributor stack
 * @contributor gojomo
 */
public abstract class UriCanonicalizationPolicy {
    public abstract String canonicalize(String uri);
}

RulesCanonicalizationPolicy extends the abstract class UriCanonicalizationPolicy and implements the canonicalization method.

/**
 * URI Canonicalization Policy
 * 
 * @contributor stack
 * @contributor gojomo
 */
public class RulesCanonicalizationPolicy 
    extends UriCanonicalizationPolicy
    implements HasKeyedProperties {
    private static Logger logger =
        Logger.getLogger(RulesCanonicalizationPolicy.class.getName());
    
    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }
    
    {
        setRules(getDefaultRules());
    }
    @SuppressWarnings("unchecked")
    public List<CanonicalizationRule> getRules() {
        return (List<CanonicalizationRule>) kp.get("rules");
    }
    public void setRules(List<CanonicalizationRule> rules) {
        kp.put("rules", rules);
    }
    
    /**
     * Run the passed uuri through the list of rules.
     * @param context Url to canonicalize.
     * @param rules Iterator of canonicalization rules to apply (Get one
     * of these on the url-canonicalizer-rules element in order files or
     * create a list externally).  Rules must implement the Rule interface.
     * @return Canonicalized URL.
     */
    public String canonicalize(String before) {
        String canonical = before;
        if (logger.isLoggable(Level.FINER)) {
            logger.finer("Canonicalizing: "+before);
        }
        for (CanonicalizationRule rule : getRules()) {
            if(rule.getEnabled()) {
                canonical = rule.canonicalize(canonical);
            }
            if (logger.isLoggable(Level.FINER)) {
                logger.finer(
                    "Rule " + rule.getClass().getName() + " "
                    + (rule.getEnabled()
                            ? canonical :" (disabled)"));
            }
        }
        return canonical;
    }
    
    /**
     * A reasonable set of default rules to use, if no others are
     * provided by operator configuration.
     */
    public static List<CanonicalizationRule> getDefaultRules() {
        List<CanonicalizationRule> rules = new ArrayList<CanonicalizationRule>(6);
        rules.add(new LowercaseRule());
        rules.add(new StripUserinfoRule());
        rules.add(new StripWWWNRule());
        rules.add(new StripSessionIDs());
        rules.add(new StripSessionCFIDs());
        rules.add(new FixupQueryString());
        return rules;
    }
}

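The canonicalize method iterates over the list of CanonicalizationRule objects and calls String canonicalize(String url) on each enabled rule, feeding the output of one rule into the next.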

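CanonicalizationRule is an interface that declares the String canonicalize(String url) method. Its implementations include the classes added in the static method List<CanonicalizationRule> getDefaultRules() above. This approach somewhat resembles a combination of the Composite and Iterator patterns, although the branch node and the leaf nodes do not share a common interface type.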
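Because each rule receives the previous rule's output, the order of the list matters. The following minimal sketch shows that effect with two simplified rules. CanonicalizeSketch, LOWERCASE and STRIP_WWW are hypothetical stand-ins written for this example and do not reproduce the behavior of the real LowercaseRule or StripWWWNRule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class CanonicalizeSketch {
    /** Simplified stand-in rule: lowercase the whole URL. */
    static final UnaryOperator<String> LOWERCASE =
            s -> s.toLowerCase(Locale.ROOT);

    /** Simplified stand-in rule: drop a leading "www." or "wwwN." (case-sensitive). */
    static final UnaryOperator<String> STRIP_WWW =
            s -> s.replaceFirst("^(https?://)www\\d*\\.", "$1");

    /** Applies the rules in list order, passing each result to the next rule. */
    static String canonicalize(String url, List<UnaryOperator<String>> rules) {
        String canonical = url;
        for (UnaryOperator<String> rule : rules) {
            canonical = rule.apply(canonical);
        }
        return canonical;
    }

    public static void main(String[] args) {
        String in = "HTTP://WWW2.Example.com/Index.html";
        System.out.println(canonicalize(in, Arrays.asList(LOWERCASE, STRIP_WWW)));
        System.out.println(canonicalize(in, Arrays.asList(STRIP_WWW, LOWERCASE)));
    }
}
```

With LOWERCASE first the result is http://example.com/index.html. With the order reversed, the case-sensitive STRIP_WWW pattern does not match the uppercase input, so the result keeps the www2 prefix.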

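QueueAssignmentPolicy is the policy that generates a URL object's classKey. It is likewise abstract and provides the method for generating the classKey (a work queue's identifier is derived from this classKey).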

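The default classKey policy is the SurtAuthorityQueueAssignmentPolicy implementation, which builds the key string from the URL's domain name. All URLs on a site with the same domain therefore share a single classKey, and hence a single work queue.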
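The idea of one key per host can be illustrated with a small sketch that reverses the host labels in SURT style. QueueKeySketch and its "default" fallback are hypothetical, and the exact string produced by SurtAuthorityQueueAssignmentPolicy is not reproduced here; the sketch only shows that URLs on the same host map to the same key.

```java
import java.net.URI;
import java.util.Locale;

final class QueueKeySketch {
    /** Builds a key from the reversed host labels, e.g. www.example.com -> com,example,www, */
    static String classKey(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "default"; // hypothetical fallback for URLs without a host
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]).append(',');
        }
        return key.toString();
    }
}
```

Both http://www.example.com/a and http://www.example.com/b?x=1 yield the same key, so in this model they would land in the same work queue.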

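We can extend the classKey generation policy. A classic approach is to use the ELFHash algorithm to assign a key to each CrawlURI. As an example, I create a class MyQueueAssignmentPolicy that extends the abstract class QueueAssignmentPolicy. The relevant source is as follows: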

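import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.modules.CrawlURI;

public class MyQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final long serialVersionUID = 1L;

    @Override
    public String getClassKey(CrawlURI cauri)
    {
        String uri = cauri.getURI().toString();
        // Hash the full URI with ELFHash to obtain its key value
        long hash = ELFHash(uri);
        // Take the hash modulo 50, corresponding to 50 threads
        String a = Long.toString(hash % 50);
        return a;
    }

    public long ELFHash(String str)
    {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++)
        {
            // Shift the running hash left by four bits and add the next character
            hash = (hash << 4) + str.charAt(i);
            // If any of bits 28-31 are set, fold them back into the low bits
            if ((x = hash & 0xF0000000L) != 0)
            {
                hash ^= (x >> 24);
                // Clear the bits that were just folded back
                hash &= ~x;
            }
        }
        // Keep the low 31 bits so the result is never negative
        return (hash & 0x7FFFFFFF);
    }
}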
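The hashing step can be checked without Heritrix on the classpath. The sketch below copies the same arithmetic into a hypothetical standalone class, ElfBucketSketch, and takes the bucket count as a parameter instead of the fixed 50.

```java
final class ElfBucketSketch {
    /** Same arithmetic as the ELFHash method above, as a static helper. */
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash << 4) + s.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    /** Maps a URI string to one of the given number of queue keys. */
    static String bucket(String uri, int buckets) {
        return Long.toString(elfHash(uri) % buckets);
    }

    public static void main(String[] args) {
        System.out.println(bucket("a", 50));  // hash 97, so key "47"
        System.out.println(bucket("ab", 50)); // hash 1650, so key "0"
    }
}
```

Because the key is computed from the full URI string rather than the host, two URLs from the same site can receive different keys, which is what spreads one domain over several work queues.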

Then, in the crawler-beans.cxml configuration file, we set the queueAssignmentPolicy property of the FrontierPreparer processor bean to a bean of our MyQueueAssignmentPolicy class, and that is all.

UriPrecedencePolicy is the precedence policy for CrawlURI objects. It is likewise abstract and declares an abstract method that sets a CrawlURI's precedence.

abstract public class UriPrecedencePolicy implements Serializable {

    /**
     * Add a precedence value to the supplied CrawlURI, which is being 
     * scheduled onto a frontier queue for the first time. 
     * @param curi CrawlURI to assign a precedence value
     */
    abstract public void uriScheduled(CrawlURI curi);

}

The default is CostUriPrecedencePolicy, which sets a CrawlURI's precedence according to its holder cost.

/**
 * UriPrecedencePolicy which sets a URI's precedence to its 'cost' -- which
 * simulates the in-queue sorting order in Heritrix 1.x, where cost 
 * contributed the same bits to the queue-insert-key that precedence now does.
 */
public class CostUriPrecedencePolicy extends UriPrecedencePolicy {
    private static final long serialVersionUID = -8164425278358540710L;

    /* (non-Javadoc)
     * @see org.archive.crawler.frontier.precedence.UriPrecedencePolicy#uriScheduled(org.archive.crawler.datamodel.CrawlURI)
     */
    @Override
    public void uriScheduled(CrawlURI curi) {
        curi.setPrecedence(curi.getHolderCost()); 
    }
}
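Since this policy simply copies the cost, the precedence a URI ends up with is decided by whichever cost policy ran before it. The sketch below models that dependency with hypothetical stand-ins: PrecedenceSketch, its nested Uri class, and QUERY_COST are invented for this example, and UNIT_COST only imitates the idea of a cost of 1 per URI.

```java
import java.util.function.ToIntFunction;

final class PrecedenceSketch {
    /** Hypothetical stand-in for CrawlURI. */
    static final class Uri {
        final String url;
        int holderCost;
        int precedence;

        Uri(String url) {
            this.url = url;
        }
    }

    /** Every URI costs 1. */
    static final ToIntFunction<String> UNIT_COST = url -> 1;

    /** Hypothetical alternative: a URL with a query string costs one more. */
    static final ToIntFunction<String> QUERY_COST =
            url -> url.indexOf('?') >= 0 ? 2 : 1;

    /** Assigns the cost first, then copies it into the precedence. */
    static void schedule(Uri uri, ToIntFunction<String> costPolicy) {
        uri.holderCost = costPolicy.applyAsInt(uri.url);
        uri.precedence = uri.holderCost;
    }
}
```

Under UNIT_COST every URI gets precedence 1, so the precedence does not distinguish URIs at all. A cost policy that varies by URL is what would make a cost-based precedence meaningful.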

The policies used by the FrontierPreparer processor bean are configured in the crawler-beans.cxml file as follows:

 <!-- 
   OPTIONAL BEANS
    Uncomment and expand as needed, or if non-default alternate 
    implementations are preferred.
  -->
  
 <!-- CANONICALIZATION POLICY -->
 <bean id="canonicalizationPolicy" 
   class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
   <property name="rules">
    <list>
     <bean class="org.archive.modules.canonicalize.LowercaseRule" />
     <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
     <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
     <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
     <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
     <bean class="org.archive.modules.canonicalize.FixupQueryString" />
    </list>
  </property>
 </bean> 

 <!-- QUEUE ASSIGNMENT POLICY -->
 <bean id="queueAssignmentPolicy" 
   class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <property name="forceQueueAssignment" value="" />
  <property name="deferToPrevious" value="true" />
  <property name="parallelQueues" value="1" />
 </bean>
 
 <!-- URI PRECEDENCE POLICY -->
 <bean id="uriPrecedencePolicy" 
   class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">
 </bean>
 
 <!-- COST ASSIGNMENT POLICY -->
 <bean id="costAssignmentPolicy" 
   class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">
 </bean>

---------------------------------------------------------------------------

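This Heritrix 3.1.0 source code analysis series is my original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/29/3050992.html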

posted on 2013-04-30 18:59 by 刺猬的温驯