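As for the lifecycle of a CrawlURI uri object in the processor chains, I think that logically it starts at the FrontierPreparer processor and then continues through the subsequent processors. (Strictly speaking, a CrawlURI's lifecycle already begins to take shape when the extractor processors handle its parent CrawlURI; the lifecycles of a parent CrawlURI and its child CrawlURIs are interleaved. I described the processor flow in an earlier post.)
Only after a CrawlURI uri object has been handled by the FrontierPreparer processor does it enter the BdbFrontier object's schedule method and get added to a BdbWorkQueue work queue.
This processor mainly initializes the CrawlURI's settings: it sets the scheduling directive, canonicalizes the URL, generates the classKey, sets the holderCost, and applies the precedence policy, so that the BdbFrontier object can schedule it.
I already mentioned the FrontierPreparer processor in Heritrix 3.1.0 Source Code Analysis (20), when I went through the processors attached to the CandidateChain candidateChain, but that post did not analyze what the processor does, so let's revisit it now.
The first step is to set the scheduling directive of the CrawlURI curi object. It is derived from the curi's pathFromSeed property (the hops from the seed to the current CrawlURI curi; each link type has its own hop code):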
```java
/**
 * Calculate the coarse, original 'schedulingDirective' prioritization
 * for the given CrawlURI
 * 
 * @param curi
 * @return
 */
protected int getSchedulingDirective(CrawlURI curi) {
    if (StringUtils.isNotEmpty(curi.getPathFromSeed())) {
        char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
        if (lastHop == 'R') {
            // refer
            return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
        }
    }
    if (getPreferenceDepthHops() == 0) {
        return HIGH;
        // this implies seed redirects are treated as path
        // length 1, which I belive is standard.
        // curi.getPathFromSeed() can never be null here, because
        // we're processing a link extracted from curi
    } else if (getPreferenceDepthHops() > 0 && 
            curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
        return HIGH;
    } else {
        // optionally preferencing embeds up to MEDIUM
        int prefHops = getPreferenceEmbedHops();
        if (prefHops > 0) {
            int embedHops = curi.getTransHops();
            if (embedHops > 0 && embedHops <= prefHops
                    && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                // number of embed hops falls within the preferenced range, and
                // uri is not already MEDIUM -- so promote it
                return MEDIUM;
            }
        }
        // Everything else stays as previously assigned
        // (probably NORMAL, at least for now)
        return curi.getSchedulingDirective();
    }
}
```
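To make the order of the branches easier to follow, here is a minimal standalone sketch of the same decision. It does not use any Heritrix classes: `SchedulingDirectiveSketch`, the `Directive` enum (standing in for the integer constants in SchedulingConstants) and the explicit `preferenceDepthHops` / `preferenceEmbedHops` parameters are all illustrative assumptions, and a null path is treated as empty for convenience.

```java
// Simplified model of the branch order in getSchedulingDirective().
// All names here are hypothetical; this is not Heritrix code.
final class SchedulingDirectiveSketch {

    // Stand-in for the HIGH / MEDIUM / NORMAL integer constants.
    enum Directive { HIGH, MEDIUM, NORMAL }

    static Directive directiveFor(String pathFromSeed, int transHops, Directive current,
                                  int preferenceDepthHops, int preferenceEmbedHops) {
        String path = (pathFromSeed == null) ? "" : pathFromSeed;

        // 1. A URI whose last hop is 'R' is promoted before anything else is checked.
        if (!path.isEmpty() && path.charAt(path.length() - 1) == 'R') {
            return preferenceDepthHops >= 0 ? Directive.HIGH : Directive.MEDIUM;
        }
        // 2. A depth preference of exactly 0 makes every remaining URI HIGH.
        if (preferenceDepthHops == 0) {
            return Directive.HIGH;
        }
        // 3. A positive depth preference promotes URIs that are close enough to the seed.
        if (preferenceDepthHops > 0 && path.length() + 1 <= preferenceDepthHops) {
            return Directive.HIGH;
        }
        // 4. Otherwise, embeds within the preferred embed range move from NORMAL to MEDIUM.
        if (preferenceEmbedHops > 0 && transHops > 0 && transHops <= preferenceEmbedHops
                && current == Directive.NORMAL) {
            return Directive.MEDIUM;
        }
        // 5. Everything else keeps the directive it already had.
        return current;
    }

    public static void main(String[] args) {
        // Last hop 'R' with no depth preference (-1).
        System.out.println(directiveFor("LR", 0, Directive.NORMAL, -1, 1));
        // Two link hops with a depth preference of 3.
        System.out.println(directiveFor("LL", 0, Directive.NORMAL, 3, 1));
        // One embed hop with an embed preference of 1.
        System.out.println(directiveFor("LE", 1, Directive.NORMAL, -1, 1));
        // Two embed hops, outside the embed preference of 1.
        System.out.println(directiveFor("LEE", 2, Directive.NORMAL, -1, 1));
    }
}
```

The point to notice is that the 'R' check runs first, so a URI reached through such a hop is promoted even when no depth preference is configured.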
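UriCanonicalizationPolicy, which I will call the URL canonicalization policy class, is abstract. It declares an abstract method for canonicalizing a URL, which concrete subclasses implement.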
```java
/**
 * URI Canonicalizatioon Policy
 * 
 * @contributor stack
 * @contributor gojomo
 */
public abstract class UriCanonicalizationPolicy {
    public abstract String canonicalize(String uri);
}
```
RulesCanonicalizationPolicy extends the abstract class UriCanonicalizationPolicy and implements the canonicalization method:
```java
/**
 * URI Canonicalizatioon Policy
 * 
 * @contributor stack
 * @contributor gojomo
 */
public class RulesCanonicalizationPolicy 
    extends UriCanonicalizationPolicy
    implements HasKeyedProperties {
    private static Logger logger =
        Logger.getLogger(RulesCanonicalizationPolicy.class.getName());

    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    {
        setRules(getDefaultRules());
    }
    @SuppressWarnings("unchecked")
    public List<CanonicalizationRule> getRules() {
        return (List<CanonicalizationRule>) kp.get("rules");
    }
    public void setRules(List<CanonicalizationRule> rules) {
        kp.put("rules", rules);
    }

    /**
     * Run the passed uuri through the list of rules.
     * @param context Url to canonicalize.
     * @param rules Iterator of canonicalization rules to apply (Get one
     * of these on the url-canonicalizer-rules element in order files or
     * create a list externally).  Rules must implement the Rule interface.
     * @return Canonicalized URL.
     */
    public String canonicalize(String before) {
        String canonical = before;
        if (logger.isLoggable(Level.FINER)) {
            logger.finer("Canonicalizing: "+before);
        }
        for (CanonicalizationRule rule : getRules()) {
            if(rule.getEnabled()) {
                canonical = rule.canonicalize(canonical);
            }
            if (logger.isLoggable(Level.FINER)) {
                logger.finer(
                    "Rule " + rule.getClass().getName() + " "
                    + (rule.getEnabled()
                            ? canonical :" (disabled)"));
            }
        }
        return canonical;
    }

    /**
     * A reasonable set of default rules to use, if no others are
     * provided by operator configuration.
     */
    public static List<CanonicalizationRule> getDefaultRules() {
        List<CanonicalizationRule> rules = new ArrayList<CanonicalizationRule>(6);
        rules.add(new LowercaseRule());
        rules.add(new StripUserinfoRule());
        rules.add(new StripWWWNRule());
        rules.add(new StripSessionIDs());
        rules.add(new StripSessionCFIDs());
        rules.add(new FixupQueryString());
        return rules;
    }
}
```
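The canonicalization method iterates over the collection of CanonicalizationRule objects and calls each member's String canonicalize(String url) method, feeding the output of one rule into the next.
CanonicalizationRule is an interface that declares the String canonicalize(String url) method. Its implementations include the classes added in the static method List<CanonicalizationRule> getDefaultRules() above. This approach resembles a combination of the Composite and Iterator patterns, although the branch node and the leaf nodes do not implement a common interface type.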
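The chaining itself can be tried out in isolation. The following is a minimal sketch with no Heritrix dependency: the `Rule` interface and the three rules are simplified, hypothetical stand-ins (they are not the real LowercaseRule, StripUserinfoRule or StripWWWNRule implementations), and the regular expressions only cover plain http/https URLs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative rule chain: each rule receives the output of the previous one.
final class CanonicalizationChainSketch {

    // Hypothetical stand-in for the CanonicalizationRule interface.
    interface Rule {
        String canonicalize(String url);
    }

    // Lowercase the whole URL.
    static final Rule LOWERCASE = url -> url.toLowerCase(Locale.ROOT);

    // Drop a "user:password@" part that directly follows the scheme.
    static final Rule STRIP_USERINFO =
            url -> url.replaceFirst("^(https?://)[^/@]*@", "$1");

    // Drop a leading "www." or "wwwN." host label.
    static final Rule STRIP_WWW =
            url -> url.replaceFirst("^(https?://)www\\d*\\.", "$1");

    // Apply the rules in list order, as the loop in canonicalize(String) does.
    static String canonicalize(String before, List<Rule> rules) {
        String canonical = before;
        for (Rule rule : rules) {
            canonical = rule.canonicalize(canonical);
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(LOWERCASE, STRIP_USERINFO, STRIP_WWW);
        // prints "http://example.com/index.html"
        System.out.println(canonicalize("HTTP://user:pw@WWW.Example.com/Index.html", rules));
    }
}
```

Because each rule sees the previous rule's output, the order of the list matters: here the www-stripping pattern is written in lowercase only, so it relies on the lowercase rule having run first.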
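QueueAssignmentPolicy is the policy that generates a URL object's classKey. It is also an abstract class and provides the method for generating the classKey (the identifier of a work queue is derived from this classKey).
The default classKey policy is the SurtAuthorityQueueAssignmentPolicy implementation, which builds the key string from the URL's domain name. All URL objects under the same domain therefore share a single classKey, which means they share a single work queue.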
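The consequence of a host-based key can be shown with a small standalone sketch. It is only an approximation under stated assumptions: `HostClassKeySketch` and its methods are hypothetical, and the key is simply the lowercase host taken from `java.net.URI`, not the exact string the real SurtAuthorityQueueAssignmentPolicy produces.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative host-based queue assignment; not Heritrix code.
final class HostClassKeySketch {

    // Derive the queue key from the host part of the URL only.
    static String classKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // Group URLs into "queues", one per distinct class key.
    static Map<String, List<String>> assignQueues(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(classKey(url), k -> new ArrayList<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b?x=1",
                "http://example.org/");
        // Both example.com URLs end up in the same queue.
        System.out.println(assignQueues(urls));
    }
}
```

However many URLs a site contributes, they all land in one queue, which is why a crawl of a single large site ends up with only one work queue under this kind of policy.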
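We can extend the classKey generation policy. A classic approach is to use the ELFHash algorithm to assign a key to each CrawlURI curi object. As an example, create a MyQueueAssignmentPolicy class that extends the abstract class QueueAssignmentPolicy. The relevant source is as follows: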
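```java
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.modules.CrawlURI;

public class MyQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final long serialVersionUID = 1L;

    @Override
    public String getClassKey(CrawlURI cauri) {
        String uri = cauri.getURI().toString();
        // Use the ELFHash algorithm to derive a key from the URI
        long hash = ELFHash(uri);
        // Modulo 50, corresponding to 50 threads
        String a = Long.toString(hash % 50);
        return a;
    }

    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            // Shift the hash left by four bits and add the next character
            hash = (hash << 4) + str.charAt(i);
            // If any of bits 28-31 are set, fold them back into the
            // low bits with XOR and then clear them
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
```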
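To see what keys this hashing scheme produces without starting a crawl, the hash can be lifted into a standalone sketch. `ElfHashKeySketch`, `BUCKETS` and the sample `example.com` URLs are illustrative assumptions; only the hash arithmetic is taken from the listing above.

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone copy of the ELFHash key computation, for experimenting with key spread.
final class ElfHashKeySketch {

    // Number of distinct keys, matching the "% 50" in the example policy.
    static final int BUCKETS = 50;

    static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            // Shift left by four bits and add the next character.
            hash = (hash << 4) + str.charAt(i);
            // Fold bits 28-31 back into the low bits, then clear them.
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // The key is the hash reduced to one of BUCKETS values, as a string.
    static String classKey(String uri) {
        return Long.toString(elfHash(uri) % BUCKETS);
    }

    public static void main(String[] args) {
        // Count how many URLs of a single host fall under each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            counts.merge(classKey("http://example.com/page/" + i), 1, Integer::sum);
        }
        System.out.println(counts.size() + " distinct keys for 1000 URLs of one host");
        System.out.println(counts);
    }
}
```

Since the key depends on the whole URI rather than on the host, URLs of one site are spread over up to 50 keys, and so over up to 50 work queues, instead of a single one.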
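Then, in the crawler-beans.cxml configuration file, set the queueAssignmentPolicy property of the FrontierPreparer processor bean to a bean of our MyQueueAssignmentPolicy class, and that is all.
UriPrecedencePolicy is the precedence policy for CrawlURI curi objects. It is also an abstract class, and it declares an abstract method that sets the precedence of a CrawlURI curi object: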
```java
abstract public class UriPrecedencePolicy implements Serializable {

    /**
     * Add a precedence value to the supplied CrawlURI, which is being 
     * scheduled onto a frontier queue for the first time. 
     * @param curi CrawlURI to assign a precedence value
     */
    abstract public void uriScheduled(CrawlURI curi);
}
```
The default is CostUriPrecedencePolicy, which sets the precedence of a CrawlURI curi object from its holder cost:
```java
/**
 * UriPrecedencePolicy which sets a URI's precedence to its 'cost' -- which
 * simulates the in-queue sorting order in Heritrix 1.x, where cost 
 * contributed the same bits to the queue-insert-key that precedence now does.
 */
public class CostUriPrecedencePolicy extends UriPrecedencePolicy {
    private static final long serialVersionUID = -8164425278358540710L;

    /* (non-Javadoc)
     * @see org.archive.crawler.frontier.precedence.UriPrecedencePolicy#uriScheduled(org.archive.crawler.datamodel.CrawlURI)
     */
    @Override
    public void uriScheduled(CrawlURI curi) {
        curi.setPrecedence(curi.getHolderCost()); 
    }
}
```
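What "precedence equals cost" means for ordering can be sketched without Heritrix. Everything in the block below is hypothetical: the `Uri` class, the `PrecedencePolicy` interface and the use of a `PriorityQueue` are stand-ins, and the sketch assumes that a lower precedence value is served first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model of a cost-based precedence policy; not Heritrix code.
final class CostPrecedenceSketch {

    // Minimal stand-in for a CrawlURI: a URL, a holder cost and a precedence.
    static final class Uri {
        final String url;
        final int holderCost;
        int precedence;

        Uri(String url, int holderCost) {
            this.url = url;
            this.holderCost = holderCost;
        }
    }

    // Stand-in for the abstract uriScheduled(CrawlURI) hook.
    interface PrecedencePolicy {
        void uriScheduled(Uri uri);
    }

    // Same idea as CostUriPrecedencePolicy: copy the cost into the precedence.
    static final PrecedencePolicy COST_AS_PRECEDENCE = uri -> {
        uri.precedence = uri.holderCost;
    };

    // Apply the policy to every URI, then drain a queue ordered by precedence.
    static List<String> scheduleOrder(List<Uri> uris) {
        PriorityQueue<Uri> queue =
                new PriorityQueue<>(Comparator.comparingInt((Uri u) -> u.precedence));
        for (Uri uri : uris) {
            COST_AS_PRECEDENCE.uriScheduled(uri);
            queue.add(uri);
        }
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().url);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Uri> uris = Arrays.asList(
                new Uri("http://example.com/expensive", 3),
                new Uri("http://example.com/cheap", 1),
                new Uri("http://example.com/medium", 2));
        // Under the stated assumption, cheaper URIs come out first.
        System.out.println(scheduleOrder(uris));
    }
}
```

With a cost policy that charges every URI the same amount, all URIs receive the same precedence, so this policy only changes the ordering once a non-uniform cost assignment policy is configured.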
The policies used by the FrontierPreparer processor bean are configured in the crawler-beans.cxml configuration file as follows:
```xml
<!-- 
   OPTIONAL BEANS
    Uncomment and expand as needed, or if non-default alternate 
    implementations are preferred.
  -->

<!-- CANONICALIZATION POLICY -->
<bean id="canonicalizationPolicy" 
  class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
  <property name="rules">
   <list>
    <bean class="org.archive.modules.canonicalize.LowercaseRule" />
    <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
    <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
    <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
    <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
    <bean class="org.archive.modules.canonicalize.FixupQueryString" />
   </list>
  </property>
</bean>

<!-- QUEUE ASSIGNMENT POLICY -->
<bean id="queueAssignmentPolicy" 
  class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <property name="forceQueueAssignment" value="" />
  <property name="deferToPrevious" value="true" />
  <property name="parallelQueues" value="1" />
</bean>

<!-- URI PRECEDENCE POLICY -->
<bean id="uriPrecedencePolicy" 
  class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">
</bean>

<!-- COST ASSIGNMENT POLICY -->
<bean id="costAssignmentPolicy" 
  class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">
</bean>
```
---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is my original work.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/29/3050992.html