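The abstract class Scoper from the previous post holds another member variable, DecideRule scope. I have to pause the analysis of the processor classes here (I will come back to them later) and take a detour through the DecideRule scope object. As mentioned before, the DecideRule scope member controls which CrawlURI caUri objects fall within the crawl scope.
As usual, let's first look at the class diagram of the DecideRule family.
DecideRule is an abstract class that decides whether a CrawlURI caUri object is accepted or rejected.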
public DecideResult decisionFor(CrawlURI uri) {
    if (!getEnabled()) {
        return DecideResult.NONE;
    }
    DecideResult result = innerDecide(uri);
    if (result == DecideResult.NONE) {
        return result;
    }
    return result;
}

protected abstract DecideResult innerDecide(CrawlURI uri);

public DecideResult onlyDecision(CrawlURI uri) {
    return null;
}

public boolean accepts(CrawlURI uri) {
    return DecideResult.ACCEPT == decisionFor(uri);
}
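The abstract method DecideResult innerDecide(CrawlURI uri) above is implemented by subclasses; decisionFor only checks whether the rule is enabled and then delegates to it.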
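To make this template-method structure concrete, here is a minimal sketch. It does not use the Heritrix classes: Result, Rule and TooLongRule are hypothetical stand-ins invented for illustration, and a plain String takes the place of CrawlURI.

```java
// Simplified stand-ins for illustration only; these are NOT the Heritrix classes.
final class DecideRuleSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Same shape as DecideRule: a public entry point wrapping an abstract hook. */
    static abstract class Rule {
        private boolean enabled = true;

        void setEnabled(boolean enabled) { this.enabled = enabled; }

        final Result decisionFor(String uri) {
            if (!enabled) {
                return Result.NONE;      // a disabled rule abstains
            }
            return innerDecide(uri);     // the subclass supplies the real logic
        }

        boolean accepts(String uri) {
            return decisionFor(uri) == Result.ACCEPT;
        }

        protected abstract Result innerDecide(String uri);
    }

    /** Hypothetical rule: rejects URIs longer than a limit, otherwise abstains. */
    static final class TooLongRule extends Rule {
        private final int maxLength;

        TooLongRule(int maxLength) { this.maxLength = maxLength; }

        @Override
        protected Result innerDecide(String uri) {
            return uri.length() > maxLength ? Result.REJECT : Result.NONE;
        }
    }

    public static void main(String[] args) {
        Rule rule = new TooLongRule(20);
        System.out.println(rule.decisionFor("http://example.com/"));                 // NONE (19 characters)
        System.out.println(rule.decisionFor("http://example.com/a/very/long/path")); // REJECT
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("http://example.com/a/very/long/path")); // NONE (rule disabled)
    }
}
```

Note that accepts returns false for both NONE and REJECT, so a rule that merely abstains does not count as accepting the URI.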
DecideResult is an enum with three values: ACCEPT, NONE and REJECT.
/**
 * The decision of a DecideRule.
 *
 * @author pjack
 */
public enum DecideResult {

    /** Indicates the URI was accepted. */
    ACCEPT,

    /** Indicates the URI was neither accepted nor rejected. */
    NONE,

    /** Indicates the URI was rejected. */
    REJECT;

    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
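The behaviour of invert is easy to check in isolation. The sketch below nests a copy of the enum inside a hypothetical wrapper class, InvertSketch, only so that it compiles on its own.

```java
final class InvertSketch {
    // Copy of the enum shown above, nested here only to keep the example self-contained.
    enum DecideResult {
        ACCEPT, NONE, REJECT;

        static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT: return REJECT;
                case REJECT: return ACCEPT;
                default:     return result;   // NONE stays NONE
            }
        }
    }

    public static void main(String[] args) {
        for (DecideResult r : DecideResult.values()) {
            System.out.println(r + " -> " + DecideResult.invert(r));
        }
        // Prints:
        // ACCEPT -> REJECT
        // NONE -> NONE
        // REJECT -> ACCEPT
    }
}
```

Inverting swaps ACCEPT and REJECT but leaves NONE alone, so a rule that abstains still abstains after inversion.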
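Next, let's look at its important subclass DecideRuleSequence. This class holds a collection of DecideRule objects, and its DecideResult innerDecide(CrawlURI uri) method iterates over that collection and calls DecideResult decisionFor(CrawlURI uri) on each element (a combination of the Composite and Iterator patterns). The last result other than NONE becomes the decision of the whole sequence.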
@SuppressWarnings("unchecked")
public List<DecideRule> getRules() {
    return (List<DecideRule>) kp.get("rules");
}

public void setRules(List<DecideRule> rules) {
    kp.put("rules", rules);
}

public DecideResult innerDecide(CrawlURI uri) {
    DecideRule decisiveRule = null;
    int decisiveRuleNumber = -1;
    DecideResult result = DecideResult.NONE;
    List<DecideRule> rules = getRules();
    int max = rules.size();
    for (int i = 0; i < max; i++) {
        DecideRule rule = rules.get(i);
        if (rule.onlyDecision(uri) != result) {
            DecideResult r = rule.decisionFor(uri);
            if (LOGGER.isLoggable(Level.FINEST)) {
                LOGGER.finest("DecideRule #" + i + " "
                        + rule.getClass().getName() + " returned " + r
                        + " for url: " + uri);
            }
            if (r != DecideResult.NONE) {
                result = r;
                decisiveRule = rule;
                decisiveRuleNumber = i;
            }
        }
    }
    if (fileLogger != null) {
        fileLogger.info(decisiveRuleNumber + " "
                + decisiveRule.getClass().getSimpleName() + " "
                + result + " " + uri);
    }
    return result;
}
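The onlyDecision check in the loop is easy to overlook: a rule whose only possible decision equals the current result cannot change that result, so it is not consulted at all. The following is a rough sketch of that control flow with the logging removed. Rule, CountingReject and the String-based URI are hypothetical stand-ins, not Heritrix types.

```java
import java.util.Arrays;
import java.util.List;

final class SequenceSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Stand-in for DecideRule; a String replaces CrawlURI. */
    interface Rule {
        Result decisionFor(String uri);

        /** The single decision this rule can ever produce, or null if that is not known. */
        default Result onlyDecision(String uri) { return null; }
    }

    /** Same control flow as the innerDecide loop above, minus logging. */
    static Result decide(List<Rule> rules, String uri) {
        Result result = Result.NONE;
        for (Rule rule : rules) {
            // A rule that could only repeat the current result is skipped.
            if (rule.onlyDecision(uri) != result) {
                Result r = rule.decisionFor(uri);
                if (r != Result.NONE) {
                    result = r;   // a later non-NONE decision overrides an earlier one
                }
            }
        }
        return result;
    }

    /** Hypothetical always-reject rule that counts how often it is consulted. */
    static final class CountingReject implements Rule {
        int calls = 0;

        @Override
        public Result decisionFor(String uri) {
            calls++;
            return Result.REJECT;
        }

        @Override
        public Result onlyDecision(String uri) {
            return Result.REJECT;
        }
    }

    public static void main(String[] args) {
        CountingReject first = new CountingReject();
        CountingReject second = new CountingReject();
        Result result = decide(Arrays.<Rule>asList(first, second), "http://example.com/");
        // The second rule is never called because the result is already REJECT.
        System.out.println(result + " first=" + first.calls + " second=" + second.calls);
        // Prints: REJECT first=1 second=0
    }
}
```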
At runtime, the elements of this collection are the ones declared in the crawler-beans.cxml configuration file:
<!-- SCOPE: rules for which discovered URIs to crawl; order is very
     important because last decision returned other than 'NONE' wins. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <property name="rules">
    <list>
      <!-- Begin by REJECTing all... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule">
      </bean>
      <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
        <!-- <property name="alsoCheckVia" value="false" /> -->
        <!-- <property name="surtsSourceFile" value="" /> -->
        <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
        <!-- <property name="surtsSource">
               <bean class="org.archive.spring.ConfigString">
                 <property name="value">
                   <value>
                     # example.com
                     # http://www.example.edu/path1/
                     # +http://(org,example,
                   </value>
                 </property>
               </bean>
             </property> -->
      </bean>
      <!-- ...but REJECT those more than a configured link-hop-count from start... -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <!-- <property name="maxHops" value="20" /> -->
      </bean>
      <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
        <!-- <property name="maxTransHops" value="2" /> -->
        <!-- <property name="maxSpeculativeHops" value="1" /> -->
      </bean>
      <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="seedsAsSurtPrefixes" value="false"/>
        <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" />
        <!-- <property name="surtsSource">
               <bean class="org.archive.spring.ConfigFile">
                 <property name="path" value="negative-surts.txt" />
               </bean>
             </property> -->
      </bean>
      <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <!-- <property name="listLogicalOr" value="true" /> -->
        <!-- <property name="regexList">
               <list>
               </list>
             </property> -->
      </bean>
      <!-- ...and REJECT those with suspicious repeating path-segments... -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
        <!-- <property name="maxRepetitions" value="2" /> -->
      </bean>
      <!-- ...and REJECT those with more than threshold number of path-segments... -->
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
        <!-- <property name="maxPathDepth" value="20" /> -->
      </bean>
      <!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
      <!-- ...but always REJECT those with unsupported URI schemes -->
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
      </bean>
    </list>
  </property>
</bean>
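Because the last decision other than NONE wins, the order of the beans matters. The sketch below replays a heavily reduced version of this chain to show the effect. The three lambda rules are hypothetical simplifications (a substring test instead of real SURT prefix matching, a crude scheme check), and ScopeOrderSketch is not part of Heritrix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ScopeOrderSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Last decision other than NONE wins, as the comment in the configuration states. */
    static Result decide(List<Function<String, Result>> rules, String uri) {
        Result result = Result.NONE;
        for (Function<String, Result> rule : rules) {
            Result r = rule.apply(uri);
            if (r != Result.NONE) {
                result = r;
            }
        }
        return result;
    }

    /** Three hypothetical rules, ordered like the first, second and last beans of the scope. */
    static List<Function<String, Result>> sampleScope() {
        List<Function<String, Result>> rules = new ArrayList<>();
        // 1. Begin by rejecting everything.
        rules.add(uri -> Result.REJECT);
        // 2. Accept URIs on the seed host (crude substring test, an assumption of this sketch).
        rules.add(uri -> uri.contains("example.com") ? Result.ACCEPT : Result.NONE);
        // 3. Always reject unsupported schemes (here: anything not starting with "http").
        rules.add(uri -> uri.startsWith("http") ? Result.NONE : Result.REJECT);
        return rules;
    }

    public static void main(String[] args) {
        List<Function<String, Result>> scope = sampleScope();
        System.out.println(decide(scope, "http://example.com/a"));        // ACCEPT: rule 2 overrides rule 1
        System.out.println(decide(scope, "http://other.org/"));           // REJECT: only rule 1 decides
        System.out.println(decide(scope, "mailto:someone@example.com"));  // REJECT: rule 3 overrides rule 2
    }
}
```

Moving the reject-everything rule to the end of the list would make every URI come out as REJECT, which is why the configuration starts with it.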
The abstract class PredicatedDecideRule extends DecideRule. It returns a configured decision when a predicate holds and NONE otherwise.
@Override
protected DecideResult innerDecide(CrawlURI uri) {
    if (evaluate(uri)) {
        return getDecision();
    }
    return DecideResult.NONE;
}

protected abstract boolean evaluate(CrawlURI object);
The boolean evaluate(CrawlURI object) method is implemented by subclasses.
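Separating the predicate (evaluate) from the outcome (getDecision) means the same test can be configured to either ACCEPT or REJECT, which is what the decision property in the XML above does. Here is a minimal sketch of that idea. PredicatedRule and SuffixRule are hypothetical stand-ins, a String replaces CrawlURI, and the ACCEPT default is an assumption of this sketch.

```java
final class PredicatedSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Stand-in for PredicatedDecideRule. */
    static abstract class PredicatedRule {
        private Result decision = Result.ACCEPT;   // default chosen for this sketch (assumption)

        Result getDecision() { return decision; }

        void setDecision(Result decision) { this.decision = decision; }

        Result innerDecide(String uri) {
            // Return the configured decision only when the predicate matches.
            return evaluate(uri) ? getDecision() : Result.NONE;
        }

        protected abstract boolean evaluate(String uri);
    }

    /** Hypothetical predicate: the URI ends with a given suffix. */
    static final class SuffixRule extends PredicatedRule {
        private final String suffix;

        SuffixRule(String suffix) { this.suffix = suffix; }

        @Override
        protected boolean evaluate(String uri) {
            return uri.endsWith(suffix);
        }
    }

    public static void main(String[] args) {
        SuffixRule rule = new SuffixRule(".exe");
        System.out.println(rule.innerDecide("http://example.com/setup.exe"));   // ACCEPT (sketch default)
        rule.setDecision(Result.REJECT);
        System.out.println(rule.innerDecide("http://example.com/setup.exe"));   // REJECT
        System.out.println(rule.innerDecide("http://example.com/index.html"));  // NONE (predicate is false)
    }
}
```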
I will not go through the other implementation classes one by one.
---------------------------------------------------------------------------
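This Heritrix 3.1.0 source code analysis series is my own original work.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037547.html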