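The previous post analyzed how Heritrix 3.1.0 wraps the request-handling classes of the HttpClient component. This post looks at how Heritrix 3.1.0 wraps request credentials.
The classes in the package org.archive.modules.credential all deal with request credentials.
Start with the CredentialStore class. It stores all of the application's credentials (Credential) in a Map, and callers obtain credentials through this class.
Its important methods are as follows: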
KeyedProperties kp = new KeyedProperties();
public KeyedProperties getKeyedProperties() {
    return kp;
}

/**
 * Credentials used by heritrix authenticating. See
 * http://crawler.archive.org/proposals/auth/ for background.
 *
 * @see http://crawler.archive.org/proposals/auth/
 */
{
    setCredentials(new HashMap<String, Credential>());
}
@SuppressWarnings("unchecked")
public Map<String,Credential> getCredentials() {
    return (Map<String,Credential>) kp.get("credentials");
}
public void setCredentials(Map<String,Credential> map) {
    kp.put("credentials",map);
}

/**
 * List of possible credential types as a List.
 *
 * This types are inner classes of this credential type so they cannot
 * be created without their being associated with a credential list.
 */
private static final List<Class<?>> credentialTypes;
// Initialize the credentialType data member.
static {
    // Array of all known credential types.
    Class<?> [] tmp = {HtmlFormCredential.class, HttpAuthenticationCredential.class};
    credentialTypes = Collections.unmodifiableList(Arrays.asList(tmp));
}

/**
 * Constructor.
 */
public CredentialStore() {
}

/**
 * @return Unmodifable list of credential types.
 */
public static List<Class<?>> getCredentialTypes() {
    return CredentialStore.credentialTypes;
}

/**
 * @param context Pass a ProcessorURI. Used to set
 * context.
 * @return An iterator or null.
 */
public Collection<Credential> getAll() {
    Map<String,Credential> map = getCredentials();
    return map.values();
}

/**
 * @param context Used to set context.
 * @param name Name to give the manufactured credential. Should be unique
 * else the add of the credential to the list of credentials will fail.
 * @return Returns <code>name</code>'d credential.
 * @throws AttributeNotFoundException
 * @throws MBeanException
 * @throws ReflectionException
 */
public Credential get(/*StateProvider*/Object context, String name) {
    return getCredentials().get(name);
}

/**
 * Return set made up of all credentials of the passed
 * <code>type</code>.
 *
 * @param context Used to set context.
 * @param type Type of the list to return. Type is some superclass of
 * credentials.
 * @param rootUri RootUri to match. May be null. In this case we return
 * all. Currently we expect the CrawlServer name to equate to root Uri.
 * Its not. Currently it doesn't distingush between servers of same name
 * but different ports (e.g. http and https).
 * @return Unmodifable sublist of all elements of passed type.
 */
public Set<Credential> subset(CrawlURI context, Class<?> type, String rootUri) {
    Set<Credential> result = null;
    for (Credential c: getAll()) {
        if (!type.isInstance(c)) {
            continue;
        }
        if (rootUri != null) {
            String cd = c.getDomain();
            if (cd == null) {
                continue;
            }
            if (!rootUri.equalsIgnoreCase(cd)) {
                continue;
            }
        }
        if (result == null) {
            result = new HashSet<Credential>();
        }
        result.add(c);
    }
    return result;
}
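These methods return all credentials (as a Map through getCredentials(), or as a Collection through getAll()), look up a credential by name (the Map key), and list all credential types.
(Note that the final subset method does not appear to use its CrawlURI context parameter. It returns only the credentials that are of the given type and, when rootUri is non-null, belong to the given domain; when nothing matches it returns null rather than an empty set.)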
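The filtering rule in subset can be tried in isolation with a small sketch. The types below (Cred, FormCred, HttpAuthCred) and the sample names are hypothetical stand-ins, not the Heritrix classes, and unlike the original this version returns an empty set instead of null when nothing matches.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustration only: Cred, FormCred and HttpAuthCred are hypothetical
// stand-ins, not the Heritrix Credential classes.
class CredentialSubsetSketch {

    static abstract class Cred {
        final String domain;
        Cred(String domain) { this.domain = domain; }
    }
    static class FormCred extends Cred { FormCred(String d) { super(d); } }
    static class HttpAuthCred extends Cred { HttpAuthCred(String d) { super(d); } }

    /**
     * Keeps credentials that are instances of type and, when rootUri is
     * non-null, whose domain equals rootUri ignoring case. Unlike the
     * original, an empty set (not null) is returned when nothing matches.
     */
    static Set<Cred> subset(Map<String, Cred> store, Class<?> type, String rootUri) {
        Set<Cred> result = new HashSet<>();
        for (Cred c : store.values()) {
            if (!type.isInstance(c)) {
                continue; // wrong credential type
            }
            if (rootUri != null
                    && (c.domain == null || !rootUri.equalsIgnoreCase(c.domain))) {
                continue; // domain filter requested and not satisfied
            }
            result.add(c);
        }
        return result;
    }

    /** Builds a small store keyed by name, like the "credentials" map. */
    static Map<String, Cred> sampleStore() {
        Map<String, Cred> store = new LinkedHashMap<>();
        store.put("form1", new FormCred("example.com"));
        store.put("form2", new FormCred("other.org"));
        store.put("basic1", new HttpAuthCred("example.com"));
        return store;
    }

    public static void main(String[] args) {
        Map<String, Cred> store = sampleStore();
        // Form credentials for one domain (case-insensitive match).
        System.out.println(subset(store, FormCred.class, "EXAMPLE.com").size());
        // A null rootUri disables the domain filter.
        System.out.println(subset(store, Cred.class, null).size());
    }
}
```

Returning an empty set spares callers a null check; the original's null return means callers there have to test for null before iterating.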
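As the static initializer shows, the system provides two credential types, HtmlFormCredential.class and HttpAuthenticationCredential.class. The former is used for HTML form authentication, the latter for Basic/Digest HTTP authentication.
Both types extend the abstract class Credential. Here are that class's methods: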
/**
 * The root domain this credential goes against: E.g. www.archive.org
 */
String domain = "";

/**
 * @param context Context to use when searching for credential domain.
 * @return The domain/root URI this credential is to go against.
 * @throws AttributeNotFoundException If attribute not found.
 */
public String getDomain() {
    return this.domain;
}
public void setDomain(String domain) {
    this.domain = domain;
}

/**
 * Attach this credentials avatar to the passed <code>curi</code> .
 *
 * Override if credential knows internally what it wants to attach as
 * payload. Otherwise, if payload is external, use the below
 * {@link #attach(CrawlURI, String)}.
 *
 * @param curi CrawlURI to load with credentials.
 */
public void attach(CrawlURI curi) {
    curi.getCredentials().add(this);
}

/**
 * Detach this credential from passed curi.
 *
 * @param curi
 * @return True if we detached a Credential reference.
 */
public boolean detach(CrawlURI curi) {
    return curi.getCredentials().remove(this);
}

/**
 * Detach all credentials of this type from passed curi.
 *
 * @param curi
 * @return True if we detached references.
 */
public boolean detachAll(CrawlURI curi) {
    boolean result = false;
    Iterator<Credential> iter = curi.getCredentials().iterator();
    while (iter.hasNext()) {
        Credential cred = iter.next();
        if (cred.getClass() == this.getClass()) {
            iter.remove();
            result = true;
        }
    }
    return result;
}

/**
 * Checks whether the passed CrawlURI is itself the prerequisite
 * (authentication) URI of this credential.
 * @param curi CrawlURI to look at.
 * @return True if this credential IS a prerequisite for passed
 * CrawlURI.
 */
public abstract boolean isPrerequisite(CrawlURI curi);

/**
 * Checks whether an authentication URI exists for the passed CrawlURI.
 * @param curi CrawlURI to look at.
 * @return True if this credential HAS a prerequisite for passed CrawlURI.
 */
public abstract boolean hasPrerequisite(CrawlURI curi);

/**
 * Return the authentication URI, either absolute or relative, that serves
 * as prerequisite the passed <code>curi</code>.
 *
 * @param curi CrawlURI to look at.
 * @return Prerequisite URI for the passed curi.
 */
public abstract String getPrerequisite(CrawlURI curi);

/**
 * Returns the key that identifies this credential.
 * @param context Context to use when searching for credential domain.
 * @return Key that is unique to this credential type.
 * @throws AttributeNotFoundException
 */
public abstract String getKey();

/**
 * Checks whether this credential has to be offered on every request.
 * @return True if this credential is of the type that needs to be offered
 * on each visit to the server (e.g. Rfc2617 is such a type).
 */
public abstract boolean isEveryTime();

/**
 * Adds the authentication parameters to the HttpMethod.
 * @param curi CrawlURI to as for context.
 * @param http Instance of httpclient.
 * @param method Method to populate.
 * @return True if added a credentials.
 */
public abstract boolean populate(CrawlURI curi, HttpClient http,
        HttpMethod method);

/**
 * Whether the credential is submitted by POST.
 * @param curi CrawlURI to look at.
 * @return True if this credential is to be posted. Return false if the
 * credential is to be GET'd or if POST'd or GET'd are not pretinent to this
 * credential type.
 */
public abstract boolean isPost();

/**
 * Checks whether the CrawlServer name for the passed curi matches this
 * credential's domain (used to rule out CrawlURIs that do not need this
 * credential).
 * Test passed curi matches this credentials rootUri.
 * @param controller
 * @param curi CrawlURI to test.
 * @return True if domain for credential matches that of the passed curi.
 */
public boolean rootUriMatch(ServerCache cache, CrawlURI curi) {
    String cd = getDomain();
    CrawlServer serv = cache.getServerFor(curi.getUURI());
    String serverName = serv.getName();
    // String serverName = controller.getServerCache().getServerFor(curi).
    //     getName();
    logger.fine("RootURI: Comparing " + serverName + " " + cd);
    return cd != null && serverName != null &&
        serverName.equalsIgnoreCase(cd);
}
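These methods attach the current credential to a CrawlURI, detach it again, populate an HttpMethod with the credential's parameters, and check whether the CrawlURI's server name matches the credential's domain.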
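One detail worth seeing in action is that detachAll removes every credential of the same concrete class as the caller, not only the calling instance. The sketch below uses hypothetical stand-ins (Uri for CrawlURI, Cred for Credential) and replaces the explicit iterator loop with Collection.removeIf, which reports whether anything was removed.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: Uri and Cred are hypothetical stand-ins for
// CrawlURI and Credential.
class AttachDetachSketch {

    /** Stand-in for CrawlURI: just holds the attached credentials. */
    static class Uri {
        final Set<Cred> credentials = new HashSet<>();
    }

    static abstract class Cred {
        void attach(Uri uri) {
            uri.credentials.add(this);
        }

        /** Removes only this instance. */
        boolean detach(Uri uri) {
            return uri.credentials.remove(this);
        }

        /** Removes every credential whose concrete class equals this one's. */
        boolean detachAll(Uri uri) {
            return uri.credentials.removeIf(c -> c.getClass() == this.getClass());
        }
    }

    static class FormCred extends Cred { }
    static class HttpAuthCred extends Cred { }

    public static void main(String[] args) {
        Uri uri = new Uri();
        new FormCred().attach(uri);
        new FormCred().attach(uri);
        new HttpAuthCred().attach(uri);

        // A fresh FormCred that was never attached still clears both
        // attached FormCred instances, because matching is by class.
        boolean removed = new FormCred().detachAll(uri);
        System.out.println(removed + " remaining=" + uri.credentials.size());
    }
}
```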
HtmlFormCredential extends the Credential class above and provides HTML form authentication for a CrawlURI. Its implementation is as follows:
/**
 * Full URI of page that contains the HTML login form we're to apply these
 * credentials too: E.g. http://www.archive.org
 */
String loginUri = "";
public String getLoginUri() {
    return this.loginUri;
}
public void setLoginUri(String loginUri) {
    this.loginUri = loginUri;
}

/**
 * Form items.
 */
Map<String,String> formItems = new HashMap<String,String>();
public Map<String,String> getFormItems() {
    return this.formItems;
}
public void setFormItems(Map<String,String> formItems) {
    this.formItems = formItems;
}

enum Method {
    GET,
    POST
}

/**
 * GET or POST.
 */
Method httpMethod = Method.POST;
public Method getHttpMethod() {
    return this.httpMethod;
}
public void setHttpMethod(Method method) {
    this.httpMethod = method;
}

/**
 * Constructor.
 */
public HtmlFormCredential() {
}

public boolean isPrerequisite(final CrawlURI curi) {
    boolean result = false;
    String curiStr = curi.getUURI().toString();
    String loginUri = getPrerequisite(curi);
    if (loginUri != null) {
        try {
            // login URL
            UURI uuri = UURIFactory.getInstance(curi.getUURI(), loginUri);
            if (uuri != null && curiStr != null &&
                    uuri.toString().equals(curiStr)) {
                result = true;
                if (!curi.isPrerequisite()) {
                    curi.setPrerequisite(true);
                    logger.fine(curi + " is prereq.");
                }
            }
        } catch (URIException e) {
            logger.severe("Failed to uuri: " + curi + ", " +
                e.getMessage());
        }
    }
    return result;
}

public boolean hasPrerequisite(CrawlURI curi) {
    return getPrerequisite(curi) != null;
}

public String getPrerequisite(CrawlURI curi) {
    return getLoginUri();
}

public String getKey() {
    return getLoginUri();
}

public boolean isEveryTime() {
    // This authentication is one time only.
    return false;
}

public boolean populate(CrawlURI curi, HttpClient http, HttpMethod method) {
    // http is not used
    boolean result = false;
    Map<String,String> formItems = getFormItems();
    if (formItems == null || formItems.size() <= 0) {
        try {
            logger.severe("No form items for " + method.getURI());
        } catch (URIException e) {
            logger.severe("No form items and exception getting uri: " +
                e.getMessage());
        }
        return result;
    }

    NameValuePair[] data = new NameValuePair[formItems.size()];
    int index = 0;
    String key = null;
    for (Iterator<String> i = formItems.keySet().iterator(); i.hasNext();) {
        key = i.next();
        data[index++] = new NameValuePair(key, (String)formItems.get(key));
    }
    if (method instanceof PostMethod) {
        ((PostMethod)method).setRequestBody(data);
        result = true;
    } else if (method instanceof GetMethod) {
        // Append these values to the query string.
        // Get current query string, then add data, then get it again
        // only this time its our data only... then append.
        HttpMethodBase hmb = (HttpMethodBase)method;
        String currentQuery = hmb.getQueryString();
        hmb.setQueryString(data);
        String newQuery = hmb.getQueryString();
        hmb.setQueryString(
            ((StringUtils.isNotEmpty(currentQuery))
                ? currentQuery + "&"
                : "") + newQuery);
        result = true;
    } else {
        logger.severe("Unknown method type: " + method);
    }
    return result;
}

public boolean isPost() {
    return Method.POST.equals(getHttpMethod());
}
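What these methods do is already described in the comments on the abstract methods of Credential above, so I won't repeat it here.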
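The GET branch of populate keeps any existing query string and appends the form items after it. The following is a minimal sketch of that idea only. The original leaves the encoding of the name/value pairs to HttpMethodBase.setQueryString, whereas this version uses java.net.URLEncoder as a stand-in (the Charset overload requires Java 10 or later); the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration only: not the Heritrix or commons-httpclient code.
class QueryAppendSketch {

    /** Form-encodes a single name or value (stand-in for HttpClient's encoding). */
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /**
     * Appends the form items to currentQuery, inserting "&" only when
     * there is already a non-empty query string.
     */
    static String appendFormItems(String currentQuery, Map<String, String> formItems) {
        String newQuery = formItems.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        if (currentQuery == null || currentQuery.isEmpty()) {
            return newQuery;
        }
        return currentQuery + "&" + newQuery;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is predictable.
        Map<String, String> items = new LinkedHashMap<>();
        items.put("login", "my login");
        items.put("password", "p&w");
        System.out.println(appendFormItems("a=1", items));
        System.out.println(appendFormItems(null, items));
    }
}
```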
The HttpAuthenticationCredential class provides Basic/Digest HTTP authentication. I won't go through its source in detail; it is not hard to follow by comparison with the HtmlFormCredential class.
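Since the source of HttpAuthenticationCredential is not shown here, the sketch below does not reproduce it. It only illustrates what the Basic scheme (RFC 7617) puts on the wire for a login and password: the Base64 encoding of "login:password" sent in the Authorization header. The class and method names are hypothetical, and the Digest scheme is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustration only: shows the Basic scheme itself, not how Heritrix or
// HttpClient build the header internally.
class BasicAuthSketch {

    /** Returns the value of the Authorization header for Basic auth. */
    static String basicAuthorizationValue(String login, String password) {
        String pair = login + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder values matching the sample configuration's style.
        System.out.println("Authorization: "
                + basicAuthorizationValue("mylogin", "mypassword"));
    }
}
```

Because the encoding is reversible, Basic credentials are only as safe as the transport, which is one reason a credential of this kind is tied to a specific domain and realm.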
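The official Heritrix 3.1.0 reference documentation gives sample configuration for both authentication types in the crawler-beans.cxml file. Some keywords in the official sample are wrong: for example, going by the setters setLoginUri and setFormItems shown above, the property names would be loginUri and formItems rather than login-uri and form-items, and the map entry refers to formCredential while the bean id is credential.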
<bean id="credentialStore"
      class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="formCredential" value-ref="formCredential" />
    </map>
  </property>
</bean>

<bean id="credential"
      class="org.archive.modules.credential.HtmlFormCredential">
  <property name="domain" value="example.com" />
  <property name="login-uri" value="http://example.com/login"/>
  <property name="form-items">
    <map>
      <entry key="login" value="mylogin"/>
      <entry key="password" value="mypassword"/>
      <entry key="submit" value="submit"/>
    </map>
  </property>
</bean>

<bean id="credential"
      class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain"><value>domain</value></property>
  <property name="realm"><value>myrealm</value></property>
  <property name="login"><value>mylogin</value></property>
  <property name="password"><value>mypassword</value></property>
</bean>
---------------------------------------------------------------------------
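This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049042.html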