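The previous article analyzed how Heritrix 3.1.0 wraps the request-handling classes of the HttpClient component. This article looks at how Heritrix 3.1.0 wraps request credentials.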

The classes in the Heritrix 3.1.0 package org.archive.modules.credential all deal with request credentials.

Start with the CredentialStore class. It keeps all of the application's credentials (Credential) in a Map, and outside code obtains credentials by calling this class.

Its important methods are as follows

    KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }
    
    /**
     * Credentials used by heritrix authenticating. See
     * http://crawler.archive.org/proposals/auth/ for background.
     * 
     * @see http://crawler.archive.org/proposals/auth/
     */
    {
        setCredentials(new HashMap<String, Credential>());
    }
    @SuppressWarnings("unchecked")
    public Map<String,Credential> getCredentials() {
        return (Map<String,Credential>) kp.get("credentials");
    }
    public void setCredentials(Map<String,Credential> map) {
        kp.put("credentials",map);
    }
    
    /**
     * List of possible credential types as a List.
     *
     * This types are inner classes of this credential type so they cannot
     * be created without their being associated with a credential list.
     */
    private static final List<Class<?>> credentialTypes;
    // Initialize the credentialType data member.
    static {
        // Array of all known credential types.
        Class<?> [] tmp = {HtmlFormCredential.class, HttpAuthenticationCredential.class};
        credentialTypes = Collections.unmodifiableList(Arrays.asList(tmp));
    }

    /**
     * Constructor.
     */
    public CredentialStore() {
    }

    /**
     * @return Unmodifable list of credential types.
     */
    public static List<Class<?>> getCredentialTypes() {
        return CredentialStore.credentialTypes;
    }


    /**
     * @param context Pass a ProcessorURI.  Used to set
     * context.
     * @return An iterator or null.
     */
    public Collection<Credential> getAll() {
        Map<String,Credential> map = getCredentials();
        return map.values();
    }

    /**
     * @param context  Used to set context.
     * @param name Name to give the manufactured credential.  Should be unique
     * else the add of the credential to the list of credentials will fail.
     * @return Returns <code>name</code>'d credential.
     * @throws AttributeNotFoundException
     * @throws MBeanException
     * @throws ReflectionException
     */
    public Credential get(/*StateProvider*/Object context, String name) {
        return getCredentials().get(name);
    }
    /**
     * Return set made up of all credentials of the passed
     * <code>type</code>.
     *
     * @param context  Used to set context.  
     * @param type Type of the list to return.  Type is some superclass of
     * credentials.
     * @param rootUri RootUri to match.  May be null.  In this case we return
     * all.  Currently we expect the CrawlServer name to equate to root Uri.
     * Its not.  Currently it doesn't distingush between servers of same name
     * but different ports (e.g. http and https).
     * @return Unmodifable sublist of all elements of passed type.
     */
    public Set<Credential> subset(CrawlURI context, Class<?> type, String rootUri) {
        Set<Credential> result = null;
        for (Credential c: getAll()) {
            if (!type.isInstance(c)) {
                continue;
            }
            if (rootUri != null) {
                String cd = c.getDomain();
                if (cd == null) {
                    continue;
                }
                if (!rootUri.equalsIgnoreCase(cd)) {
                    continue;
                }
            }
            if (result == null) {
                result = new HashSet<Credential>();
            }
            result.add(c);
        }
        return result;
    }

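The methods above return all credentials (stored as a Map), look up a single credential by name (the Map key), and return the list of all credential types.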

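(Note that the final subset method does not appear to use its CrawlURI context parameter. It returns only the credentials of the given type, restricted to the given domain when rootUri is non-null, and it returns null rather than an empty set when nothing matches.)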
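The filtering done by subset can be tried in isolation. The following is a minimal sketch, assuming simplified stand-in classes (Cred, FormCred, HttpAuthCred and the SubsetSketch wrapper are hypothetical names, not Heritrix classes); it only mirrors the type check, the case-insensitive domain check, and the lazily created result set of the method quoted above.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT the Heritrix classes.
class SubsetSketch {
    static class Cred {
        final String domain;
        Cred(String domain) { this.domain = domain; }
    }
    static class FormCred extends Cred { FormCred(String d) { super(d); } }
    static class HttpAuthCred extends Cred { HttpAuthCred(String d) { super(d); } }

    // Same steps as the quoted subset method: keep credentials of the requested
    // type and, when rootUri is non-null, only those whose domain equals it
    // ignoring case.
    static Set<Cred> subset(Collection<Cred> all, Class<?> type, String rootUri) {
        Set<Cred> result = null;
        for (Cred c : all) {
            if (!type.isInstance(c)) {
                continue;
            }
            if (rootUri != null
                    && (c.domain == null || !rootUri.equalsIgnoreCase(c.domain))) {
                continue;
            }
            if (result == null) {
                result = new HashSet<>();
            }
            result.add(c);
        }
        return result; // null, not an empty set, when nothing matched
    }

    // A small name -> credential map, like the one CredentialStore keeps.
    static Map<String, Cred> sampleStore() {
        Map<String, Cred> store = new LinkedHashMap<>();
        store.put("form1", new FormCred("example.com"));
        store.put("auth1", new HttpAuthCred("example.com"));
        store.put("form2", new FormCred("archive.org"));
        return store;
    }

    public static void main(String[] args) {
        Collection<Cred> all = sampleStore().values();
        System.out.println(subset(all, FormCred.class, "EXAMPLE.com").size()); // 1
        System.out.println(subset(all, Cred.class, null).size());              // 3
        System.out.println(subset(all, FormCred.class, "nomatch.org"));        // null
    }
}
```

Because the result can be null, a caller has to null-check it before iterating.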

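Its static initializer block shows that the system provides two credential types, HtmlFormCredential.class and HttpAuthenticationCredential.class. The former is used for HTML form authentication, the latter for Basic/Digest HTTP authentication.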

Both credential types extend the abstract class Credential. First, a look at the methods of that abstract class

    /**
     * Domain name.
     * The root domain this credential goes against: E.g. www.archive.org
     */
    String domain = "";
    /**
     * @param context Context to use when searching for credential domain.
     * @return The domain/root URI this credential is to go against.
     * @throws AttributeNotFoundException If attribute not found.
     */
    public String getDomain() {
        return this.domain;
    }
    public void setDomain(String domain) {
        this.domain = domain;
    }
    /**
     * Add this credential to the credential set of the CrawlURI curi.
     * Attach this credentials avatar to the passed <code>curi</code> .
     *
     * Override if credential knows internally what it wants to attach as
     * payload.  Otherwise, if payload is external, use the below
     * {@link #attach(CrawlURI, String)}.
     *
     * @param curi CrawlURI to load with credentials.
     */
    public void attach(CrawlURI curi) {
        curi.getCredentials().add(this);
    }

    /**
     * Remove this credential from the credential set of the CrawlURI curi.
     * Detach this credential from passed curi.
     *
     * @param curi
     * @return True if we detached a Credential reference.
     */
    public boolean detach(CrawlURI curi) {
        return curi.getCredentials().remove(this);
    }

    /**
     * Remove every credential of this same class from the CrawlURI curi
     * (credentials of other classes stay attached).
     * Detach all credentials of this type from passed curi.
     *
     * @param curi
     * @return True if we detached references.
     */
    public boolean detachAll(CrawlURI curi) {
        boolean result = false;
        Iterator<Credential> iter = curi.getCredentials().iterator();
        while (iter.hasNext()) {
            Credential cred = iter.next();
            if (cred.getClass() ==  this.getClass()) {
                iter.remove();
                result = true;
            }
        }
        return result;
    }

    /**
     * Determine whether this credential is a prerequisite for the CrawlURI curi.
     * @param curi CrawlURI to look at.
     * @return True if this credential IS a prerequisite for passed
     * CrawlURI.
     */
    public abstract boolean isPrerequisite(CrawlURI curi);

    /**
     * Determine whether an authentication URI exists for the CrawlURI curi.
     * @param curi CrawlURI to look at.
     * @return True if this credential HAS a prerequisite for passed CrawlURI.
     */
    public abstract boolean hasPrerequisite(CrawlURI curi);

    /**
     * Get the authentication URI for the CrawlURI curi.
     * Return the authentication URI, either absolute or relative, that serves
     * as prerequisite the passed <code>curi</code>.
     *
     * @param curi CrawlURI to look at.
     * @return Prerequisite URI for the passed curi.
     */
    public abstract String getPrerequisite(CrawlURI curi);

    /**
     * Get the key that identifies this credential.
     * @param context Context to use when searching for credential domain.
     * @return Key that is unique to this credential type.
     * @throws AttributeNotFoundException
     */
    public abstract String getKey();


    /**
     * Determine whether this credential must be offered on every request.
     * @return True if this credential is of the type that needs to be offered
     * on each visit to the server (e.g. Rfc2617 is such a type).
     */
    public abstract boolean isEveryTime();

    /**
     * Add the authentication parameters to the HttpMethod method.
     * @param curi CrawlURI to as for context.
     * @param http Instance of httpclient.
     * @param method Method to populate.
     * @return True if added a credentials.
     */
    public abstract boolean populate(CrawlURI curi, HttpClient http,
        HttpMethod method);

    /**
     * Whether the authentication is submitted by POST.
     * @param curi CrawlURI to look at.
     * @return True if this credential is to be posted.  Return false if the
     * credential is to be GET'd or if POST'd or GET'd are not pretinent to this
     * credential type.
     */
    public abstract boolean isPost();

    /**
     * Determine whether the name held by the CrawlServer for the CrawlURI curi
     * matches this credential's domain (used to rule out CrawlURIs that do not
     * need this credential).
     * Test passed curi matches this credentials rootUri.
     * @param controller
     * @param curi CrawlURI to test.
     * @return True if domain for credential matches that of the passed curi.
     */
    public boolean rootUriMatch(ServerCache cache, 
            CrawlURI curi) {
        String cd = getDomain();

        CrawlServer serv = cache.getServerFor(curi.getUURI());
        String serverName = serv.getName();
//        String serverName = controller.getServerCache().getServerFor(curi).
//            getName();
        logger.fine("RootURI: Comparing " + serverName + " " + cd);
        return cd != null && serverName != null &&
            serverName.equalsIgnoreCase(cd);
    }

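The methods above attach the current credential to a CrawlURI curi object, detach it again, add credential parameters to an HttpMethod method object, test whether the server name of the CrawlURI matches the credential's domain, and so on.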
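The difference between detach and detachAll is easy to miss, so here is a minimal sketch of the bookkeeping, assuming hypothetical stand-ins (Uri, Cred, FormCred, HttpAuthCred and AttachSketch are illustration names, and Uri holds only a credential set; none of this is the real CrawlURI or Credential code).

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT the Heritrix classes.
class AttachSketch {
    // Stand-in for CrawlURI: only the credential set matters here.
    static class Uri {
        final Set<Cred> credentials = new HashSet<>();
    }

    static abstract class Cred {
        void attach(Uri u) {
            u.credentials.add(this);
        }

        // Removes exactly this instance; reports whether it was present.
        boolean detach(Uri u) {
            return u.credentials.remove(this);
        }

        // Removes every credential whose class equals this credential's class.
        boolean detachAll(Uri u) {
            boolean removed = false;
            Iterator<Cred> it = u.credentials.iterator();
            while (it.hasNext()) {
                if (it.next().getClass() == this.getClass()) {
                    it.remove();
                    removed = true;
                }
            }
            return removed;
        }
    }

    static class FormCred extends Cred { }
    static class HttpAuthCred extends Cred { }

    // Attaches two form credentials and one HTTP-auth credential, then detaches
    // all form credentials; returns how many credentials are left.
    static int demo() {
        Uri u = new Uri();
        new FormCred().attach(u);
        new FormCred().attach(u);
        new HttpAuthCred().attach(u);
        new FormCred().detachAll(u); // the caller need not be attached itself
        return u.credentials.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1
    }
}
```

The comparison uses getClass() equality rather than instanceof, so a subclass of a credential type would not be removed by its parent type's detachAll.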

HtmlFormCredential extends the Credential class above and provides form authentication for a CrawlURI curi object. The relevant methods are implemented as follows

    /**
     * Full URI of page that contains the HTML login form we're to apply these
     * credentials too: E.g. http://www.archive.org
     */
    String loginUri = "";
    public String getLoginUri() {
        return this.loginUri;
    }
    public void setLoginUri(String loginUri) {
        this.loginUri = loginUri;
    }
    
    /**
     * Form items.
     */
    Map<String,String> formItems = new HashMap<String,String>();
    public Map<String,String> getFormItems() {
        return this.formItems;
    }
    public void setFormItems(Map<String,String> formItems) {
        this.formItems = formItems;
    }
    
    
    enum Method {
        GET,
        POST
    }
    /**
     * GET or POST.
     */
    Method httpMethod = Method.POST;
    public Method getHttpMethod() {
        return this.httpMethod;
    }
    public void setHttpMethod(Method method) {
        this.httpMethod = method; 
    }

    /**
     * Constructor.
     */
    public HtmlFormCredential() {
    }

    public boolean isPrerequisite(final CrawlURI curi) {
        boolean result = false;
        String curiStr = curi.getUURI().toString();
        String loginUri = getPrerequisite(curi);
        if (loginUri != null) {
            try {
                // Login URL
                UURI uuri = UURIFactory.getInstance(curi.getUURI(), loginUri);
                if (uuri != null && curiStr != null &&
                    uuri.toString().equals(curiStr)) {
                    result = true;
                    if (!curi.isPrerequisite()) {
                        curi.setPrerequisite(true);
                        logger.fine(curi + " is prereq.");
                    }
                }
            } catch (URIException e) {
                logger.severe("Failed to uuri: " + curi + ", " +
                    e.getMessage());
            }
        }
        return result;
    }

    public boolean hasPrerequisite(CrawlURI curi) {
        return getPrerequisite(curi) != null;
    }

    public String getPrerequisite(CrawlURI curi) {
        return getLoginUri();
    }

    public String getKey() {
        return getLoginUri();
    }

    public boolean isEveryTime() {
        // This authentication is one time only.
        return false;
    }

    public boolean populate(CrawlURI curi, HttpClient http, HttpMethod method) {
        // http is not used
        boolean result = false;
        Map<String,String> formItems = getFormItems();
        if (formItems == null || formItems.size() <= 0) {
            try {
                logger.severe("No form items for " + method.getURI());
            } catch (URIException e) {
                logger.severe("No form items and exception getting uri: " +
                    e.getMessage());
            }
            return result;
        }

        NameValuePair[] data = new NameValuePair[formItems.size()];
        int index = 0;
        String key = null;
        for (Iterator<String> i = formItems.keySet().iterator(); i.hasNext();) {
            key = i.next();
            data[index++] = new NameValuePair(key, (String)formItems.get(key));
        }
        if (method instanceof PostMethod) {
            ((PostMethod)method).setRequestBody(data);
            result = true;
        } else if (method instanceof GetMethod) {
            // Append these values to the query string.
            // Get current query string, then add data, then get it again
            // only this time its our data only... then append.
            HttpMethodBase hmb = (HttpMethodBase)method;
            String currentQuery = hmb.getQueryString();
            hmb.setQueryString(data);
            String newQuery = hmb.getQueryString();
            hmb.setQueryString(
                ((StringUtils.isNotEmpty(currentQuery)) ? currentQuery + "&" : "") +
                newQuery);
            result = true;
        } else {
            logger.severe("Unknown method type: " + method);
        }
        return result;
    }

    public boolean isPost() {
        return Method.POST.equals(getHttpMethod());
    }

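What these methods do is already described in the comments I added to the abstract methods of Credential, so I will not repeat it here.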
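For the GET branch of populate, the form items end up appended to whatever query string the request already has. The sketch below reproduces that append step in plain Java, assuming java.net.URLEncoder as a stand-in for the encoding that HttpClient's setQueryString(NameValuePair[]) performs (the exact escaping may differ); FormQuerySketch and its method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it does not use or replace HttpClient.
class FormQuerySketch {
    // URL-encode one name or value; UTF-8 is always available on the JVM.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Turn the form items into name=value pairs joined by '&'.
    static String encode(Map<String, String> formItems) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : formItems.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    // Same shape as the GET branch: keep the current query, then append ours.
    static String appendToQuery(String currentQuery, Map<String, String> formItems) {
        String newQuery = encode(formItems);
        boolean hasCurrent = currentQuery != null && !currentQuery.isEmpty();
        return (hasCurrent ? currentQuery + "&" : "") + newQuery;
    }

    public static void main(String[] args) {
        Map<String, String> items = new LinkedHashMap<>();
        items.put("login", "mylogin");
        items.put("password", "my pass");
        System.out.println(appendToQuery("a=1", items));
        // a=1&login=mylogin&password=my+pass
    }
}
```

A LinkedHashMap keeps the items in insertion order here; the HashMap used by HtmlFormCredential gives no such ordering guarantee.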

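The HttpAuthenticationCredential class provides Basic/Digest HTTP authentication. I will not walk through its source here; it is easy to follow by comparing it with the authentication logic of HtmlFormCredential.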
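As background on what the Basic scheme puts on the wire, here is a minimal sketch of building the Authorization header value as described in RFC 7617. It is not taken from HttpAuthenticationCredential, whose source is not shown in this article, and BasicAuthSketch is a hypothetical name. Digest authentication involves a server challenge and hashing and is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of Heritrix or HttpClient.
class BasicAuthSketch {
    // Basic scheme: "Basic " followed by base64("login:password").
    static String basicHeaderValue(String login, String password) {
        String pair = login + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // The login and password from the RFC 7617 example.
        System.out.println("Authorization: " + basicHeaderValue("Aladdin", "open sesame"));
        // Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    }
}
```

Because the header has to accompany each request to the protected realm, this kind of credential is the "offered on every visit" case that isEveryTime describes.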

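The official Heritrix 3.1.0 reference documentation gives examples of both authentication methods for the configuration file crawler-beans.cxml. (Some keywords in the official examples are wrong: the property names have to match the setters shown above, that is loginUri and formItems, and the bean id has to match the value-ref used in the credentials map.)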

<bean id="credentialStore"
   class="org.archive.modules.credential.CredentialStore">
     <property name="credentials">
       <map>
         <entry key="formCredential" value-ref="formCredential" />
       </map>
 </property>
</bean>
<bean id="formCredential"
   class="org.archive.modules.credential.HtmlFormCredential"> 
    <property name="domain" value="example.com" /> 
    <property name="loginUri" value="http://example.com/login"/> 
    <property name="formItems">
        <map>
            <entry key="login" value="mylogin"/>
            <entry key="password" value="mypassword"/>
            <entry key="submit" value="submit"/>
        </map>
    </property>
</bean>
<bean id="credential"
  class="org.archive.modules.credential.HttpAuthenticationCredential"> 
    <property name="domain"><value>domain</value></property> 
    <property name="realm"><value>myrealm</value></property> 
    <property name="login"><value>mylogin</value></property> 
    <property name="password"><value>mypassword</value></property> 
</bean>

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园, 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049042.html

posted on 2013-04-28 16:50 by 刺猬的温驯