君子博学而日参省乎己 则知明而行无过矣

博客园 首页 新随笔 联系 订阅 管理

上文分析了Apache Tika的编码识别相关接口和实现类

本文接着分析Apache Tika用到的一个关键类ParseContext,这里要明白Tika解析文档的方式,Tika将文件都解析为XHTML格式的文档,然后采用SAX基于事件的方式来解析这个XHTML格式,先来看看ParseContext类的源码:

public class ParseContext implements Serializable {

    /** Serial version UID. */
    private static final long serialVersionUID = -5921436862145826534L;

    /** Map of objects in this context */
    private final Map<String, Object> context = new HashMap<String, Object>();
 
    /**
     * Adds the given value to the context as an implementation of the given
     * interface.
     *
     * @param key the interface implemented by the given value
     * @param value the value to be added, or <code>null</code> to remove
     */
    public <T> void set(Class<T> key, T value) {
        if (value != null) {
            context.put(key.getName(), value);
        } else {
            context.remove(key.getName());
        }
    }

    /**
     * Returns the object in this context that implements the given interface.
     *
     * @param key the interface implemented by the requested object
     * @return the object that implements the given interface,
     *         or <code>null</code> if not found
     */
    @SuppressWarnings("unchecked")
    public <T> T get(Class<T> key) {
        return (T) context.get(key.getName());
    }

    /**
     * Returns the object in this context that implements the given interface,
     * or the given default value if such an object is not found.
     *
     * @param key the interface implemented by the requested object
     * @param defaultValue value to return if the requested object is not found
     * @return the object that implements the given interface,
     *         or the given default value if not found
     */
    public <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        if (value != null) {
            return value;
        } else {
            return defaultValue;
        }
    }

    /**
     * Returns the SAX parser specified in this parsing context. If a parser
     * is not explicitly specified, then one is created using the specified
     * or the default SAX parser factory.
     *
     * @see #getSAXParserFactory()
     * @since Apache Tika 0.8
     * @return SAX parser
     * @throws TikaException if a SAX parser could not be created
     */
    public SAXParser getSAXParser() throws TikaException {
        SAXParser parser = get(SAXParser.class);
        if (parser != null) {
            return parser;
        } else {
            try {
                return getSAXParserFactory().newSAXParser();
            } catch (ParserConfigurationException e) {
                throw new TikaException("Unable to configure a SAX parser", e);
            } catch (SAXException e) {
                throw new TikaException("Unable to create a SAX parser", e);
            }
        }
    }

    /**
     * Returns the SAX parser factory specified in this parsing context.
     * If a factory is not explicitly specified, then a default factory
     * instance is created and returned. The default factory instance is
     * configured to be namespace-aware and to use
     * {@link XMLConstants#FEATURE_SECURE_PROCESSING secure XML processing}.
     *
     * @since Apache Tika 0.8
     * @return SAX parser factory
     */
    public SAXParserFactory getSAXParserFactory() {
        SAXParserFactory factory = get(SAXParserFactory.class);
        if (factory == null) {
            factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            try {
                factory.setFeature(
                        XMLConstants.FEATURE_SECURE_PROCESSING, true);
            } catch (ParserConfigurationException e) {
            } catch (SAXNotSupportedException e) {
            } catch (SAXNotRecognizedException e) {
                // TIKA-271: Some XML parsers do not support the
                // secure-processing feature, even though it's required by
                // JAXP in Java 5. Ignoring the exception is fine here, as
                // deployments without this feature are inherently vulnerable
                // to XML denial-of-service attacks.
            }
        }
        return factory;
    }

}

从该类的源码可以看出,ParseContext类的主要作用是获取XML的SAX解析类SAXParser

如果了解JAXP,上面的源码是很容易看懂的,Tika是采用SAX方式解析XML格式文档的,SAXParserFactory为抽象类,具体采用的哪个实现类呢,待分析

 

posted on 2013-03-07 17:04  刺猬的温驯  阅读(756)  评论(0编辑  收藏  举报