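The previous post analyzed the interfaces and implementation classes that Apache Tika uses for encoding detection.
This post continues with ParseContext, a key class used by Apache Tika. It helps to first understand how Tika parses a document: every file is parsed into an XHTML document, and that XHTML is delivered in the event-based SAX style, as a stream of SAX events sent to a ContentHandler. Let's start with the source of the ParseContext class: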
public class ParseContext implements Serializable {

    /** Serial version UID. */
    private static final long serialVersionUID = -5921436862145826534L;

    /** Map of objects in this context */
    private final Map<String, Object> context = new HashMap<String, Object>();

    /**
     * Adds the given value to the context as an implementation of the given
     * interface.
     *
     * @param key the interface implemented by the given value
     * @param value the value to be added, or <code>null</code> to remove
     */
    public <T> void set(Class<T> key, T value) {
        if (value != null) {
            context.put(key.getName(), value);
        } else {
            context.remove(key.getName());
        }
    }

    /**
     * Returns the object in this context that implements the given interface.
     *
     * @param key the interface implemented by the requested object
     * @return the object that implements the given interface,
     *         or <code>null</code> if not found
     */
    @SuppressWarnings("unchecked")
    public <T> T get(Class<T> key) {
        return (T) context.get(key.getName());
    }

    /**
     * Returns the object in this context that implements the given interface,
     * or the given default value if such an object is not found.
     *
     * @param key the interface implemented by the requested object
     * @param defaultValue value to return if the requested object is not found
     * @return the object that implements the given interface,
     *         or the given default value if not found
     */
    public <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        if (value != null) {
            return value;
        } else {
            return defaultValue;
        }
    }

    /**
     * Returns the SAX parser specified in this parsing context. If a parser
     * is not explicitly specified, then one is created using the specified
     * or the default SAX parser factory.
     *
     * @see #getSAXParserFactory()
     * @since Apache Tika 0.8
     * @return SAX parser
     * @throws TikaException if a SAX parser could not be created
     */
    public SAXParser getSAXParser() throws TikaException {
        SAXParser parser = get(SAXParser.class);
        if (parser != null) {
            return parser;
        } else {
            try {
                return getSAXParserFactory().newSAXParser();
            } catch (ParserConfigurationException e) {
                throw new TikaException("Unable to configure a SAX parser", e);
            } catch (SAXException e) {
                throw new TikaException("Unable to create a SAX parser", e);
            }
        }
    }

    /**
     * Returns the SAX parser factory specified in this parsing context.
     * If a factory is not explicitly specified, then a default factory
     * instance is created and returned. The default factory instance is
     * configured to be namespace-aware and to use
     * {@link XMLConstants#FEATURE_SECURE_PROCESSING secure XML processing}.
     *
     * @since Apache Tika 0.8
     * @return SAX parser factory
     */
    public SAXParserFactory getSAXParserFactory() {
        SAXParserFactory factory = get(SAXParserFactory.class);
        if (factory == null) {
            factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            try {
                factory.setFeature(
                        XMLConstants.FEATURE_SECURE_PROCESSING, true);
            } catch (ParserConfigurationException e) {
            } catch (SAXNotSupportedException e) {
            } catch (SAXNotRecognizedException e) {
                // TIKA-271: Some XML parsers do not support the
                // secure-processing feature, even though it's required by
                // JAXP in Java 5. Ignoring the exception is fine here, as
                // deployments without this feature are inherently vulnerable
                // to XML denial-of-service attacks.
            }
        }
        return factory;
    }
}
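As the source shows, ParseContext is essentially a small map of context objects, keyed by the class name of the interface each object implements. Its main job here is to supply the SAX parser class, SAXParser: either the one registered in the context, or a new one created from the registered or default SAXParserFactory.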
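To make the lookup rule concrete, here is a minimal sketch that copies only the set/get logic from the source above into a stand-alone class. The names TypeKeyedContextDemo and TypeKeyedContext are hypothetical and are not part of Tika. The sketch only illustrates that values are stored under the exact class name passed as the key, and that setting null removes the entry.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;

public class TypeKeyedContextDemo {

    /** Hypothetical stand-in for ParseContext: same set/get logic, nothing else. */
    static class TypeKeyedContext {
        private final Map<String, Object> context = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            if (value != null) {
                context.put(key.getName(), value);   // stored under the class name
            } else {
                context.remove(key.getName());       // null means "remove"
            }
        }

        @SuppressWarnings("unchecked")
        <T> T get(Class<T> key) {
            return (T) context.get(key.getName());
        }

        <T> T get(Class<T> key, T defaultValue) {
            T value = get(key);
            return value != null ? value : defaultValue;
        }
    }

    public static void main(String[] args) {
        TypeKeyedContext ctx = new TypeKeyedContext();
        SAXParserFactory factory = SAXParserFactory.newInstance();

        // Nothing registered yet, so the lookup yields null.
        System.out.println(ctx.get(SAXParserFactory.class));

        // Register the factory under the abstract class it extends.
        ctx.set(SAXParserFactory.class, factory);
        System.out.println(ctx.get(SAXParserFactory.class) == factory);

        // Passing null removes the entry; the two-argument get falls back.
        ctx.set(SAXParserFactory.class, null);
        System.out.println(ctx.get(SAXParserFactory.class, factory) == factory);
    }
}
```

Because the key is the name of the class passed to set, an object registered under an interface is not found when looked up by its concrete class. This is why callers always use the interface or abstract class (for example SAXParserFactory.class) as the key.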
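Anyone familiar with JAXP will find the source above easy to follow. Tika parses XML documents through SAX. SAXParserFactory is an abstract class; which concrete implementation is actually used is left for later analysis.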
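Until that analysis is done, one simple way to check which concrete factory is in use is to ask JAXP at runtime. The sketch below is not Tika code and uses only the JDK. The class name SaxFactoryProbe and its helper methods are made up for illustration. It builds a factory the same way as the default branch of getSAXParserFactory(), prints the implementation class, and then parses a tiny XHTML string to show the event-based style. As far as I know, SAXParserFactory.newInstance() checks the javax.xml.parsers.SAXParserFactory system property and service-provider entries on the classpath before falling back to the JDK's built-in implementation, so the printed class name depends on the environment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFactoryProbe {

    /** Builds a factory like the default branch of getSAXParserFactory(). */
    static SAXParserFactory newDefaultFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (Exception e) {
            // Same tolerance as the Tika source: some parsers lack this feature.
        }
        return factory;
    }

    /** Parses an XHTML string and returns element local names in event order. */
    static List<String> elementNames(String xhtml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParser parser = newDefaultFactory().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Called once per opening tag, as the parser reaches it.
                    names.add(localName);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // The concrete class depends on the JDK and the classpath.
        System.out.println(newDefaultFactory().getClass().getName());

        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hello</p></body></html>";
        System.out.println(elementNames(xhtml)); // prints [html, body, p]
    }
}
```

Running this in the same environment and with the same classpath as the Tika application should show the factory that ParseContext would pick up by default, provided no SAXParserFactory has been registered in the context.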