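The previous post walked through the source code of the concrete parser class HtmlParser and how it parses web documents, and in the process covered how Apache Tika detects character encodings.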

(When parsing HTML, HtmlParser does not actually use the SAXParser object from the ParseContext class; it relies on a separate component, TagSoup.)

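This post goes on to examine the classes Tika uses to handle SAX events when parsing XML files. The most interesting parts are saved for later.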

SAX, which is part of JAXP, defines four event-handling interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler.

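It also provides a default handler class, DefaultHandler, which implements all four interfaces. This makes it easy to write our own SAX event handlers: extend DefaultHandler and override only the callbacks of interest.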
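As a minimal sketch of extending DefaultHandler, the class below overrides just two callbacks and leaves the rest to the inherited defaults. The class name ElementCounter, its helper methods, and the sample XML are assumptions made for illustration; they do not come from Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts elements and collects character data.
class ElementCounter extends DefaultHandler {
    private int elements;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++; // called once per opening tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per text node
    }

    int elements() { return elements; }

    String text() { return text.toString(); }

    // Parses the given XML string with a fresh handler and returns that handler.
    static ElementCounter parse(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        ElementCounter c = parse("<doc><p>Hello</p><p>Tika</p></doc>");
        System.out.println(c.elements() + " elements, text: " + c.text());
    }
}
```

Callbacks that are not overridden, such as endElement or ignorableWhitespace, fall back to the defaults inherited from DefaultHandler.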

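Apache Tika's event handler classes use the decorator pattern, with wrappers nested layer upon layer, which can be dizzying to read. The analysis below covers only some of these classes; the other event handlers work similarly and are not discussed further.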

First, the UML model of the key classes:

ContentHandlerDecorator extends DefaultHandler, the default SAX handler class from JAXP. As its name suggests, it applies the decorator pattern. Its source code follows:

/**
 * Decorator base class for the {@link ContentHandler} interface. This class
 * simply delegates all SAX events calls to an underlying decorated handler
 * instance. Subclasses can provide extra decoration by overriding one or more
 * of the SAX event methods.
 */
public class ContentHandlerDecorator extends DefaultHandler {

    /**
     * Decorated SAX event handler.
     */
    private ContentHandler handler;

    /**
     * Creates a decorator for the given SAX event handler.
     *
     * @param handler SAX event handler to be decorated
     */
    public ContentHandlerDecorator(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    /**
     * Creates a decorator that by default forwards incoming SAX events to
     * a dummy content handler that simply ignores all the events. Subclasses
     * should use the {@link #setContentHandler(ContentHandler)} method to
     * switch to a more usable underlying content handler.
     */
    protected ContentHandlerDecorator() {
        this(new DefaultHandler());
    }

    /**
     * Sets the underlying content handler. All future SAX events will be
     * directed to this handler instead of the one that was previously used.
     *
     * @param handler content handler
     */
    protected void setContentHandler(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        try {
            handler.startPrefixMapping(prefix, uri);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        try {
            handler.endPrefixMapping(prefix);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        try {
            handler.processingInstruction(target, data);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        handler.setDocumentLocator(locator);
    }

    @Override
    public void startDocument() throws SAXException {
        try {
            handler.startDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            handler.endDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void startElement(
            String uri, String localName, String name, Attributes atts)
            throws SAXException {
        try {
            handler.startElement(uri, localName, name, atts);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        try {
            handler.endElement(uri, localName, name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.ignorableWhitespace(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
        try {
            handler.skippedEntity(name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public String toString() {
        return handler.toString();
    }

    /**
     * Handle any exceptions thrown by methods in this class. This method
     * provides a single place to implement custom exception handling. The
     * default behaviour is simply to re-throw the given exception, but
     * subclasses can also provide alternative ways of handling the situation.
     *
     * @param exception the exception that was thrown
     * @throws SAXException the exception (if any) thrown to the client
     */
    protected void handleException(SAXException exception) throws SAXException {
        throw exception;
    }

}

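The decorator holds a reference to a ContentHandler object, and each of its SAX event methods simply calls the method of the same name on that ContentHandler. Any SAXException thrown by the delegate is passed to handleException, which rethrows it by default.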
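To make the delegation concrete, here is a small sketch of how a subclass can add behaviour by overriding a single event method. To stay runnable with the JDK alone, it uses a stripped-down stand-in, ForwardingHandler, in place of Tika's ContentHandlerDecorator. All class names here (DecoratorSketch, ForwardingHandler, UpperCaseHandler, TextCollector) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DecoratorSketch {

    // Stand-in for the decorator base class; forwards only characters() for brevity.
    static class ForwardingHandler extends DefaultHandler {
        private final ContentHandler handler;

        ForwardingHandler(ContentHandler handler) { this.handler = handler; }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            handler.characters(ch, start, length);
        }
    }

    // Concrete decorator: upper-cases the text, then lets the base class delegate.
    static class UpperCaseHandler extends ForwardingHandler {
        UpperCaseHandler(ContentHandler handler) { super(handler); }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length)
                    .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    // Innermost (decorated) handler: collects whatever text reaches it.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses the XML through UpperCaseHandler -> TextCollector and returns the text.
    static String upperCaseText(String xml) {
        TextCollector sink = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new UpperCaseHandler(sink));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseText("<p>hello tika</p>"));
    }
}
```

The parser only ever sees the outermost handler. Each layer decides whether to change, drop, or pass on an event before the next layer receives it, which is why such wrappers can be stacked freely.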

Next, the source code of a concrete decorator class, BodyContentHandler:

/**
 * Content handler decorator that only passes everything inside
 * the XHTML &lt;body/&gt; tag to the underlying handler. Note that
 * the &lt;body/&gt; tag itself is <em>not</em> passed on.
 */
public class BodyContentHandler extends ContentHandlerDecorator {

    /**
     * XHTML XPath parser.
     */
    private static final XPathParser PARSER =
        new XPathParser("xhtml", XHTMLContentHandler.XHTML);

    /**
     * The XPath matcher used to select the XHTML body contents.
     */
    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

    /**
     * Creates a content handler that passes all XHTML body events to the
     * given underlying content handler.
     *
     * @param handler content handler
     */
    public BodyContentHandler(ContentHandler handler) {
        super(new MatchingContentHandler(handler, MATCHER));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given writer.
     *
     * @param writer writer
     */
    public BodyContentHandler(Writer writer) {
        this(new WriteOutContentHandler(writer));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given output stream using the default encoding.
     *
     * @param stream output stream
     */
    public BodyContentHandler(OutputStream stream) {
        this(new WriteOutContentHandler(stream));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     * <p>
     * The internal string buffer is bounded at the given number of characters.
     * If this write limit is reached, then a {@link SAXException} is thrown.
     *
     * @since Apache Tika 0.7
     * @param writeLimit maximum number of characters to include in the string,
     *                   or -1 to disable the write limit
     */
    public BodyContentHandler(int writeLimit) {
        this(new WriteOutContentHandler(writeLimit));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     * <p>
     * The internal string buffer is bounded at 100k characters. If this write
     * limit is reached, then a {@link SAXException} is thrown.
     */
    public BodyContentHandler() {
        this(new WriteOutContentHandler());
    }

}

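Every overloaded constructor ultimately chains to BodyContentHandler(ContentHandler), which wraps the given handler in a MatchingContentHandler and passes it to the superclass constructor to initialize the decorated object.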
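The constructor chaining and the body-only filtering can be imitated with the JDK alone. This is only a rough sketch under stated assumptions: Tika selects the body content with an XPath matcher inside MatchingContentHandler, whereas the sketch substitutes a simple depth counter. The names BodyTextSketch, BodyOnlyHandler, and TextCollector are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextSketch {

    // Sink that buffers text; toString() exposes the buffer.
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() { return text.toString(); }
    }

    // Forwards only events inside <body>; the <body> tag itself is not passed on.
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler handler;
        private int depth; // 0 = outside <body>, 1 = directly inside it

        BodyOnlyHandler(ContentHandler handler) { this.handler = handler; }

        // Convenience constructor: chains to the main one with an internal buffer.
        BodyOnlyHandler() { this(new TextCollector()); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0) {
                depth++;
                handler.startElement(uri, localName, qName, atts);
            } else if ("body".equals(qName)) {
                depth = 1;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 1) {
                handler.endElement(uri, localName, qName);
            }
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth > 0) {
                handler.characters(ch, start, length);
            }
        }

        @Override
        public String toString() { return handler.toString(); }
    }

    // Parses the XHTML string and returns only the text found inside <body>.
    static String bodyText(String xhtml) {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(bodyText(
                "<html><head><title>T</title></head><body><p>Hello body</p></body></html>"));
    }
}
```

As in the real class, the no-argument constructor only picks a default sink and defers to the constructor that takes a ContentHandler, and toString() delegates down the chain so that callers can read the buffered text from the outermost handler.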

 

posted on 2013-03-08 02:38  刺猬的温驯