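A few years ago an expert wrote the series "深入浅出 jackrabbit" (roughly, "Jackrabbit Explained Simply"), available at http://ahuaxuan.iteye.com/category/65829

I learned a great deal from it; without its help, working out my own understanding of Jackrabbit would probably have taken much longer. That series was written some time ago against Jackrabbit 1.7, which differs somewhat from the current release. So, despite my admittedly shallow understanding, I could not resist writing down my own reading of the Apache Jackrabbit source code, in the hope of deepening my understanding of programming and perhaps helping those who come after.

(Note: this article may still be under revision; reposting it may mislead both you and your readers.)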

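In the current version, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.

This is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.

The source of LazyTextExtractorField is as follows: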

/**
 * <code>LazyTextExtractorField</code> implements a Lucene field with a String
 * value that is lazily initialized from a given {@link Reader}. In addition
 * this class provides a method to find out whether the purpose of the reader
 * is to extract text and whether the extraction process is already finished.
 *
 * @see #isExtractorFinished()
 */
public class LazyTextExtractorField extends AbstractField {

    /**
     * The logger instance for this class.
     */
    private static final Logger log =
        LoggerFactory.getLogger(LazyTextExtractorField.class);

    /**
     * The exception used to forcibly terminate the extraction process
     * when the maximum field length is reached.
     */
    private static final SAXException STOP =
        new SAXException("max field length reached");

    /**
     * The extracted text content of the given binary value.
     * Set to non-null when the text extraction task finishes.
     */
    private volatile String extract = null;

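    /**
     * Creates a new <code>LazyTextExtractorField</code> and schedules
     * the text extraction task on the given executor.
     *
     * @param parser the parser used to extract text from the binary value
     * @param value the binary value to extract text from
     * @param metadata the metadata passed to the parser
     * @param executor the executor that runs the extraction task
     * @param highlighting set to <code>true</code> to
     *                     enable result highlighting support
     * @param maxFieldLength the maximum number of characters to extract
     */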
    public LazyTextExtractorField(
            Parser parser, InternalValue value, Metadata metadata,
            Executor executor, boolean highlighting, int maxFieldLength) {
        super(FieldNames.FULLTEXT,
                highlighting ? Store.YES : Store.NO,
                Field.Index.ANALYZED,
                highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
        executor.execute(
                new ParsingTask(parser, value, metadata, maxFieldLength));
    }

    /**
     * Returns the extracted text. This method blocks until the text
     * extraction task has been completed.
     *
     * @return the string value of this field
     */
    public synchronized String stringValue() {
        try {
            while (!isExtractorFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            log.error("Text extraction thread was interrupted", e);
            return "";
        }
    }

    /**
     * @return always <code>null</code>
     */
    public Reader readerValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public byte[] binaryValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public TokenStream tokenStreamValue() {
        return null;
    }

    /**
     * Checks whether the text extraction task has finished.
     *
     * @return <code>true</code> if the extracted text is available
     */
    public boolean isExtractorFinished() {
        return extract != null;
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notify();
    }

    /**
     * Releases all resources associated with this field.
     */
    public void dispose() {
        // TODO: Cause the ContentHandler below to throw an exception
    }

    /**
     * The background task for extracting text from a binary value.
     */
    private class ParsingTask extends DefaultHandler implements Runnable {

        private final Parser parser;

        private final InternalValue value;

        private final Metadata metadata;

        private final int maxFieldLength;

        private final StringBuilder builder = new StringBuilder();

        public ParsingTask(
                Parser parser, InternalValue value, Metadata metadata,
                int maxFieldLength) {
            this.parser = parser;
            this.value = value;
            this.metadata = metadata;
            this.maxFieldLength = maxFieldLength;
        }

        public void run() {
            try {
                InputStream stream = value.getStream();
                try {
                    parser.parse(stream, this, metadata, new ParseContext());
                } finally {
                    stream.close();
                }
            } catch (Throwable t) {
                if (t != STOP) {
                    log.warn("Failed to extract text from a binary property", t);
                }
            } finally {
                value.discard();
            }
            setExtractedText(builder.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            builder.append(
                    ch, start,
                    Math.min(length, maxFieldLength - builder.length()));
            if (builder.length() >= maxFieldLength) {
                throw STOP;
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            characters(ch, start, length);
        }

    }

}

 

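As the code shows, text extraction from rich documents is handled by the inner class ParsingTask, a Runnable that the constructor submits to an Executor. Extraction therefore runs asynchronously, and stringValue() blocks until the extracted text is available.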
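
The hand-off between the worker and the reader can be sketched without Lucene or Tika. This is a minimal illustration, not Jackrabbit code: the class names LazyValue and LazyValueDemo are hypothetical, and trimming a string stands in for the slow text extraction.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for LazyTextExtractorField: the value is computed on a worker thread. */
class LazyValue {

    /** Set to non-null once the background task has finished. */
    private volatile String extract = null;

    LazyValue(Executor executor, final String input) {
        // Trimming is only a placeholder for the real (slow) extraction work.
        executor.execute(() -> setExtractedText(input.trim()));
    }

    /** Blocks until the background task has published its result. */
    synchronized String stringValue() {
        try {
            while (!isFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
    }

    boolean isFinished() {
        return extract != null;
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notifyAll(); // wake up any reader waiting in stringValue()
    }
}

class LazyValueDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            LazyValue field = new LazyValue(pool, "  hello  ");
            System.out.println(field.stringValue());
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch uses notifyAll() where the original calls notify(); with a single waiting reader the two behave the same.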

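This task class also extends DefaultHandler, which implements four interfaces: EntityResolver, DTDHandler, ContentHandler and ErrorHandler. This is the default adapter pattern: DefaultHandler supplies do-nothing implementations of every callback, so a subclass only has to override the methods of the target interface that it actually needs.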
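
As a small illustration of the default adapter idea, the sketch below overrides only characters() and inherits every other callback from DefaultHandler. The class name TextCollector and its extract() helper are hypothetical; the JDK's built-in SAX parser is used in place of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data and ignores all other SAX events. */
class TextCollector extends DefaultHandler {

    private final StringBuilder builder = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // The only callback we care about; the rest keep their empty defaults.
        builder.append(ch, start, length);
    }

    String text() {
        return builder.toString();
    }

    /** Parses the given XML string and returns its concatenated text content. */
    static String extract(String xml) {
        try {
            TextCollector collector = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
            return collector.text();
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
    }
}
```

Calling TextCollector.extract("<doc><p>Hello </p><p>world</p></doc>") returns the text of both paragraphs joined together.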

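The SAX parsing model in the JAXP specification is event-driven: the parser reports each piece of the document to registered listeners. The most important of the interfaces above is ContentHandler, which ParsingTask implements indirectly; it appends the text it receives, piece by piece, to its field private final StringBuilder builder = new StringBuilder()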

At the end of its run() method, the task publishes the collected text by calling setExtractedText(builder.toString())
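
The listing also shows how extraction is cut off at maxFieldLength: characters() throws a shared STOP exception once the limit is reached, and run() ignores that particular exception and still publishes whatever was collected. A minimal sketch of the same idea follows, with the JDK's SAX parser standing in for Tika; the class name BoundedCollector and its extract() helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that stops parsing once a maximum length is reached. */
class BoundedCollector extends DefaultHandler {

    /** Shared sentinel used to abort parsing deliberately. */
    private static final SAXException STOP =
        new SAXException("max field length reached");

    private final int maxLength;
    private final StringBuilder builder = new StringBuilder();

    BoundedCollector(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Append only as many characters as still fit under the limit.
        builder.append(ch, start, Math.min(length, maxLength - builder.length()));
        if (builder.length() >= maxLength) {
            throw STOP;
        }
    }

    /** Returns at most maxLength characters of text from the given XML. */
    static String extract(String xml, int maxLength) {
        BoundedCollector collector = new BoundedCollector(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (SAXException e) {
            if (e != STOP) { // a real parse failure, not our deliberate abort
                throw new IllegalStateException("Unable to parse XML", e);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
        // Partial text is still returned after a deliberate abort.
        return collector.builder.toString();
    }
}
```

The identity check against STOP assumes the parser rethrows the handler's exception unchanged, which is also what the original run() method relies on.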

Note that for the parser object Jackrabbit does not use a class from Apache Tika directly; it wraps Tika in its own JackrabbitParser class.

The source of JackrabbitParser is as follows:

/**
 * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
 * for all parsing requests, but sets it up with Jackrabbit-specific
 * configuration and implements backwards compatibility support for old
 * <code>textExtractorClasses</code> configurations.
 *
 * @since Apache Jackrabbit 2.0
 */
class JackrabbitParser implements Parser {

    /**
     * Logger instance.
     */
    private static final Logger logger =
        LoggerFactory.getLogger(JackrabbitParser.class);

    /**
     * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
     */
    private static volatile boolean blocked = false;

    /**
     * The configured Tika parser.
     */
    private final AutoDetectParser parser;

    /**
     * Creates a parser using the default Jackrabbit-specific configuration
     * settings.
     */
    public JackrabbitParser() {
        InputStream stream =
            JackrabbitParser.class.getResourceAsStream("tika-config.xml");
        try {
            if (stream != null) {
                try {
                    parser = new AutoDetectParser(new TikaConfig(stream));
                } finally {
                    stream.close();
                }
            } else {
                parser = new AutoDetectParser();
            }
        } catch (Exception e) {
            // Should never happen
            throw new RuntimeException(
                    "Unable to load embedded Tika configuration", e);
        }
    }

    /**
     * Backwards compatibility method to support old Jackrabbit 1.x
     * <code>textExtractorClasses</code> configurations. Implements a best
     * effort mapping from the old-style text extractor classes to
     * corresponding Tika parsers.
     *
     * @param classes configured list of text extractor classes
     */
    public void setTextFilterClasses(String classes) {
        Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put(MediaType.text("html"), new HtmlParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("msexcel"), parser);
                parsers.put(MediaType.application("excel"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                    || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("powerpoint"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser); 
                parsers.put(MediaType.application("msword"), parser);
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                Parser parser = new OpenDocumentParser();
                parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                parsers.put(MediaType.application("pdf"), new PDFParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                Parser parser = new ImageParser();
                parsers.put(MediaType.image("png"), parser);
                parsers.put(MediaType.image("apng"), parser);
                parsers.put(MediaType.image("mng"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                Parser parser = new RTFParser();
                parsers.put(MediaType.application("rtf"), parser);
                parsers.put(MediaType.text("rtf"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                Parser parser = new XMLParser();
                parsers.put(MediaType.APPLICATION_XML, parser);
                parsers.put(MediaType.text("xml"), parser);
            } else {
                logger.warn("Ignoring unknown text extractor class: {}", name);
            }
        }

        parser.setParsers(parsers);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return parser.getSupportedTypes(context);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        waitIfBlocked();
        parser.parse(stream, handler, metadata, context);
    }

    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }

    /**
     * Waits until text extraction is no longer blocked. The block is only
     * ever activated in the Jackrabbit test suite when testing delayed
     * text extraction.
     *
     * @throws TikaException if the block was interrupted
     */
    private synchronized static void waitIfBlocked() throws TikaException {
        try {
            while (blocked) {
                JackrabbitParser.class.wait();
            }
        } catch (InterruptedException e) {
            throw new TikaException("Text extraction block interrupted", e);
        }
    }

    /**
     * Blocks all text extraction tasks.
     */
    static synchronized void block() {
        blocked = true;
    }

    /**
     * Unblocks all text extraction tasks.
     */
    static synchronized void unblock() {
        blocked = false;
        JackrabbitParser.class.notifyAll();
    }

}

 

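The actual parsing is delegated to the AutoDetectParser class. If you have read my earlier notes on the Apache Tika source code, you will know that AutoDetectParser extends CompositeParser, and that CompositeParser does the actual parsing by calling the parsers in its collection. This is the composite pattern (the top-down, "safe" variant, in which only the composite manages children).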
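
The shape of that composite can be sketched in plain Java. The interface and class names below (MiniParser, MiniCompositeParser) are hypothetical and much simpler than Tika's real Parser API; they only illustrate the "safe" variant, where child management such as setParsers() lives on the composite and not on the shared interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical component interface: leaves and composites both implement it. */
interface MiniParser {
    String parse(String mediaType, String content);
}

/** Hypothetical composite: picks a child parser by media type and delegates to it. */
class MiniCompositeParser implements MiniParser {

    private final Map<String, MiniParser> parsers = new HashMap<>();

    /** Used when no child parser is registered for a media type. */
    private final MiniParser fallback;

    MiniCompositeParser(MiniParser fallback) {
        this.fallback = fallback;
    }

    /** Child management exists only on the composite ("safe" composite). */
    void setParsers(Map<String, MiniParser> newParsers) {
        parsers.clear();
        parsers.putAll(newParsers);
    }

    @Override
    public String parse(String mediaType, String content) {
        MiniParser child = parsers.get(mediaType);
        return (child != null ? child : fallback).parse(mediaType, content);
    }
}

class MiniCompositeDemo {
    public static void main(String[] args) {
        Map<String, MiniParser> children = new HashMap<>();
        children.put("text/plain", (type, content) -> content.trim());
        children.put("text/html", (type, content) -> content.replaceAll("<[^>]*>", ""));

        MiniCompositeParser parser = new MiniCompositeParser((type, content) -> "");
        parser.setParsers(children);

        System.out.println(parser.parse("text/html", "<b>hi</b>"));
    }
}
```

Callers hold only a MiniParser and never need to know whether it is a single parser or a composite, which mirrors how LazyTextExtractorField receives a Parser without caring what is behind it.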

---------------------------------------------------------------------------

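This Apache Jackrabbit source code study series is my original work.

Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html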

posted on 2013-04-06 18:09 by 刺猬的温驯