Parsing Word files with Tika

Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Parsing .doc files

Requires poi-scratchpad/3.7.jar

POI-HWPF - A Quick Guide

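Basic text extraction

Basic extraction goes through org.apache.poi.hwpf.extractor.WordExtractor, whose constructor accepts either of two inputs: an InputStream or an HWPFDocument.

getText() returns all of the text in the document.

getParagraphText() returns the text of each paragraph in turn.

getTextFromPieces() reads the text straight from the document's text pieces. It is very fast, but it tends to return content that is not actual text from the page.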
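As a minimal sketch of basic extraction, the snippet below wraps org.apache.poi.hwpf.extractor.WordExtractor around an input stream and prints the results of getText() and getParagraphText(). The class name DocTextDump, the helper readParagraphs, and the command-line path argument are placeholders invented for illustration; the sketch assumes the poi and poi-scratchpad jars are on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical demo class, not part of POI.
class DocTextDump {

    // Returns the text of each paragraph found in a binary .doc stream.
    static String[] readParagraphs(InputStream in) throws IOException {
        WordExtractor extractor = new WordExtractor(in);
        return extractor.getParagraphText();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .doc (not .docx) file.
        try (InputStream in = new FileInputStream(args[0])) {
            WordExtractor extractor = new WordExtractor(in);

            // Whole document as one string.
            System.out.println(extractor.getText());

            // One entry per paragraph; trim() drops the trailing paragraph mark.
            String[] paragraphs = extractor.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                System.out.println(i + ": " + paragraphs[i].trim());
            }
        }
    }
}
```

Passing the stream directly is the shorter route; the HWPFDocument constructor is convenient when the same document is also needed for the property-level access described next.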

Extracting specific text properties

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

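Step 1: create an HWPFDocument

Step 2: get the Range

getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.

int numParagraphs(): Used to get the number of paragraphs in a range.

int numSections(): Used to get the number of sections in a range (a "section" here is the Word section created with Insert > Break).

Step 3: get the paragraphs

getParagraph(int index): returns the paragraph at the given index in the range.

text(): returns the text of a paragraph or character run.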
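The counting methods from step 2 can be combined with Range.getSection(int) to see how paragraphs are distributed across sections. This is only a sketch: the class DocOutline and the method paragraphsPerSection are hypothetical names, and the input path is assumed to come from the command line.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

// Hypothetical demo class, not part of POI.
class DocOutline {

    // Returns the paragraph count of every section in a binary .doc stream.
    static int[] paragraphsPerSection(InputStream in) throws IOException {
        HWPFDocument doc = new HWPFDocument(in);
        Range range = doc.getRange(); // whole document, without headers and footers
        int[] counts = new int[range.numSections()];
        for (int s = 0; s < counts.length; s++) {
            // A Section is itself a Range, so numParagraphs() works on it too.
            Section section = range.getSection(s);
            counts[s] = section.numParagraphs();
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            int[] counts = paragraphsPerSection(in);
            for (int s = 0; s < counts.length; s++) {
                System.out.println("Section " + s + ": " + counts[s] + " paragraphs");
            }
        }
    }
}
```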

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class ParseWordDocTest {

    public static void main(String[] args) throws Exception {
        // HWPFDocument reads only the binary .doc format; a .docx file needs XWPFDocument
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc")) {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Visit every character run (a span of uniformly formatted text)
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor());
                    // HWPF reports the font size in half-points
                    System.out.println("Font size " + run.getFontSize());
                    System.out.println("Font Name " + run.getFontName());
                    // bold, italic, underline
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode());
                    System.out.println("Text is " + run.text());
                }
            }
        }
    }
}

Parsing .docx files

Requires poi-ooxml/3.7.jar

http://poi.apache.org/document/quick-guide-xwpf.html

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    /**
     * Prints the formatting and text of every run in a .docx file.
     */
    public static void main(String[] args) throws Exception {
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx")) {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                // A run is a span of text that shares one set of formatting
                for (XWPFRun r : para.getRuns()) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    // Text at position 0 of the run; may be null for runs without text
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold? " + r.isBold());
                    System.out.println("Italic? " + r.isItalic());
                }
            }
        }
    }

}
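Because HWPFDocument only accepts the binary .doc format and XWPFDocument only accepts .docx, it can help to check which container a file really is before choosing a class. The sketch below uses only the JDK and compares the first bytes of the file against the OLE2 signature (used by .doc) and the ZIP signature (used by .docx). The class WordFormatSniffer and its Format enum are hypothetical names. The check identifies only the container, so other OLE2 files (such as .xls) and other ZIP archives match as well.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of POI.
class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature "PK\3\4" at the start of a ZIP archive (.docx).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a file by its leading bytes.
    static Format detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    // Reads up to eight bytes from the stream and classifies them.
    static Format detect(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) break; // end of stream before eight bytes were available
            read += n;
        }
        return detect(Arrays.copyOf(head, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the file to inspect.
        try (InputStream in = new FileInputStream(args[0])) {
            switch (detect(in)) {
                case OLE2: System.out.println("OLE2 container: open with HWPFDocument"); break;
                case ZIP: System.out.println("ZIP container: open with XWPFDocument"); break;
                default: System.out.println("Not a recognised Word container");
            }
        }
    }
}
```

Note that detect(InputStream) consumes the bytes it reads, so the file should be reopened (or the stream reset, if it supports that) before it is handed to POI.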

posted on 2014-03-26 10:25 ywf—java
