Parsing Word files with Tika
Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files
http://poi.apache.org/document/
http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.poi/poi-scratchpad/3.7
http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.7
Parsing .doc files
Requires poi-scratchpad/3.7.jar
POI-HWPF - A Quick Guide
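Basic text extraction
Use org.apache.poi.hwpf.extractor.WordExtractor. Its constructor accepts either an InputStream or an HWPFDocument.
getText() returns the text of the whole document.
getParagraphText() returns the text of each paragraph in turn, as a String array.
getTextFromPieces() reads the text directly from the document's text pieces without using paragraph information. It is fast, but it may also return content that is not visible text on the page. It does not return the text page by page.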
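The extraction calls described above can be sketched roughly as follows. This is a minimal sketch rather than a reference implementation: it assumes poi-scratchpad and its POI dependencies are on the classpath and that args[0] points to a binary .doc file. The class name DocTextDump and the helper stripLineEnd are hypothetical names introduced only for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

/** Sketch only: dumps the text of a .doc file with WordExtractor. */
class DocTextDump {

    /** Hypothetical helper: drops any trailing CR/LF characters from a paragraph string. */
    static String stripLineEnd(String paragraph) {
        return paragraph.replaceAll("[\\r\\n]+\\z", "");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a placeholder for the path to a binary .doc file.
        InputStream in = new FileInputStream(args[0]);
        try {
            WordExtractor extractor = new WordExtractor(in);

            // Whole document as a single string.
            System.out.println(extractor.getText());

            // One string per paragraph.
            String[] paragraphs = extractor.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                System.out.println(i + ": " + stripLineEnd(paragraphs[i]));
            }
        } finally {
            in.close();
        }
    }
}
```

Stripping the line ending is a design choice made here so that each paragraph prints on one line; drop the helper if the raw strings are needed.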
Extracting specific text and its properties
To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.
Step 1: create the HWPFDocument
Step 2: get the Range
getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.
int numParagraphs(): Used to get the number of paragraphs in a range.
int numSections(): Used to get the number of sections in a range. (A "section" here is the Word section, the one created through Insert > Break.)
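Step 3: get the paragraphs
getParagraph(int index): returns the paragraph at the given index within the range.
text(): returns the text of a paragraph or character run. The method on Range and its subclasses is named text(), not getText().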
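    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.CharacterRun;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;

    public class ParseWordDocTest {
        public static void main(String[] args) throws Exception {
            // HWPFDocument reads only the binary .doc format, so the input must be a .doc file.
            InputStream istream = new FileInputStream(
                    "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc");
            try {
                HWPFDocument doc = new HWPFDocument(istream);
                // The range covers the whole document, excluding headers and footers.
                Range range = doc.getRange();
                for (int i = 0; i < range.numParagraphs(); i++) {
                    Paragraph poiPara = range.getParagraph(i);
                    // Iterate by run count instead of comparing end offsets in an open-ended loop.
                    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                        CharacterRun run = poiPara.getCharacterRun(j);
                        System.out.println("Color " + run.getColor());         // color
                        System.out.println("Font size " + run.getFontSize());  // font size
                        System.out.println("Font Name " + run.getFontName());  // font name
                        // bold, italic, underline
                        System.out.println(run.isBold() + " " + run.isItalic() + " "
                                + run.getUnderlineCode());
                        System.out.println("Text is " + run.text());           // text content
                    }
                }
            } finally {
                istream.close();
            }
        }
    }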
Parsing .docx files
Requires poi-ooxml/3.7.jar
http://poi.apache.org/document/quick-guide-xwpf.html
    package test;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;

    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    public class ParseWordDocxTest {
        public static void main(String[] args) throws Exception {
            InputStream istream = new FileInputStream(
                    "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx");
            try {
                XWPFDocument docx = new XWPFDocument(istream);
                List<XWPFParagraph> paragraphs = docx.getParagraphs();
                for (XWPFParagraph para : paragraphs) {
                    List<XWPFRun> runs = para.getRuns();
                    for (XWPFRun r : runs) {
                        System.out.println("Font color: " + r.getColor());
                        System.out.println("Font name: " + r.getFontFamily());
                        System.out.println("Font size: " + r.getFontSize());
                        // The original counter was reset for every run, so it always read position 0.
                        System.out.println("Text: " + r.getText(0));
                        System.out.println("Bold? " + r.isBold());
                        System.out.println("Italic? " + r.isItalic());
                    }
                }
            } finally {
                istream.close();
            }
        }
    }
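The two sections above use different classes because the file formats differ: a .doc file is an OLE2 binary container, read by HWPFDocument, while a .docx file is a ZIP package, read by XWPFDocument. When the file extension cannot be trusted, one possible approach is to look at the first bytes of the file before choosing a parser. The sketch below uses only the JDK. The class WordFormatSniffer and its enum are hypothetical names for this illustration, and a match only identifies the container type: other Office formats share the same signatures, so it does not prove the file is a Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Sketch only: guesses the container type of a file from its leading bytes. */
class WordFormatSniffer {

    /** OLE2 suggests HWPF (.doc); ZIP suggests XWPF (.docx). */
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of a ZIP archive ("PK\3\4").
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a buffer holding the first bytes of a file. */
    static Container detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. The bytes are consumed. */
    static Container detect(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        return detect(Arrays.copyOf(head, read));
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the start of a .docx file.
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detect(new ByteArrayInputStream(zipStart)));
    }
}
```

Because the stream variant consumes the bytes it inspects, the caller would need to reopen the file, or wrap the stream so it can be reset, before handing it to HWPFDocument or XWPFDocument.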