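Reading complex tables from Word 2007 (.docx) with POI
At work I recently needed to read the tables in a Word (.docx) file and output them as HTML. I used POI for this.
For documents from Word 2007 onward you need poi-ooxml-xxx.jar and its dependencies on the classpath (the original post showed the Maven pom.xml here).
For a simple table, the content of every cell can be read like this: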
XWPFDocument document = new XWPFDocument(new FileInputStream("word.docx"));
// Get all tables in the document
List<XWPFTable> tables = document.getTables();
for (XWPFTable table : tables) {
    // Get the rows of the table
    List<XWPFTableRow> rows = table.getRows();
    for (XWPFTableRow row : rows) {
        // Get the cells of the row
        List<XWPFTableCell> tableCells = row.getTableCells();
        for (XWPFTableCell cell : tableCells) {
            // Get the text content of the cell
            String text = cell.getText();
        }
    }
}
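This does not work properly for complex tables, that is, tables containing merged cells: the loop above tells you nothing about which cells are merged.
So I kept searching and found the following code on Stack Overflow, which generates a table with merged cells: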
import java.io.FileOutputStream;
import java.math.BigInteger;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTblWidth;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTcPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTVMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STTblWidth;

public class CreateWordTableMerge {

    static void mergeCellVertically(XWPFTable table, int col, int fromRow, int toRow) {
        for (int rowIndex = fromRow; rowIndex <= toRow; rowIndex++) {
            CTVMerge vmerge = CTVMerge.Factory.newInstance();
            if (rowIndex == fromRow) {
                // The first merged cell is set with RESTART merge value
                vmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one are set with CONTINUE
                vmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(rowIndex).getCell(col);
            // Try getting the TcPr instead of simply setting a new one every time
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setVMerge(vmerge);
            } else {
                // Only set a new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setVMerge(vmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    static void mergeCellHorizontally(XWPFTable table, int row, int fromCol, int toCol) {
        for (int colIndex = fromCol; colIndex <= toCol; colIndex++) {
            CTHMerge hmerge = CTHMerge.Factory.newInstance();
            if (colIndex == fromCol) {
                // The first merged cell is set with RESTART merge value
                hmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one are set with CONTINUE
                hmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(row).getCell(colIndex);
            // Try getting the TcPr instead of simply setting a new one every time
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setHMerge(hmerge);
            } else {
                // Only set a new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setHMerge(hmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument();
        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun run = paragraph.createRun();
        run.setText("The table:");

        // Create a 3 x 5 table and label every cell
        XWPFTable table = document.createTable(3, 5);
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 5; col++) {
                table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
            }
        }

        // Create and set column widths for all columns in all rows.
        // Most examples don't set the type of the CTTblWidth, but this
        // is necessary for working in all Office versions.
        for (int col = 0; col < 5; col++) {
            CTTblWidth tblWidth = CTTblWidth.Factory.newInstance();
            tblWidth.setW(BigInteger.valueOf(1000));
            tblWidth.setType(STTblWidth.DXA);
            for (int row = 0; row < 3; row++) {
                CTTcPr tcPr = table.getRow(row).getCell(col).getCTTc().getTcPr();
                if (tcPr != null) {
                    tcPr.setTcW(tblWidth);
                } else {
                    tcPr = CTTcPr.Factory.newInstance();
                    tcPr.setTcW(tblWidth);
                    table.getRow(row).getCell(col).getCTTc().setTcPr(tcPr);
                }
            }
        }

        // Use the merge methods
        mergeCellVertically(table, 0, 0, 1);
        mergeCellHorizontally(table, 1, 2, 3);
        mergeCellHorizontally(table, 2, 1, 4);

        paragraph = document.createParagraph();

        // Write the document and close the stream when done
        try (FileOutputStream out = new FileOutputStream("create_table.docx")) {
            document.write(out);
        }
        System.out.println("create_table.docx written successfully");
    }
}
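Running it does produce the expected table, but I was still lost: I had no idea what cTTc, tcPr, vMerge and the rest actually were.
That changed once I learned about Office Open XML (OOXML). A .docx file is a zip archive: rename the extension to .zip and you can open it with any archive tool. Inside is a word folder, and the document.xml in that folder holds the body of the Word document.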
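To make the unzip step concrete, here is a minimal sketch that pulls word/document.xml out of a .docx using only the JDK's java.util.zip classes. The class name DocxXmlPeek and the helper sampleArchive are made up for illustration; sampleArchive just builds a tiny in-memory archive so the example runs without a real Word file. For an actual document you would pass new FileInputStream("word.docx") instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical class name, used only for this illustration.
class DocxXmlPeek {

    // Returns the text of word/document.xml, or null if the archive has no such entry.
    static String readDocumentXml(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes stops at the end of the current zip entry
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: builds a one-entry archive in memory so the demo needs no real file.
    static byte[] sampleArchive(String documentXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleArchive("<w:document/>");
        System.out.println(readDocumentXml(new ByteArrayInputStream(archive)));
    }
}
```

Printing the result for a real file is a quick way to see the raw XML that POI's CT* classes are wrapping.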
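For a Word table with cells merged across rows (a vertical merge; the original post showed a screenshot of the table here), the corresponding XML is: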
<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="a3"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2765"/>
    <w:gridCol w:w="2765"/>
  </w:tblGrid>
  <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
        <w:vMerge w:val="restart"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4" w:rsidP="00915802">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,1</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
        <w:vMerge/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4"/>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,1</w:t>
        </w:r>
        <w:bookmarkStart w:id="0" w:name="_GoBack"/>
        <w:bookmarkEnd w:id="0"/>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
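With the XML in front of you, the tc, tcPr and vMerge names from the earlier code should now make sense.
w:tr is one row of the table, w:tc is one cell, and w:tcPr holds the properties of that cell.
For details, see: http://www.datypic.com/sc/ooxml/e-w_tbl-1.html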
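The element structure of the table XML can also be walked directly with the JDK's DOM parser, which may help when checking what POI will see. The following is only a sketch: TableXmlWalk, describeCells and SAMPLE are hypothetical names, the sample XML is a trimmed-down two-by-two table with a vertical merge in the first column and the w namespace declared on the root element, and nested tables are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class TableXmlWalk {
    // Namespace bound to the "w" prefix in WordprocessingML documents.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Trimmed sample: 2 x 2 table whose first column is merged vertically.
    static final String SAMPLE =
          "<w:tbl xmlns:w=\"" + W + "\">"
        + "<w:tr>"
        + "<w:tc><w:tcPr><w:vMerge w:val=\"restart\"/></w:tcPr><w:p/></w:tc>"
        + "<w:tc><w:p/></w:tc>"
        + "</w:tr>"
        + "<w:tr>"
        + "<w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>"
        + "<w:tc><w:p/></w:tc>"
        + "</w:tr>"
        + "</w:tbl>";

    // Direct child elements of parent with the given local name in the w namespace.
    static List<Element> children(Element parent, String localName) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // One line per <w:tc>: "rowIndex,cellIndex vMerge=<state> gridSpan=<n>".
    static List<String> describeCells(String tblXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element tbl = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tblXml)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            List<Element> rows = children(tbl, "tr");
            for (int r = 0; r < rows.size(); r++) {
                List<Element> cells = children(rows.get(r), "tc");
                for (int c = 0; c < cells.size(); c++) {
                    String vMerge = "none";
                    int gridSpan = 1;
                    for (Element tcPr : children(cells.get(c), "tcPr")) {
                        for (Element vm : children(tcPr, "vMerge")) {
                            // <w:vMerge/> without w:val marks a continuation cell
                            String val = vm.getAttributeNS(W, "val");
                            vMerge = val.isEmpty() ? "continue" : val;
                        }
                        for (Element gs : children(tcPr, "gridSpan")) {
                            gridSpan = Integer.parseInt(gs.getAttributeNS(W, "val"));
                        }
                    }
                    out.add(r + "," + c + " vMerge=" + vMerge + " gridSpan=" + gridSpan);
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse table XML", e);
        }
    }

    public static void main(String[] args) {
        describeCells(SAMPLE).forEach(System.out::println);
    }
}
```

For the sample, the first line printed is "0,0 vMerge=restart gridSpan=1" and the third is "1,0 vMerge=continue gridSpan=1". Note that the second number is the index of the w:tc element within its row, which differs from the grid column once a row contains a cell with gridSpan greater than 1.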
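Here is the column-merge case as well (a horizontal merge, where one cell spans several columns); you can use it to check your understanding.
The corresponding XML: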
<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="a3"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2765"/>
    <w:gridCol w:w="2765"/>
  </w:tblGrid>
  <w:tr w:rsidR="006C0A9A" w:rsidTr="006C099A">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="5530" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr w:rsidR="006C0A9A" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,1</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
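From these observations, the rules can be summarized as follows (using the methods POI provides):
Row merge (vertical merge, w:vMerge):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr(); // The cell's properties (tableCell.cellProperty); check for null, since a cell may have no w:tcPr
If the cell is the first (top) cell of the merge: STMerge.RESTART.equals(tcpr.getVMerge().getVal()) (compare with equals, not with == on the string "restart")
If the cell is one of the following cells of the merge: tcpr.getVMerge().getVal() == null when the XML is <w:vMerge/> as above; a file may also write the value explicitly, in which case getVal() is STMerge.CONTINUE
If the cell is not part of a row merge: tcpr.getVMerge() == null
Column merge (horizontal merge, w:gridSpan):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr();
If the cell is the merged cell: tcpr.getGridSpan().getVal() returns the number of columns the cell spans
Any other cell: tcpr.getGridSpan() == null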
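Putting the two rules together, one possible way to turn the merge markers into HTML rowspan and colspan attributes is sketched below. It deliberately works on a small stand-in model so that it runs without POI: MergedTableHtml, Cell, VMerge, toHtml and sample are hypothetical names, not POI classes. In real code the three fields of Cell would be filled from cell.getText(), tcpr.getVMerge() and tcpr.getGridSpan(). Cell text is not HTML-escaped here, which a real converter would need to do.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class MergedTableHtml {
    enum VMerge { NONE, RESTART, CONTINUE }

    // Stand-in for one <w:tc>: its text, vertical-merge state and gridSpan (1 when absent).
    static final class Cell {
        final String text;
        final VMerge vMerge;
        final int gridSpan;

        Cell(String text, VMerge vMerge, int gridSpan) {
            this.text = text;
            this.vMerge = vMerge;
            this.gridSpan = gridSpan;
        }
    }

    // True if the cell that starts at grid column gridCol in this row continues a vertical merge.
    static boolean continuesAt(List<Cell> row, int gridCol) {
        int col = 0;
        for (Cell cell : row) {
            if (col == gridCol) {
                return cell.vMerge == VMerge.CONTINUE;
            }
            col += cell.gridSpan;
        }
        return false;
    }

    static String toHtml(List<List<Cell>> rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (int r = 0; r < rows.size(); r++) {
            html.append("<tr>");
            int gridCol = 0; // grid column where the current cell starts
            for (Cell cell : rows.get(r)) {
                // Continuation cells are covered by the rowspan of the cell above, so skip them
                if (cell.vMerge != VMerge.CONTINUE) {
                    int rowspan = 1;
                    if (cell.vMerge == VMerge.RESTART) {
                        // Count the rows directly below that continue this merge
                        while (r + rowspan < rows.size()
                                && continuesAt(rows.get(r + rowspan), gridCol)) {
                            rowspan++;
                        }
                    }
                    html.append("<td");
                    if (rowspan > 1) {
                        html.append(" rowspan=\"").append(rowspan).append('"');
                    }
                    if (cell.gridSpan > 1) {
                        html.append(" colspan=\"").append(cell.gridSpan).append('"');
                    }
                    html.append('>').append(cell.text).append("</td>");
                }
                gridCol += cell.gridSpan;
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    // Sample: A spans rows 0-1 in column 0, D spans both columns in row 2.
    static List<List<Cell>> sample() {
        return Arrays.asList(
                Arrays.asList(new Cell("A", VMerge.RESTART, 1), new Cell("B", VMerge.NONE, 1)),
                Arrays.asList(new Cell("", VMerge.CONTINUE, 1), new Cell("C", VMerge.NONE, 1)),
                Arrays.asList(new Cell("D", VMerge.NONE, 2)));
    }

    public static void main(String[] args) {
        System.out.println(toHtml(sample()));
    }
}
```

For the sample, the first row comes out as <tr><td rowspan="2">A</td><td>B</td></tr>, the second row contains only C, and the third row is a single <td colspan="2">D</td>. Tracking the grid column rather than the cell index is what keeps the rowspan lookup correct when a row above or below contains a cell with gridSpan greater than 1.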
Here is a demo that reads table content and converts it to HTML, for reference: https://github.com/zavier/ReadWordTable
You are also welcome to follow my new blog: https://zhengw-tech.com/