Converting Word documents to HTML with POI
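At the moment there are four ways to display a Word document in an HTML page:
POI (fairly cumbersome)
A plugin (paid, or limited to 500 calls per day; not suitable for large companies)
Convert the Word file to PDF and show it in the page through Flash (not great: cumbersome and not very flexible)
HTML5 (I am not very familiar with HTML5)
I chose POI, so let's get started.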
Step one: add the jars through Maven.
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.14</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.14</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.14</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>xdocreport</artifactId>
        <version>1.0.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml-schemas</artifactId>
        <version>3.14</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>ooxml-schemas</artifactId>
        <version>1.3</version>
    </dependency>
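When parsing, POI runs into Word version differences: the classes that read one format cannot be used on the other. Word 2003 (.doc) and Word 2007 (.docx) files therefore need different conversion methods, built on HWPFDocument and XWPFDocument respectively.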
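Since the two formats need different code paths, it can help to check which one a file really is before picking a converter, rather than trusting the extension. The following is only a minimal sketch: `WordFormatSniffer` and its `Format` enum are hypothetical names introduced here, not part of POI. It relies on two well-known signatures: .doc files are OLE2 compound files, and .docx files are ZIP archives. Note that the signature identifies the container only, so an .xls file or an arbitrary ZIP would match as well.

```java
import java.util.Arrays;

/** Sketch (hypothetical helper): guess the Word container format from a file's leading bytes. */
final class WordFormatSniffer {

    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound-file signature, used by Word 97-2003 .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local-file-header signature; a .docx file is a ZIP archive.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** head should hold at least the first 8 bytes of the file. */
    static Format detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Format.DOC_OLE2;   // route to the HWPF-based conversion
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Format.DOCX_ZIP;   // route to the XWPF-based conversion
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

To use it, read the first 8 bytes of the uploaded file into an array and pass them to `WordFormatSniffer.detect`, then call the matching conversion method.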
First, Word 2007 (.docx):
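    // Imports (converter package names as in xdocreport 1.0.x)
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import org.apache.poi.xwpf.converter.core.BasicURIResolver;
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.junit.Test;

    @Test
    public void word2007ToHtml() throws Exception {
        String filepath = "e:/files/";
        String sourceFileName = filepath + "前言.docx";
        String targetFileName = filepath + "1496717486420.html";
        // filepath already ends with "/", so no leading slash here
        String imagePathStr = filepath + "image/";

        // try-with-resources also closes the input stream, which the
        // original version left open
        try (InputStream in = new FileInputStream(sourceFileName)) {
            XWPFDocument document = new XWPFDocument(in);
            XHTMLOptions options = XHTMLOptions.create();
            // Folder where the extracted images are stored
            options.setExtractor(new FileImageExtractor(new File(imagePathStr)));
            // Path used for the images inside the generated HTML
            options.URIResolver(new BasicURIResolver("image"));

            try (OutputStreamWriter outputStreamWriter =
                     new OutputStreamWriter(new FileOutputStream(targetFileName), "utf-8")) {
                XHTMLConverter xhtmlConverter = (XHTMLConverter) XHTMLConverter.getInstance();
                xhtmlConverter.convert(document, outputStreamWriter, options);
            }
        }
    }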
Next, Word 2003 (.doc), which I have not tested:
    import java.io.BufferedWriter;
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.PicturesManager;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.usermodel.PictureType;
    import org.junit.Test;
    import org.w3c.dom.Document;

    @Test
    public void test() {
        DocxToHtml("E://files//1496635038432.doc", "E://files//1496635038432.html");
    }

    // Despite its name, this method handles the binary .doc format (HWPF).
    public static void DocxToHtml(String fileAllName, String outPutFile) {
        // Open the input file as a stream; try-with-resources closes it
        try (InputStream in = new FileInputStream(fileAllName)) {
            // Load the stream into a Word document object
            HWPFDocument wordDocument = new HWPFDocument(in);
            // Build an empty DOM document to receive the HTML
            DocumentBuilderFactory domBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder domBuilder = domBuilderFactory.newDocumentBuilder();
            Document dom = domBuilder.newDocument();
            // Create the converter that writes into that DOM
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(dom);
            // Decide how images are referenced. This callback only returns the
            // suggested name; it does not write the image bytes anywhere.
            wordToHtmlConverter.setPicturesManager(new PicturesManager() {
                public String savePicture(byte[] content, PictureType pictureType,
                        String suggestedName, float widthInches, float heightInches) {
                    return suggestedName;
                }
            });
            // Convert the Word document
            wordToHtmlConverter.processDocument(wordDocument);

            // Save the pictures contained in the document (disabled)
            /*
            List<?> pics = wordDocument.getPicturesTable().getAllPictures();
            if (pics != null) {
                for (int i = 0; i < pics.size(); i++) {
                    Picture pic = (Picture) pics.get(i);
                    try {
                        pic.writeImageContent(new FileOutputStream("E:/test/" + pic.suggestFullFileName()));
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    }
                }
            }
            */

            // Fetch the resulting DOM from the converter
            Document htmlDocument = wordToHtmlConverter.getDocument();
            DOMSource domSource = new DOMSource(htmlDocument);
            // Serialize into an in-memory byte stream
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            StreamResult streamResult = new StreamResult(out);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            // Output settings
            serializer.setOutputProperty(OutputKeys.ENCODING, "GB2312");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            // Decode with the same charset the serializer used; the original
            // relied on the platform default, which garbles text elsewhere
            writeFile(new String(out.toByteArray(), "GB2312"), outPutFile);
            out.close();
        } catch (IOException | TransformerException | ParserConfigurationException e) {
            // Covers FileNotFoundException and TransformerConfigurationException too
            e.printStackTrace();
        }
    }

    public static void writeFile(String content, String path) {
        // try-with-resources closes the writer and the underlying file stream
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(new File(path)), "GB2312"))) {
            bw.write(content);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
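The serialization step in the .doc conversion goes from DOM to bytes, then to a String, then back to bytes in writeFile. One alternative is to let the Transformer write straight to an output stream, so that the charset is named only once. The sketch below illustrates this with JDK classes only. `HtmlDomWriter` and `sampleDocument` are hypothetical names, the hand-built DOM merely stands in for `wordToHtmlConverter.getDocument()`, and UTF-8 is my assumption rather than the encoding used above.

```java
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch (hypothetical helper): serialize an HTML DOM directly to a stream. */
final class HtmlDomWriter {

    /** Writes the DOM as HTML in the given encoding; the caller closes out. */
    static void write(Document dom, OutputStream out, String encoding)
            throws TransformerException {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, encoding);
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        // The serializer encodes the bytes itself, so no String round trip is needed.
        serializer.transform(new DOMSource(dom), new StreamResult(out));
    }

    /** Stand-in for the converter's DOM: html > body > p with two CJK characters. */
    static Document sampleDocument() throws ParserConfigurationException {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element p = dom.createElement("p");
        p.setTextContent("\u524D\u8A00");
        body.appendChild(p);
        html.appendChild(body);
        dom.appendChild(html);
        return dom;
    }
}
```

In the conversion method this would amount to opening a FileOutputStream on the target path in a try-with-resources block and passing it to `HtmlDomWriter.write` together with the converter's document.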
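The two conversion methods above (word2007ToHtml and DocxToHtml) turn Word documents into HTML. Note that in IE8 the table borders are not displayed.
I will keep improving these methods.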
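For the missing table borders, one workaround that may be worth trying is to add an explicit CSS rule to the generated page after conversion. This is only a sketch and I have not verified it against IE8: `TableBorderCss` is a hypothetical helper, and the rule itself (a 1px solid black border on every cell) is an assumption about the desired look, not something taken from the Word document.

```java
/** Sketch (hypothetical helper): add a fallback table-border rule to generated HTML. */
final class TableBorderCss {

    // Assumed fallback style; adjust width and colour as needed.
    static final String RULE =
        "<style type=\"text/css\">table{border-collapse:collapse;}"
        + "td,th{border:1px solid #000;}</style>";

    /** Inserts RULE just before the closing head tag; returns html unchanged if there is none. */
    static String inject(String html) {
        int idx = indexOfIgnoreCase(html, "</head>");
        if (idx < 0) {
            return html;
        }
        return html.substring(0, idx) + RULE + html.substring(idx);
    }

    // Case-insensitive search, because converters may emit <HEAD> or <head>.
    private static int indexOfIgnoreCase(String text, String needle) {
        for (int i = 0; i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Because the rule is placed last inside the head, it comes after any styles the converter generated there. Inline style attributes on individual cells would still take precedence over it.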