Online File Preview: Converting doc and docx to PDF (Part 1)
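1. Preface
Document conversion is a tough nut to crack, but it is also indispensable. The knowledge base product we are building runs into exactly this problem: document conversion, accurate full-text search, and the knowledge conversion rate are the basic ingredients of a knowledge base product. When I first looked into it I racked my brains: build it ourselves, or integrate a third party? That choice is a real headache for small and medium-sized companies.
Searching online, I found many open-source examples built on POI, and at first I used POI to convert every document to HTML. The options I came across were:
1) On GitHub I found https://github.com/litter-fish/transform, a complete demo that covers basically every conversion you might want. Beginners can use it as a reference to get a basic conversion working, but reaching general-purpose quality takes a lot of extra effort. This open-source code converts with POI and iText (for PDF).
2) https://gitee.com/kekingcn/file-online-preview is an open-source project hosted on Gitee (OSChina). It is based on jodconverter, which performs the conversion by driving an installed OpenOffice or LibreOffice instance, in effect automating the office suite's "Save As" function.
3) Commercial products such as Yozo Office (永中office), Office 365, idocv, and Aspose.Words for Java (https://downloads.aspose.com/words/java).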
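2. Conversion approach
After trying many options, and after integrating Yozo's document conversion as well, I found that reaching good preview quality requires a second rendering pass. Yozo has been doing document conversion for more than ten years, and we cannot match that. After some thought I arrived at a reasonably workable approach: convert both doc and docx to PDF for preview, in both cases building on POI.
2.1. Converting doc to PDF
1) Converting doc to XML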
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToFoConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.w3c.dom.Document;

/**
 * doc to XML (XSL-FO).
 * PathMaster and OUTFILEFO are defined elsewhere in the project.
 */
public String toXML(String filePath) {
    // Directory that receives the images extracted from the document.
    final File nImageDir = new File(PathMaster.getWebRootPath()
            + File.separator + "temp" + File.separator + "images");
    try (POIFSFileSystem nPOIFSFileSystem = new POIFSFileSystem(new File(filePath))) {
        HWPFDocument nHWPFDocument = new HWPFDocument(nPOIFSFileSystem);
        WordToFoConverter nWordToFoConverter = new WordToFoConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());

        // Tell the converter which URI to write into the FO for each picture,
        // e.g. file:/F:/20.vscode/iWorkP/temp/images/0.jpg
        PicturesManager nPicturesManager = new PicturesManager() {
            public String savePicture(byte[] content, PictureType pictureType,
                    String suggestedName, float widthInches, float heightInches) {
                // File.toURI() yields a valid file: URI on every platform.
                return new File(nImageDir, suggestedName).toURI().toString();
            }
        };
        nWordToFoConverter.setPicturesManager(nPicturesManager);
        nWordToFoConverter.processDocument(nHWPFDocument);

        // Write the embedded pictures to the image directory.
        if (!nImageDir.exists()) {
            nImageDir.mkdirs();
        }
        for (Picture nPicture : nHWPFDocument.getPicturesTable().getAllPictures()) {
            try (OutputStream nPictureOut = new FileOutputStream(
                    new File(nImageDir, nPicture.suggestFullFileName()))) {
                nPicture.writeImageContent(nPictureOut);
            }
        }

        // Serialize the generated FO document to OUTFILEFO.
        Document nFoDocument = nWordToFoConverter.getDocument();
        try (OutputStream nFoOut = new FileOutputStream(OUTFILEFO)) {
            Transformer nTransformer = TransformerFactory.newInstance().newTransformer();
            nTransformer.setOutputProperty(OutputKeys.ENCODING, "GBK");
            nTransformer.setOutputProperty(OutputKeys.INDENT, "yes"); // must be "yes" or "no"
            nTransformer.setOutputProperty(OutputKeys.METHOD, "xml");
            nTransformer.transform(new DOMSource(nFoDocument), new StreamResult(nFoOut));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // The FO output is written to OUTFILEFO; the method itself always returns an empty string.
    return "";
}
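The last step of the method above serializes a DOM tree through a JAXP Transformer. Below is a minimal, JDK-only sketch of that step in isolation. The class name DomSerializeSketch and the tiny stand-in document are hypothetical; in the real flow the tree comes from WordToFoConverter.getDocument(). The sketch uses UTF-8 instead of GBK only so that it runs on any JDK, and it passes the lowercase "yes" that the XSLT output specification defines for the indent property.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomSerializeSketch {

    /** Builds a tiny stand-in DOM tree (hypothetical content, not real FO). */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            Element block = doc.createElement("block");
            block.setTextContent("Hello");
            root.appendChild(block);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes a DOM tree to an XML string with the given encoding declared. */
    static String serialize(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument(), "UTF-8"));
    }
}
```

Swapping the StringWriter for a FileOutputStream gives the file-based variant used in toXML.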
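2) Converting XML to PDF
Here I use FOP to turn the XML into PDF, a pleasant recent discovery. The POI website recommends it, but I had never looked at it closely. The JARs it ships are identical to many of the JARs in Yozo's high-definition package, so this now looks like the right track. Those interested can dig deeper. My source code is on GitHub at https://github.com/liuxufeijidian/file.convert.master/tree/master; the environment is already configured, so you only need to supply doc and docx files.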
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.xml.sax.SAXException;

/*
 * XML (XSL-FO) to PDF.
 * CONFIG, OUTFILEFO and OUTFILEPDF are defined elsewhere in the project.
 */
public void xmlToPDF() throws SAXException, TransformerException {
    try {
        // Step 1: Construct a FopFactory by specifying a reference to the configuration file
        // (reuse if you plan to render multiple documents!)
        FopFactory fopFactory = FopFactory.newInstance(new File(CONFIG));

        // Step 2: Set up output stream.
        // Note: Using BufferedOutputStream for performance reasons (helpful with FileOutputStreams).
        // try-with-resources closes the stream even if the transformation fails.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(OUTFILEPDF))) {
            // Step 3: Construct fop with desired output format
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);

            // Step 4: Setup JAXP using identity transformer
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(); // identity transformer

            // Step 5: Setup input and output for XSLT transformation
            // Setup input stream
            Source src = new StreamSource(OUTFILEFO);
            // Resulting SAX events (the generated FO) must be piped through to FOP
            Result res = new SAXResult(fop.getDefaultHandler());

            // Step 6: Start XSLT transformation and FOP processing
            transformer.transform(src, res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
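Steps 4 to 6 in the code above rely on a JAXP identity transformer: it applies no stylesheet and simply replays the input XML as SAX events into whatever handler the SAXResult wraps. The JDK-only sketch below illustrates that mechanism with a hypothetical ElementCounter handler standing in for fop.getDefaultHandler(); the class and method names are assumptions for illustration, and no FOP classes are involved.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class IdentityToSaxSketch {

    /** Hypothetical stand-in for FOP's handler: counts the start-element events it receives. */
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    /** Pipes the XML through an identity transformer and returns the number of elements seen. */
    static int countElements(String xml) {
        try {
            ElementCounter handler = new ElementCounter();
            // newTransformer() without a stylesheet yields the identity transformation.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(xml)), new SAXResult(handler));
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fo = "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                + "<fo:block>a</fo:block><fo:block>b</fo:block></fo:root>";
        System.out.println(countElements(fo));
    }
}
```

In xmlToPDF the same events go to FOP's handler, which lays out the pages and writes the PDF to the output stream as the events arrive.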
2.1.3
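Many people convert Word straight to HTML, but that requires writing your own second-pass rendering code, which is fairly complex. I take a detour instead: convert doc to XML, then XML to PDF. Rendering the resulting PDF with pdfjs gives a preview that looks the same as opening the file in a browser. For details on previewing with pdfjs, see https://blog.csdn.net/liuxufeijidian/article/details/82260199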
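The doc to XML to PDF detour needs one intermediate FO file and one final PDF per input document. The code earlier writes to fixed locations (OUTFILEFO and OUTFILEPDF); as a minimal sketch, and only as one possible design, both paths could instead be derived from the source file name so that several documents do not overwrite each other. The class ConversionPathsSketch and its methods are hypothetical and not part of the original project.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ConversionPathsSketch {

    /** Strips the last extension: "report.doc" becomes "report". */
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Path of the intermediate XSL-FO file for the given source document. */
    static Path foPath(Path workDir, String sourceFileName) {
        return workDir.resolve(baseName(sourceFileName) + ".fo");
    }

    /** Path of the final PDF for the given source document. */
    static Path pdfPath(Path workDir, String sourceFileName) {
        return workDir.resolve(baseName(sourceFileName) + ".pdf");
    }

    public static void main(String[] args) {
        Path workDir = Paths.get("temp");
        System.out.println(foPath(workDir, "report.doc"));
        System.out.println(pdfPath(workDir, "report.doc"));
    }
}
```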
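Ending: if you want to see the results, get the source code from GitHub at https://github.com/litter-fish/transform, set up your doc and docx files, and run the conversion. I will keep optimizing and updating the document conversion.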