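Removing watermarks with Apache PDFBox
Requirement
While learning COBOL I found an e-book, but it is watermarked.
WPS can erase the watermark, but that requires a paid membership.
Can the watermark be removed with a Java program instead?
Implementation
First, read through some reference material to get oriented.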
Step 1: Download org.apache.pdfbox:pdfbox-app:3.0.2. It is an executable jar. Its debug command opens a Swing GUI in which you can load a PDF file and inspect its internal structure.
java -jar pdfbox-app-3.0.2.jar
Run without a command, it prints the usage information:
Usage: pdfbox [COMMAND] [OPTIONS]
Commands:
  debug          Analyzes and inspects the internal structure of a PDF document
  decrypt        Decrypts a PDF document
  encrypt        Encrypts a PDF document
  decode         Writes a PDF document with all streams decoded
  export:images  Extracts the images from a PDF document
  export:xmp     Extracts the xmp stream from a PDF document
  export:text    Extracts the text from a PDF document
  export:fdf     Exports AcroForm form data to FDF
  export:xfdf    Exports AcroForm form data to XFDF
  import:fdf     Imports AcroForm form data from FDF
  import:xfdf    Imports AcroForm form data from XFDF
  overlay        Adds an overlay to a PDF document
  print          Prints a PDF document
  render         Converts a PDF document to image(s)
  merge          Merges multiple PDF documents into one
  split          Splits a PDF document into number of new documents
  fromimage      Creates a PDF document from images
  fromtext       Creates a PDF document from text
  version        Gets the version of PDFBox
  help           Display help information about the specified command.
See 'pdfbox help <command>' to read about a specific subcommand
java -jar pdfbox-app-3.0.2.jar debug
File -> Open... -> <choose your pdf file>
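Inspecting the structure of Page:1 shows that each page is made up of two XObjects: Im[pageNum] is the book content of the page, stored as an image, and Xi[pageNum-1] is the watermark of the page, stored as text.
The overall layout is a tree: Document - Page - Resources - XObject - Xi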
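The two names are numbered differently: the image index follows the 1-based page number, while the watermark index starts at 0. A minimal sketch of that mapping, assuming the naming seen in this particular file holds for every page (the class and method names here are hypothetical, not part of PDFBox):

```java
final class XObjectNames {

    /** Name of the scanned page image for a 1-based page number, e.g. page 1 -> "Im1". */
    static String imageName(int pageNum) {
        requireValidPage(pageNum);
        return "Im" + pageNum;
    }

    /** Name of the watermark for a 1-based page number, e.g. page 1 -> "Xi0". */
    static String watermarkName(int pageNum) {
        requireValidPage(pageNum);
        return "Xi" + (pageNum - 1);
    }

    private static void requireValidPage(int pageNum) {
        if (pageNum < 1) {
            throw new IllegalArgumentException("page numbers start at 1: " + pageNum);
        }
    }

    public static void main(String[] args) {
        // Show which two XObjects are expected on the first three pages.
        for (int pageNum = 1; pageNum <= 3; pageNum++) {
            System.out.println("Page " + pageNum + ": "
                    + imageName(pageNum) + " + " + watermarkName(pageNum));
        }
    }
}
```

This is why a loop over the pages can use a counter that starts at 0 to build the watermark name. Other PDF files will most likely use different names, so check them in the debugger first.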
Step 2: Write the program.
Add the dependency:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
</dependency>
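import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.junit.jupiter.api.Test; // JUnit 5 assumed; use org.junit.Test for JUnit 4

@Test
public void testaa() throws IOException {
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument doc = Loader.loadPDF(new File("D:\\202404 cobol学习\\[精通COBOL大型机商业编程技术详解].马千里.修订版.pdf"))) {
        int i = 0;
        for (PDPage page : doc.getPages()) {
            // Watermark name for this page: Xi0 on page 1, Xi1 on page 2, ...
            COSName watermark = COSName.getPDFName("Xi" + i++);
            PDResources resources = page.getResources();
            if (resources == null) continue; // page has no resources at all
            COSDictionary dic = resources.getCOSObject().getCOSDictionary(COSName.XOBJECT);
            if (dic != null) {
                dic.removeItem(watermark); // drop the watermark XObject entry
            }
        }
        doc.save(new File("D:\\202404 cobol学习\\无水印.pdf"));
    }
}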
After running it, the watermark is gone from every page.
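Removing the entry from the XObject dictionary leaves the page content stream untouched, so it may still contain a `/Xi0 Do` call that now points at a missing resource. Viewers generally skip such calls, but a stricter tool might complain. The sketch below is one possible follow-up, not part of the original program: it rewrites a page's content stream without those calls. It assumes PDFBox 3.0.x and that the watermark is drawn directly from the page content stream rather than from a nested form; `WatermarkCallStripper` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

final class WatermarkCallStripper {

    /** Returns a copy of the tokens without any "/name Do" pair that draws the given XObject. */
    static List<Object> withoutDrawCalls(List<Object> tokens, COSName xObjectName) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            boolean isDo = token instanceof Operator
                    && OperatorName.DRAW_OBJECT.equals(((Operator) token).getName());
            if (isDo && !kept.isEmpty() && xObjectName.equals(kept.get(kept.size() - 1))) {
                // Drop the name operand that was already kept, and skip the operator itself.
                kept.remove(kept.size() - 1);
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    /** Replaces the page's content stream with one that no longer draws the given XObject. */
    static void strip(PDDocument doc, PDPage page, COSName xObjectName) throws IOException {
        List<Object> tokens = new PDFStreamParser(page).parse();
        PDStream rewritten = new PDStream(doc);
        try (OutputStream out = rewritten.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(withoutDrawCalls(tokens, xObjectName));
        }
        page.setContents(rewritten);
    }
}
```

Inside the page loop this could be called as `WatermarkCallStripper.strip(doc, page, COSName.getPDFName("Xi" + i))` before the counter is incremented. Any `q`/`Q` pair that wrapped the removed call is left in place, which is harmless.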