Removing watermarks from PDF files with Python

There are two approaches.

Method 1: use the PyPDF2 library, when the watermark text is already known

Define a list named watermark whose elements are the watermark strings.

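    # Import paths below match PyPDF2 1.x; they moved in later releases.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from PyPDF2.generic import NameObject, NullObject
    from PyPDF2.pdf import ContentStream
    from PyPDF2.utils import b_

    with open(pdf_file, 'rb') as f:
        source = PdfFileReader(f)
        output = PdfFileWriter()

        for page_num in range(source.getNumPages()):
            page = source.getPage(page_num)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            for operands, operator in content.operations:
                # "Tj" is the PDF operator that shows a text string
                if operator == b_("Tj"):
                    text = operands[0]
                    print(text)  # inspect the strings to spot the watermark

                    if isinstance(text, str) and text in watermark:
                        # Blank out the watermark operand
                        # operands[0] = TextStringObject('')
                        operands[0] = NullObject()

            # Replace the page's content stream with the edited one
            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open('new_pdf.pdf', "wb") as outputStream:
            output.write(outputStream)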
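
The matching step in that loop can be tried out without a PDF at hand. The following is only a minimal sketch: the function name blank_watermark_ops and its blank parameter are hypothetical, and plain Python lists, strings and bytes stand in for the operand and operator objects that PyPDF2 would supply. Like the loop above, it only looks at the Tj operator, so text drawn with other operators (for example TJ arrays) is left untouched.

```python
def blank_watermark_ops(operations, watermark, blank=""):
    """Blank the operand of every Tj operation whose text is a known watermark.

    operations: list of (operands, operator) pairs, operands being a mutable list.
    Returns the number of operations that were blanked.
    """
    removed = 0
    for operands, operator in operations:
        # Only the "show text" operator is inspected, as in the loop above
        if operator == b"Tj" and operands:
            text = operands[0]
            if isinstance(text, str) and text in watermark:
                operands[0] = blank
                removed += 1
    return removed


# Hypothetical stand-in for content.operations
ops = [
    (["CONFIDENTIAL"], b"Tj"),
    (["Quarterly report"], b"Tj"),
    (["/F1", 12], b"Tf"),
]
removed = blank_watermark_ops(ops, ["CONFIDENTIAL"])
print(removed, ops)
```

With real PyPDF2 objects the blank value would be a PyPDF2 object such as NullObject() rather than an empty str.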

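Method 2: use the pdfplumber library, for batch processing PDFs where each file has a different watermark and the watermark text is not known

When I first used pdfplumber, I printed page.chars and saw that it lists the attributes of every character in the PDF. In my documents the watermark characters had a value for non_stroking_color, while body-text characters had non_stroking_color set to None. page.chars returns a list, so I tried iterating over it and dropping the watermark characters based on non_stroking_color, but page.chars itself cannot be reassigned. Printing the page's other attributes showed that page.objects is a dict, one of whose entries is 'char', and that page.objects['char'] can be reassigned. After assigning the filtered list to it, later calls to extract_text() or extract_table() no longer pick up the watermark content.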

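    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            # Remove the watermark: keep only chars with no non_stroking_color
            new_chars = []

            for char in page.objects['char']:
                if not char['non_stroking_color']:
                    new_chars.append(char)

            page.objects['char'] = new_chars

            # Extraction from here on no longer sees the watermark characters
            text = page.extract_text()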
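Both methods work. PyPDF2 can save the de-watermarked content as a new PDF file, whereas pdfplumber only parses files and does not write them, so with pdfplumber you can only keep parsing the filtered content.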
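
As an alternative to reassigning page.objects['char'], pdfplumber's README documents a Page.filter(test_function) method that returns a version of the page containing only the objects for which the test function returns True. The sketch below assumes the same rule as above (watermark characters have a non_stroking_color, body text does not). The names is_body_object and extract_without_watermark are hypothetical, and pdfplumber is imported inside the function so the predicate can be exercised on its own.

```python
def is_body_object(obj):
    """Keep non-char objects, and chars that have no non_stroking_color."""
    if obj.get("object_type") != "char":
        return True
    return not obj.get("non_stroking_color")


def extract_without_watermark(pdf_path):
    """Return the text of each page with watermark characters filtered out."""
    import pdfplumber  # third-party; imported lazily on purpose

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            filtered = page.filter(is_body_object)
            # extract_text() may return None for an empty page
            texts.append(filtered.extract_text() or "")
    return texts
```

For batch processing, extract_without_watermark could be called once per file path.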

posted @ 2022-07-19 10:26 _Masami
