iTextSharp upgrade to iText 7

Why am I getting duplicate pages extracted from iText7 C#?

Actually it is not the same text being returned from sequential pages. Instead you get

  • the text from page 1 when you extract page 1;
  • the text from pages 1 and 2 when you extract page 2;
  • the text from pages 1, 2, and 3 when you extract page 3;
  • ...

Often this happens in code that re-uses one text extraction strategy object for multiple pages. That is not the case in your code, though: you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.

And indeed, each page of your document does contain the contents of all previous pages, too, merely outside its crop box. To extract only the text in the respective page crop box you have to filter, e.g. like this:

using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A fresh strategy per page, so no text is carried over between pages.
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    // Pass on only the text that lies inside this page's crop box.
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();
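
For contrast, the strategy re-use pitfall mentioned earlier can be sketched roughly as follows. This is only an illustration, not code from the question: the file name `input.pdf` and the variable `sharedStrategy` are made up for the example. It assumes that `SimpleTextExtractionStrategy` keeps the text it has already collected, so a shared instance returns the text of the current page plus that of all earlier pages.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical input file, not the document from the question.
string src = @"input.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));

// Pitfall: one strategy instance shared by all pages.
// It is never reset, so the extracted text grows with every page.
var sharedStrategy = new SimpleTextExtractionStrategy();

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    var text = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), sharedStrategy);

    // With a shared strategy the length typically increases from page to page.
    Console.WriteLine("PAGE {0}: {1} characters", i, text.Length);
}

pdfDoc.Close();
```

Moving the `new SimpleTextExtractionStrategy()` call inside the loop, as the code above already does, avoids this accumulation.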

Note that when the strategy is changed to LocationTextExtractionStrategy, the extracted content is the same as the original output.

 

Author: Chuck Lu