iTextSharp upgrade to iText 7
Why am I getting duplicate pages extracted from iText7 C#?
Actually it is not the same text being returned from sequential pages. Instead you get
- the text from page 1 when you extract page 1;
- the text from pages 1 and 2 when you extract page 2;
- the text from pages 1, 2, and 3 when you extract page 3;
- ...
Often this happens in code that re-uses a text extraction strategy for multiple pages. But that's not the case in your code: you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.
And indeed, each page of your document does contain the contents of all previous pages, too, merely outside its crop box. To extract only the text in the respective page crop box you have to filter, e.g. like this:
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Console.WriteLine("\n285187 Filtered\n============\n");
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A fresh strategy per page, so text does not accumulate across pages.
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);
    // Accept only text that lies in the page's crop box.
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);
    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);
    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}
pdfDoc.Close();
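For contrast, the strategy re-use pitfall mentioned earlier can be reproduced with a sketch like the following. It assumes the iText 7 package is referenced and that the first command-line argument is the path to some multi-page PDF; the class name StrategyReuseDemo is purely illustrative. With a shared SimpleTextExtractionStrategy the returned text keeps growing from page to page, while a fresh instance returns only what the current page's content draws.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Illustrative class name; not part of the original answer.
class StrategyReuseDemo
{
    static void Main(string[] args)
    {
        // Assumption: args[0] is the path to a multi-page PDF.
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(args[0]));
        try
        {
            // Pitfall: one strategy instance shared by every page keeps appending text.
            var shared = new SimpleTextExtractionStrategy();
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                var page = pdfDoc.GetPage(i);
                string cumulative = PdfTextExtractor.GetTextFromPage(page, shared);
                // A new instance per page starts from an empty buffer.
                string perPage = PdfTextExtractor.GetTextFromPage(page, new SimpleTextExtractionStrategy());
                Console.WriteLine("PAGE {0}: shared = {1} chars, fresh = {2} chars",
                    i, cumulative.Length, perPage.Length);
            }
        }
        finally
        {
            pdfDoc.Close();
        }
    }
}
```

If the two character counts already match on every page and the output still accumulates, the strategy is not the culprit and the page content itself is the likely cause, as in the document discussed here.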
Note that once the strategy is switched to LocationTextExtractionStrategy, the extracted content is the same as it was originally.
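A minimal sketch of that variant, assuming the same crop box filter is kept and only the strategy class is swapped. The helper name ExtractVisibleText and the class CropBoxTextExtractor are hypothetical and exist only for illustration.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper class; the names are not from iText or the original post.
static class CropBoxTextExtractor
{
    // Returns the text of one page, restricted to the page's crop box.
    public static string ExtractVisibleText(PdfPage page)
    {
        // LocationTextExtractionStrategy orders text chunks by their position on the page.
        var strategy = new LocationTextExtractionStrategy();
        // Same filter as in the loop above: only text in the crop box is passed on.
        var filters = new IEventFilter[] { new TextRegionEventFilter(page.GetCropBox()) };
        var listener = new FilteredTextEventListener(strategy, filters);
        return PdfTextExtractor.GetTextFromPage(page, listener);
    }
}
```

Inside the page loop this would be called as CropBoxTextExtractor.ExtractVisibleText(pdfDoc.GetPage(i)), which still creates a new strategy and a new filter for every page.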
Author: Chuck Lu