[Repost] Extract Text from PDF in C# (100% .NET)
This article is from http://www.codeproject.com/KB/cs/PDFToText.aspx.
Introduction
This is a 100% .NET solution to extract text from PDF documents.
Background
Dan Letecky posted a nice code sample showing how to extract text from PDF documents in C# using PDFBox. His solution works well, but it has one drawback: the additional libraries it requires total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.
Using the Code
To use this solution in your projects, follow these steps:
- Add references to itextsharp.dll and SharpZiplib.dll
- Add the PDFParser.cs class to your project
Then you can use the newly added class in the following way:
// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();

// extract the text
String result = pdfParser.ExtractText(pdfFile);
I also created a small console application which uses the class and shows the progress of the conversion. Keep in mind that when you extract text from large PDF files, holding all of the resulting text in memory is not the best approach. In those cases you should write the extracted text to the output file after parsing each page.
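The page-by-page approach can be sketched as follows. This is only an illustration under stated assumptions: the PagewiseTextWriter class and the extractPage callback are hypothetical names that are not part of PDFParser.cs, and the callback stands in for whatever per-page extraction routine you have available.

```csharp
using System;
using System.IO;

public static class PagewiseTextWriter
{
    // Writes the text of each page to 'output' as soon as it is extracted,
    // so only one page of text is held in memory at a time.
    // 'extractPage' is a hypothetical callback (page numbers start at 1).
    public static void WritePages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        if (extractPage == null) throw new ArgumentNullException("extractPage");
        if (output == null) throw new ArgumentNullException("output");

        for (int page = 1; page <= pageCount; page++)
        {
            output.Write(extractPage(page));
            output.Flush(); // push this page out before parsing the next one
        }
    }
}
```

To write to a file, pass a StreamWriter wrapped in a using block as the output argument, so that the file is closed even if extraction fails partway through.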
How Does It Work?
My code is based on the algorithm in the C program ExtractPDFText. I use iTextSharp's PdfReader class to obtain the decompressed content stream of every page (the page content is usually stored deflate-compressed inside the PDF), and then a simple function, ExtractTextFromPDFBytes, to extract the text from that content.
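To give a rough idea of what a function like ExtractTextFromPDFBytes has to do, here is a simplified sketch. It is not the code from PDFParser.cs: the ContentStreamText class and its method names are hypothetical. It assumes the input is an already decompressed page content stream (for example, the bytes returned by PdfReader.GetPageContent), collects only literal strings written in parentheses, treats the ET, T*, Td and TD operators as line breaks, and maps each byte directly to a character. Hex strings, comments and font encodings are ignored.

```csharp
using System;
using System.Text;

public static class ContentStreamText
{
    // Collects the literal (...) strings found in a page content stream.
    public static string ExtractLiteralStrings(byte[] content)
    {
        StringBuilder text = new StringBuilder();
        StringBuilder token = new StringBuilder();
        int i = 0;
        while (i < content.Length)
        {
            char c = (char)content[i];
            if (c == '(')
            {
                EndToken(token, text);
                i = ReadLiteralString(content, i + 1, text);
            }
            else if (c == '/')
            {
                // A name such as /F1 starts a new token; keep the slash so
                // names are never mistaken for operators.
                EndToken(token, text);
                token.Append(c);
                i++;
            }
            else if (char.IsWhiteSpace(c) || "[]<>{}".IndexOf(c) >= 0)
            {
                EndToken(token, text);
                i++;
            }
            else
            {
                token.Append(c);
                i++;
            }
        }
        EndToken(token, text);
        return text.ToString();
    }

    // Emits a line break when the finished token is a positioning operator.
    private static void EndToken(StringBuilder token, StringBuilder text)
    {
        string op = token.ToString();
        token.Length = 0;
        bool lineBreak = op == "ET" || op == "T*" || op == "Td" || op == "TD";
        if (lineBreak && text.Length > 0 && text[text.Length - 1] != '\n')
            text.Append('\n');
    }

    // Reads one literal string starting just after its opening parenthesis
    // and returns the index just past the closing parenthesis.
    private static int ReadLiteralString(byte[] content, int i, StringBuilder text)
    {
        int depth = 1;
        while (i < content.Length)
        {
            char c = (char)content[i++];
            if (c == '\\' && i < content.Length)
            {
                char e = (char)content[i++];
                if (e >= '0' && e <= '7')
                {
                    // Octal escape: one to three digits.
                    int code = e - '0';
                    for (int n = 0; n < 2 && i < content.Length
                         && content[i] >= '0' && content[i] <= '7'; n++)
                        code = code * 8 + (content[i++] - '0');
                    text.Append((char)(code & 0xFF));
                }
                else if (e == 'n') text.Append('\n');
                else if (e == 'r') text.Append('\r');
                else if (e == 't') text.Append('\t');
                else if (e == 'b') text.Append('\b');
                else if (e == 'f') text.Append('\f');
                else if (e == '\r' || e == '\n')
                {
                    // Backslash at end of line: the string continues.
                    if (e == '\r' && i < content.Length && content[i] == '\n') i++;
                }
                else text.Append(e); // \( \) \\ and unknown escapes
            }
            else if (c == '(') { depth++; text.Append(c); }
            else if (c == ')')
            {
                if (--depth == 0) return i;
                text.Append(c);
            }
            else text.Append(c);
        }
        return i; // unterminated string: stop at the end of the data
    }
}
```

A call such as ContentStreamText.ExtractLiteralStrings(pageBytes), where pageBytes holds one page's content stream, returns the collected text. Because the bytes are not decoded through the page's fonts, text that uses custom encodings will come out wrong with this sketch.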
Further Improvements
Although the code worked well for me, I could not find in Adobe's PDF reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
History
- 20th May, 2006: Initial post
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
About the Author
Zollor, Web Developer, Romania