PDF Parsing 2

Posted on 2012-11-13 17:54 xgbzsc

1. Processing PDF documents with PDFBox

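PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system, so the same file can be used on Windows, Unix, Mac OS, and other platforms.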

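A PDF file packages text, fonts, formatting, colors, and device- and resolution-independent graphics and images into a single file. To extract the text, the file has to be parsed according to this format. Fortunately, a number of tools already do this work, and PDFBox is one of them.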
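As a small illustration of what "parsing according to the file format" starts with, the sketch below checks the header that a PDF file begins with: the marker %PDF- followed by a version number. It uses only the Java standard library, not PDFBox. The class and method names (PdfHeaderCheck, readVersion, readVersionFromFile) are made up for this example, and the sketch assumes the marker sits at the very first byte of the file; files with leading junk before the header are not handled.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfHeaderCheck {

    // Returns the characters after "%PDF-" that look like a version
    // (digits and dots, for example "1.4"), or null if the bytes do not
    // start with the PDF marker.
    static String readVersion(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF-")) {
            return null;
        }
        int end = 5; // index just past "%PDF-"
        while (end < text.length()
                && (Character.isDigit(text.charAt(end)) || text.charAt(end) == '.')) {
            end++;
        }
        return text.substring(5, end);
    }

    // Reads up to 16 bytes from the start of a file and inspects them.
    static String readVersionFromFile(String path) throws IOException {
        InputStream in = new FileInputStream(path);
        try {
            byte[] head = new byte[16];
            int n = in.read(head);
            if (n <= 0) {
                return null; // empty file
            }
            return readVersion(Arrays.copyOf(head, n));
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PdfHeaderCheck <file.pdf>");
            return;
        }
        String version = readVersionFromFile(args[0]);
        System.out.println(version == null ? "not a PDF" : "PDF version " + version);
    }
}
```

A check like this is only a quick sanity test before handing a file to a real parser; everything after the header (objects, cross-reference table, streams) is what a library such as PDFBox takes care of.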

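PDFBox is a Java API library for PDF documents. It supports a range of operations, such as creating and processing documents and extracting their content, and it also ships with some command-line utilities.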

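Its main features are:

Text extraction from PDF files
Merging PDF documents
Encrypting and decrypting PDF documents
Integration with the Lucene search engine
Filling in form data
Creating a PDF from a text file
Creating images from PDF pages
Printing PDF documents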

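2. Downloading PDFBox

PDFBox is one of the most commonly used tools for extracting text from PDF files. The latest version can be downloaded from http://sourceforge.net/projects/pdfbox/. This book uses PDFBox-0.7.3. PDFBox is an open-source Java PDF library that gives you access to the various pieces of information in a PDF file.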

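3. Configuration in Eclipse

The following steps create a project in Eclipse and import the PDF library.

(1) Create an ordinary Java project named pdfprj in the Eclipse workspace.

(2) Unzip the downloaded PDFBox-0.7.3.zip.

(3) Open the external directory, which contains all the external packages that PDFBox uses. Copy the following jar files into the lib directory of the pdfprj project (create the lib directory first if it does not exist yet).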

bcmail-jdk14-132.jar
bcprov-jdk14-132.jar
checkstyle-all-4.2.jar
FontBox-0.1.0-dev.jar
lucene-core-2.0.0.jar

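Then copy PDFBox-0.7.3.jar from the PDFBox lib directory into the project's lib directory.

(4) Right-click the project and choose "Build Path -> Configure Build Path -> Add JARs" from the context menu, then add all the jar files in the project's lib directory to the project's Build Path.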
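After step (4), a quick way to confirm that the jars really ended up on the Build Path is to ask the class loader for a couple of PDFBox classes. The sketch below is only a suggestion: the class name ClasspathCheck and the method isOnClasspath are made up for this example, and the two class names it looks up assume the org.pdfbox package layout of the 0.7.3 release.

```java
class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is loaded without being initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false; // the jar that holds the class is not on the path
        } catch (LinkageError e) {
            return false; // the class was found but something it depends on is missing
        }
    }

    public static void main(String[] args) {
        // Assumed class names for PDFBox 0.7.3; adjust them for other versions.
        String[] names = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.pdfbox.util.PDFTextStripper"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running this inside the pdfprj project should report both classes as found; a MISSING line usually means a jar was copied into lib but never added to the Build Path.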

4. Parsing PDF content with PDFBox

Extracting the text content of a PDF

Java code

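import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.util.List;

// Package names of PDFBox 0.7.3; later Apache releases use org.apache.pdfbox.*
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFBOX {

    private PDDocument document = null;

    public static void main(String[] args) throws IOException {
        String file = "d:\\pdf\\pdf-type.pdf";
        PDFBOX parse = new PDFBOX();
        parse.openPDFFile(file);
    }

    public void openPDFFile(String file) throws IOException {
        InputStream is = new FileInputStream(new File(file));
        try {
            this.document = this.parseDocument(is);
            // Get the number of pages
            List pages = this.document.getDocumentCatalog().getAllPages();
            int pageSize = pages.size();
            System.out.println("PDF page count: " + pageSize);
            this.getPdfText();
        } finally {
            // Always release the input stream
            is.close();
        }
    }

    public PDDocument parseDocument(InputStream input) throws IOException {
        PDDocument document = PDDocument.load(input);
        if (document.isEncrypted()) {
            try {
                // Try the empty user password
                document.decrypt("");
            } catch (CryptographyException e) {
                // Decryption failed; the extraction that follows will likely fail too
                e.printStackTrace();
            } catch (InvalidPasswordException e) {
                // The document needs a real password
                e.printStackTrace();
            }
        }
        return document;
    }

    /*
     * Extract the text content of the PDF
     */
    public void getPdfText() throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Note: the writer uses the platform default encoding
        OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(
                "d:\\pdf-type.txt"));
        BufferedWriter bw = new BufferedWriter(osw);
        try {
            // Keep the text of separate article beads apart
            stripper.setShouldSeparateByBeads(true);
            stripper.writeText(document, bw);
        } finally {
            bw.close();
            document.close();
        }
    }
}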
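The extraction code above writes to a fixed output path and relies on the platform default encoding, which can garble non-ASCII text such as Chinese on some systems. The sketch below shows one possible way around both points using only the standard library. The class TextOutput and its two methods are hypothetical helpers for this example, not part of PDFBox.

```java
import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class TextOutput {

    // Derives the output name from the PDF name: "a.pdf" -> "a.txt".
    // Paths without a .pdf suffix simply get ".txt" appended.
    static String toTextPath(String pdfPath) {
        if (pdfPath.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return pdfPath.substring(0, pdfPath.length() - 4) + ".txt";
        }
        return pdfPath + ".txt";
    }

    // Wraps any output stream in a buffered writer that always encodes as UTF-8.
    static Writer openUtf8Writer(OutputStream out) {
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Path reused from the article's own example
        System.out.println(toTextPath("d:\\pdf\\pdf-type.pdf")); // prints d:\pdf\pdf-type.txt
    }
}
```

In getPdfText, a writer obtained from openUtf8Writer(new FileOutputStream(toTextPath(file))) could be passed to stripper.writeText in place of bw, assuming the input path is made available to that method.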

Extracting PDF document information:

Java code

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Package names of PDFBox 0.7.3
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentCatalog;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.pdmodel.PDPage;
import org.pdfbox.pdmodel.PDResources;
import org.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

// The members below go inside a class body, for example the PDFBOX class.

public static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

/**
 * Parse the document information of a PDF and save its images
 * @param pdfPath path of the PDF document
 * @param imgSavePath path prefix for the extracted image files
 * @throws Exception
 */
public static void pdfParse( String pdfPath, String imgSavePath ) throws Exception
{
    InputStream input = null;
    File pdfFile = new File( pdfPath );
    PDDocument document = null;
    try{
        input = new FileInputStream( pdfFile );
        // Load the PDF document
        document = PDDocument.load( input );
        /** Document properties **/
        PDDocumentInformation info = document.getDocumentInformation();
        System.out.println( "Title: " + info.getTitle() );
        System.out.println( "Subject: " + info.getSubject() );
        System.out.println( "Author: " + info.getAuthor() );
        System.out.println( "Keywords: " + info.getKeywords() );
        System.out.println( "Creator application: " + info.getCreator() );
        System.out.println( "PDF producer: " + info.getProducer() );
        System.out.println( "Trapped: " + info.getTrapped() );
        System.out.println( "Creation date: " + dateFormat( info.getCreationDate() ));
        System.out.println( "Modification date: " + dateFormat( info.getModificationDate()));
        /** Page information **/
        PDDocumentCatalog cata = document.getDocumentCatalog();
        List pages = cata.getAllPages();
        int count = 1;
        for( int i = 0; i < pages.size(); i++ )
        {
            PDPage page = ( PDPage ) pages.get( i );
            if( null == page )
                continue;
            PDResources res = page.findResources();
            if( null == res )
                continue;
            // Get the images on this page
            Map imgs = res.getImages();
            if( null != imgs )
            {
                Set keySet = imgs.keySet();
                Iterator it = keySet.iterator();
                while( it.hasNext() )
                {
                    Object obj = it.next();
                    PDXObjectImage img = ( PDXObjectImage ) imgs.get( obj );
                    // write2file appends the image file extension itself
                    img.write2file( imgSavePath + count );
                    count++;
                }
            }
        }
    }finally{
        if( null != input )
            input.close();
        if( null != document )
            document.close();
    }
}

/**
 * Format a date value
 * @param calendar the date to format, may be null
 * @return the formatted date, or null if calendar is null
 */
public static String dateFormat( Calendar calendar )
{
    if( null == calendar )
        return null;
    SimpleDateFormat format = new SimpleDateFormat( DATE_FORMAT );
    return format.format( calendar.getTime() );
}
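To see what the date helper produces without needing a PDF at hand, the standalone sketch below formats a fixed Calendar value with the same pattern. It is a variation on the method above rather than a copy: it returns an empty string for a missing date instead of null, pins the locale, and formats in the Calendar's own time zone. The class name DateFormatDemo is made up for this example.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

class DateFormatDemo {

    static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

    // Formats the calendar in its own time zone; a missing date becomes "".
    static String dateFormat(Calendar calendar) {
        if (calendar == null) {
            return "";
        }
        // A new formatter per call, since SimpleDateFormat is not thread-safe
        SimpleDateFormat format = new SimpleDateFormat(DATE_FORMAT, Locale.ROOT);
        format.setTimeZone(calendar.getTimeZone());
        return format.format(calendar.getTime());
    }

    public static void main(String[] args) {
        Calendar posted = new GregorianCalendar(2012, Calendar.NOVEMBER, 13, 17, 54, 0);
        System.out.println(dateFormat(posted));           // prints 2012-11-13 17:54:00
        System.out.println("[" + dateFormat(null) + "]"); // prints []
    }
}
```

Returning an empty string keeps the printed report free of the literal text "null" when a PDF carries no creation or modification date.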
