Lucene search, creating the index: reading txt/word/excel/ppt/pdf content with POI

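Before Lucene can search documents, an index must be built for them. Building the index means reading each document's information (document name, upload time, content, and so on), tokenizing it, and writing the result to the index files. This post summarizes the step of reading the content of each document type.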
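To make the read, tokenize, write sequence concrete, here is a minimal sketch of a toy in-memory inverted index. It is only an illustration of the idea and not the Lucene API: the class name TinyIndex, its methods, and the naive word-splitting tokenizer are all assumptions made for this example. A real analyzer would also segment Chinese text, which this one does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Toy inverted index (hypothetical, for illustration only; not the Lucene API).
public class TinyIndex
{
    // term -> names of the documents that contain it
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>();

    // Naive tokenizer: lower-case the text and take runs of word characters.
    private static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value);
    }

    // "Indexing": record which document each term came from.
    public void Add(string docName, string content)
    {
        foreach (string term in Tokenize(content))
        {
            HashSet<string> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new HashSet<string>();
                postings[term] = docs;
            }
            docs.Add(docName);
        }
    }

    // "Searching": look the term up and return the matching document names, sorted.
    public List<string> Search(string term)
    {
        HashSet<string> docs;
        if (!postings.TryGetValue(term.ToLowerInvariant(), out docs))
        {
            return new List<string>();
        }
        return docs.OrderBy(d => d, StringComparer.Ordinal).ToList();
    }
}
```

The content string passed to Add is exactly what the reader methods in this post are meant to produce; everything else (document name, upload time) would be stored alongside it as extra fields.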

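I. A small tool I wrote earlier also had to read Word and Excel content, and it did so through COM components: add references to the COM libraries, import the namespaces (using Microsoft.Office.Interop.Word; using Microsoft.Office.Interop.Excel;), and then read the files with the following code: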

Reading Word

        public string readWORD(object filepath)
        {
            object readOnly = true;
            object missing = System.Reflection.Missing.Value;
            object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
            Microsoft.Office.Interop.Word.Application wordapp = new Microsoft.Office.Interop.Word.Application();
            try
            {
                // Open takes "ref object" arguments, so pass the boxed path rather than a string variable.
                Microsoft.Office.Interop.Word._Document doc = wordapp.Documents.Open(ref filepath, ref missing, ref readOnly);
                string content = doc.Content.Text;
                doc.Close(ref saveChanges, ref missing, ref missing);
                return content;
            }
            finally
            {
                // Always quit Word, otherwise a WINWORD.EXE process is left running.
                wordapp.Quit(ref saveChanges, ref missing, ref missing);
            }
        }

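Reading Excel

Reading Excel through COM means starting the Excel application, opening the workbook, getting the sheet names, and then reading the cell contents. It is tedious, so the code is omitted.

Alternatively, an Excel file can be read through OleDB, which treats the workbook as a database and returns its content as a DataTable: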

public DataSet ExcelToDS(string Path)
{
    string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
    string strExcel = "select * from [sheet1$]";
    DataSet ds = new DataSet();
    // The adapter opens the connection for Fill and closes it again afterwards.
    using (OleDbConnection conn = new OleDbConnection(strConn))
    using (OleDbDataAdapter myCommand = new OleDbDataAdapter(strExcel, conn))
    {
        myCommand.Fill(ds, "table1");
    }
    return ds;
}

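If the sheet name in the Excel file ([sheet1$]) is not fixed, it can be obtained as follows:
string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
using (OleDbConnection conn = new OleDbConnection(strConn))
{
    conn.Open(); // the connection must be open before the schema can be read
    DataTable schemaTable = conn.GetOleDbSchemaTable(System.Data.OleDb.OleDbSchemaGuid.Tables, null);
    // Column 2 is TABLE_NAME; row 0 is the first table returned.
    string tableName = schemaTable.Rows[0][2].ToString().Trim();
}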

Reading PowerPoint

        public string readPPT(object filepath)
        {
            string file = filepath.ToString();
            Microsoft.Office.Interop.PowerPoint.Application pa = new Microsoft.Office.Interop.PowerPoint.Application();
            // Open read-only, not as an untitled copy, and without a window.
            Microsoft.Office.Interop.PowerPoint.Presentation pp = pa.Presentations.Open(file, Microsoft.Office.Core.MsoTriState.msoTrue, Microsoft.Office.Core.MsoTriState.msoFalse, Microsoft.Office.Core.MsoTriState.msoFalse);
            string content = "";
            foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
            {
                foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
                {
                    // Pictures, lines and similar shapes have no text frame; skip them.
                    if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
                        content += shape.TextFrame.TextRange.Text;
                }
            }
            // Close the presentation before quitting the application.
            pp.Close();
            pa.Quit();
            return content;
        }

 

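Reading through COM is very slow, whereas building an index only needs the document content and needs it quickly and efficiently, so COM is not a good fit for indexing. POI provides the classes needed for each document type: add the relevant references, the code stays simple, and, more importantly, the content is retrieved quickly.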

II. Using POI

(1) First download the POI package, either through the "Manage NuGet Packages" tool in the solution or from the Apache website.

(2) The code below reads the content of each document type with POI (txt, word, excel, ppt and pdf).

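        /// <summary>
        /// Reads the content of a document, choosing the reader by file extension
        /// </summary>
        /// <param name="filepath">Path of the document</param>
        /// <param name="filename">Name of the document</param>
        /// <returns>The extracted text, or null for an unsupported extension</returns>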
        public string textToreader(string filepath, object filename)
        {
            string content = null;
            FileInfo file = new FileInfo(filename.ToString());
            switch (file.Extension.ToLower())
            {
                case ".txt":
                    content = readTXT(filepath);
                    break;
                case ".doc":
                    content = readWORD(filepath);
                    break;
                case ".docx":
                    content = readWORDX(filepath);
                    break;
                case ".xls":
                    content = readEXCEL(filepath);
                    break;
                 case ".xlsx":
                    content = readEXCELX(filepath);
                    break;
                case ".pdf":
                    content = readPDF(filepath);
                    break;
                case ".ppt":
                    content = readPPT(filepath);
                    break;
            }
            return content;
        }


        /// <summary>
        /// Reads a txt file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readTXT(string filepath)
        {
            // Dispose the reader so the file handle is released.
            using (StreamReader st = new StreamReader(filepath, Encoding.GetEncoding("gb2312")))
            {
                return st.ReadToEnd();
            }
        }


        /// <summary>
        /// Reads a Word 2003 (.doc) file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readWORD(string filepath)
        {
            FileInputStream fs = new FileInputStream(filepath);
            try
            {
                HWPFDocument doc = new HWPFDocument(fs);
                return doc.getDocumentText();
            }
            finally
            {
                fs.close(); // release the file handle
            }
        }


        /// <summary>
        /// Reads a Word 2007 (.docx) file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readWORDX(string filepath)
        {
            FileInputStream fs = new FileInputStream(filepath);
            try
            {
                XWPFDocument document = new XWPFDocument(fs);
                XWPFWordExtractor extractor = new XWPFWordExtractor(document);
                return extractor.getText();
            }
            finally
            {
                fs.close(); // release the file handle
            }
        }


        /// <summary>
        /// Reads an Excel 2003 (.xls) file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readEXCEL(object filepath)
        {
            string filename = filepath.ToString();
            // Open for reading while still allowing other processes to use the file.
            using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                POIFSFileSystem ps = new POIFSFileSystem(fs);
                HSSFWorkbook hwb = new HSSFWorkbook(ps);
                ExcelExtractor extractor = new ExcelExtractor(hwb);
                // true emits the formula text instead of the computed values; set to false to index the values.
                extractor.FormulasNotResults = true;
                extractor.IncludeSheetNames = true;
                return extractor.Text;
            }
        }


        /// <summary>
        /// Reads an Excel 2007 (.xlsx) file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readEXCELX(string filepath)
        {
            FileInputStream fis = new FileInputStream(filepath);
            try
            {
                XSSFWorkbook workbook = new XSSFWorkbook(fis);
                XSSFExcelExtractor extractor = new XSSFExcelExtractor(workbook);
                return extractor.getText();
            }
            finally
            {
                fis.close(); // release the file handle
            }
        }


        /// <summary>
        /// Reads a pdf file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readPDF(string filepath)
        {
            PDDocument doc = PDDocument.load(filepath);
            try
            {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                return pdfStripper.getText(doc);
            }
            finally
            {
                doc.close(); // close the document even if text extraction fails
            }
        }


        /// <summary>
        /// Reads a PowerPoint 2003 (.ppt) file
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string readPPT(string filepath)
        {
            FileInputStream fs = new FileInputStream(filepath);
            SlideShow ss = new SlideShow(new HSLFSlideShow(fs));
            Slide[] slides = ss.getSlides(); // get every slide

            string content = "";
            for (int i = 0; i < slides.Length; i++)
            {
                TextRun[] t = slides[i].getTextRuns(); // the text runs hold the slide's text
                for (int j = 0; j < t.Length; j++)
                {
                    content += t[j].getText();
                }
            }
            fs.close(); // release the file handle
            return content;
        }
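The readTXT method above assumes every text file is GB2312. As a hedged sketch of one way to handle a mix of UTF-8 and GB2312 files, the helper below tries a strict UTF-8 decode first and falls back to a caller-supplied encoding only if the bytes are not valid UTF-8. The class name TextDecoder and both method names are hypothetical and not part of POI or the original code.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: strict UTF-8 first, then a fallback encoding.
public static class TextDecoder
{
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        try
        {
            // throwOnInvalidBytes = true makes invalid UTF-8 raise an exception
            // instead of silently producing replacement characters.
            var strictUtf8 = new UTF8Encoding(false, true);
            // GetString keeps a leading byte order mark as U+FEFF, so trim it.
            return strictUtf8.GetString(bytes).TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: decode with the encoding the caller expects.
            return fallback.GetString(bytes);
        }
    }

    public static string ReadFile(string path, Encoding fallback)
    {
        return Decode(File.ReadAllBytes(path), fallback);
    }
}
```

A call such as TextDecoder.ReadFile(filepath, Encoding.GetEncoding("gb2312")) would then keep the original behaviour for GB2312 files. The approach is a heuristic: a short GB2312 file can occasionally happen to be valid UTF-8 as well.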


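Note: each file format version is read through a different POI API.

Excel files: the xls format corresponds to the POI API HSSF; xlsx, the Office 2007 file format, corresponds to XSSF.

Word files: the doc format corresponds to the POI API HWPF; docx corresponds to XWPF.

PowerPoint files: the ppt format corresponds to the POI API HSLF; pptx corresponds to XSLF.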
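The mapping in the note above can be kept in one place as a small lookup table. This is only a sketch: the class PoiComponents and its method For are hypothetical names, and the table returns the component names as plain strings rather than calling POI.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical lookup: file extension -> name of the POI component that reads it.
public static class PoiComponents
{
    private static readonly Dictionary<string, string> ByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".xls", "HSSF" }, { ".xlsx", "XSSF" },
            { ".doc", "HWPF" }, { ".docx", "XWPF" },
            { ".ppt", "HSLF" }, { ".pptx", "XSLF" }
        };

    // Returns the component name, or null when the extension is not an Office format listed above.
    public static string For(string fileName)
    {
        string component;
        if (ByExtension.TryGetValue(Path.GetExtension(fileName), out component))
        {
            return component;
        }
        return null;
    }
}
```

The case-insensitive comparer plays the same role as the ToLower() call in textToreader, so "REPORT.XLSX" and "report.xlsx" resolve to the same component.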

 

III. The POITextExtractor class can read documents in the Office 2007 format and compatible later versions with a single method:

        /// <summary>
        /// Reads Word 2007, Excel 2003/2007 and PowerPoint 2003/2007 files
        /// </summary>
        /// <param name="filepath"></param>
        /// <returns></returns>
        public string ReadOfficeText(string filepath)
        {
            // docx, pptx, xlsx, ppt, xls
            FileInputStream fs = new FileInputStream(filepath);
            POITextExtractor extractor = ExtractorFactory.createExtractor(fs);
            string text = extractor.getText();
            return text;
        }

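For reasons I have not worked out, this method throws an error when reading Word 2003 files, so for now I keep using the Word 2003 method from section II above.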
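One way to combine the two approaches is to try the generic extractor first and fall back to the format-specific reader when it throws. The sketch below shows only that control flow, with both readers passed in as delegates. The class FallbackExtractor is a hypothetical name, and it assumes the failure surfaces as a .NET exception.

```csharp
using System;

// Hypothetical wrapper: try one extraction function, fall back to another on failure.
public static class FallbackExtractor
{
    public static string Extract(string path,
                                 Func<string, string> primary,
                                 Func<string, string> fallback)
    {
        try
        {
            return primary(path);
        }
        catch (Exception)
        {
            // The primary reader failed for this file; use the format-specific one.
            // Catching Exception broadly hides the cause, so log it in real code.
            return fallback(path);
        }
    }
}
```

With the methods from this post, a call could look like FallbackExtractor.Extract(filepath, ReadOfficeText, readWORD) for .doc files.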

 

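posted @ 2014-09-30 17:00  goodgirlmia
Author: goodgirlmia. Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a clearly visible link to the original article must be given on the page; otherwise the author reserves the right to pursue legal liability.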