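[Apache Tika] Reading file content with Apache Tika (compared with FileUtils)
Document type detection, content extraction, metadata extraction, language detection
MIME type detection: Tika can detect, and extract content from, all media types covered by the MIME standard.
Language detection: Tika includes language identification, so it can be used to sort documents by language on a multilingual website.
The diagram shows Tika's general-purpose parser classes, CompositeParser and AutoDetectParser. Because CompositeParser follows the composite design pattern, a group of parser instances can be used as if it were a single parser. CompositeParser also gives access to every class that implements the Parser interface.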
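To make the idea of type detection more concrete, here is a minimal sketch that guesses a media type from a file's leading "magic" bytes. It is not Tika's implementation: the class name MagicSniffer and the four signatures are assumptions chosen for illustration, and a real detector combines a far larger signature database with file name and metadata hints.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: guesses a media type from a file's leading bytes. */
final class MagicSniffer {
    // Signature bytes -> media type; a tiny subset chosen for illustration.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF}, "image/jpeg");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        SIGNATURES.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
    }

    /** Returns the first matching media type, or a generic fallback. */
    static String sniff(byte[] head) {
        for (Map.Entry<byte[], String> e : SIGNATURES.entrySet()) {
            if (startsWith(head, e.getKey())) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead)); // application/pdf
    }
}
```

Note that .docx, .xlsx and .jar files are all ZIP containers, so a sniffer this simple would report application/zip for each of them. Telling them apart requires looking inside the container, which is one reason to rely on a dedicated library.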
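The composite pattern described above can be sketched in plain Java. The interface and class names below (MiniParser, MiniCompositeParser, CompositeDemo) are hypothetical stand-ins, not Tika's API; the point is only that a composite implements the same interface as its members and delegates by media type.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for a parser interface. */
interface MiniParser {
    Set<String> supportedTypes();
    String parse(String mediaType, String input);
}

/** Composite: holds several parsers and itself behaves like one parser. */
final class MiniCompositeParser implements MiniParser {
    private final Map<String, MiniParser> byType = new HashMap<>();

    MiniCompositeParser(MiniParser... parsers) {
        // Register every member parser under each media type it supports.
        for (MiniParser p : parsers) {
            for (String type : p.supportedTypes()) {
                byType.put(type, p);
            }
        }
    }

    @Override
    public Set<String> supportedTypes() {
        return byType.keySet();
    }

    @Override
    public String parse(String mediaType, String input) {
        MiniParser delegate = byType.get(mediaType);
        if (delegate == null) {
            throw new IllegalArgumentException("No parser for " + mediaType);
        }
        return delegate.parse(mediaType, input);
    }
}

final class CompositeDemo {
    /** Toy leaf parser: trims plain text. */
    static MiniParser plain() {
        return new MiniParser() {
            public Set<String> supportedTypes() { return Collections.singleton("text/plain"); }
            public String parse(String mediaType, String input) { return input.trim(); }
        };
    }

    /** Toy leaf parser: turns commas into spaces. */
    static MiniParser csv() {
        return new MiniParser() {
            public Set<String> supportedTypes() { return Collections.singleton("text/csv"); }
            public String parse(String mediaType, String input) { return input.trim().replace(',', ' '); }
        };
    }

    public static void main(String[] args) {
        MiniParser parser = new MiniCompositeParser(plain(), csv());
        System.out.println(parser.parse("text/csv", "a,b,c")); // a b c
    }
}
```

Callers only ever see a single MiniParser, which is the property that lets a group of parsers be used as one.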
void parse( InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
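handler: the ContentHandler object that receives the sequence of XHTML SAX events produced by parsing the input document. It processes those events and exports the result in a particular form.
When parsing documents, Tika tries to reuse existing parsing libraries such as Apache POI or PDFBox, so most parser implementation classes are only adapters for those external libraries. Below, we look at how to use the handler and metadata parameters to extract a document's content and metadata. For convenience, we can also call the parser API through the Tika facade class.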
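To show what "receiving SAX events and exporting a result" means in practice, here is a small sketch using only the JDK's SAX classes. The TextCollector class is a hypothetical handler, not part of Tika; it keeps character data and adds a newline after each paragraph, which is roughly what a body-text handler does with the XHTML events a parser emits.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps character data and ends each <p> with a newline. */
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called (possibly several times) for the text between tags.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Feeds an XHTML string through a SAX parser and returns the collected text. */
    static String extract(String xhtml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello" and "Tika" on separate lines.
        System.out.print(extract("<html><body><p>Hello</p><p>Tika</p></body></html>"));
    }
}
```

Swapping in a different handler changes the export format without touching the parser, which is why the handler is a separate argument to parse.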
<!-- Tika parsers for extracting text content -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>

// Imports used by the facade examples:
// import java.io.File;
// import java.io.IOException;
// import org.apache.tika.Tika;
// import org.apache.tika.exception.TikaException;

/**
 * Detect a file's type.
 */
public static void test1() {
    File file = new File("G:/tikatest/test.mp4");
    Tika tika = new Tika();
    String filetype = null;
    try {
        filetype = tika.detect(file);
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(filetype);
}

/**
 * Read the content of a txt file.
 */
public static void test2() {
    File file = new File("G:/tikatest/test.txt");
    Tika tika = new Tika();
    String filecontent = null;
    try {
        filecontent = tika.parseToString(file);
    } catch (IOException | TikaException e) {
        e.printStackTrace();
    }
    System.out.println("Extracted Content: " + filecontent);
}
Extracted Content: 111
public static void test3() {
    File file = new File("G:/tikatest/test.txt");
    String s = null;
    try {
        // Pass the charset explicitly (UTF-8 is assumed here); the
        // single-argument overload relies on the platform default.
        s = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(s);
}

public static void test4() {
    File file = new File("G:/tikatest/test.mp4");
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // try-with-resources closes the stream and avoids handing a null
    // stream to the parser when the file cannot be opened.
    try (FileInputStream inputstream = new FileInputStream(file)) {
        parser.parse(inputstream, handler, metadata, context);
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    System.out.println(handler.toString());
    // Print every metadata element.
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
GPS Altitude Ref: Unknown (2)
Metering Mode: Center weighted average
Model: ONEPLUS A3010
meta:save-date: 2017-09-02T16:32:15
File Name: apache-tika-4154811460990247864.tmp
Exposure Mode: Auto exposure
Exif Version: 2.20
Sensing Method: One-chip color area sensor
tiff:ImageLength: 540
exif:Flash: false
Creation-Date: 2017-09-02T16:32:15
Interoperability Version: 1.00
ISO Speed Ratings: 640
X Resolution: 72 dots per inch
Shutter Speed Value: 1/20 sec
tiff:ImageWidth: 720
Thumbnail Width Pixels: 0
tiff:XResolution: 72.0
Image Width: 720 pixels
Last-Save-Date: 2017-09-02T16:32:15
exif:FNumber: 2.0
Number of Tables: 4 Huffman tables
F-Number: f/2.0
Color Space: sRGB
meta:creation-date: 2017-09-02T16:32:15
Resolution Units: inch
Data Precision: 8 bits
File Modified Date: 星期二 十月 16 22:15:54 +08:00 2018
tiff:BitsPerSample: 8
Last-Modified: 2017-09-02T16:32:15
tiff:YResolution: 72.0
YCbCr Positioning: Center of pixel array
Compression Type: Baseline
Components Configuration: YCbCr
exif:IsoSpeedRatings: 640
X-Parsed-By: org.apache.tika.parser.DefaultParser
Focal Length 35: 28 mm
modified: 2017-09-02T16:32:15
Brightness Value: 0
Thumbnail Offset: 874 bytes
Exif Image Height: 3480 pixels
Focal Length: 4.3 mm
Thumbnail Length: 14211 bytes
White Balance Mode: Auto white balance
Content-Type: image/jpeg
Make: OnePlus
tiff:Make: OnePlus
Date/Time Original: 2017:09:02 08:32:15
Scene Capture Type: Standard
Exif Image Width: 4640 pixels
Makernote: [26 values]
dcterms:created: 2017-09-02T16:32:15
exif:ExposureTime: 0.05
date: 2017-09-02T16:32:15
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:ResolutionUnit: Inch
Interoperability Index: Recommended Exif Interoperability Rules (ExifR98)
Flash: Flash did not fire, auto
Date/Time Digitized: 2017:09:02 08:32:15
File Size: 50158 bytes
Thumbnail Height Pixels: 0
Resolution Unit: Inch
Sub-Sec Time Original: 994455
XMP Value Count: 4
tiff:Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
Aperture Value: f/2.0
Number of Components: 3
dcterms:modified: 2017-09-02T16:32:15
tiff:Model: ONEPLUS A3010
Image Height: 540 pixels
Sub-Sec Time Digitized: 994455
Sub-Sec Time: 994455
Scene Type: Directly photographed image
Exposure Time: 0.05 sec
exif:DateTimeOriginal: 2017-09-02T16:32:15
exif:FocalLength: 4.26
Compression: JPEG (old-style)
FlashPix Version: 1.00
Date/Time: 2017:09:02 08:32:15
Exposure Program: Unknown (0)
Y Resolution: 72 dots per inch
public static void test6() {
    File file = new File("G:/tikatest/test.txt");
    // Arguments for the parse method
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // Parse the document; try-with-resources closes the stream and
    // avoids parsing a null stream when the file cannot be opened.
    try (FileInputStream content = new FileInputStream(file)) {
        parser.parse(content, handler, metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    // Identify the language of the extracted text.
    LanguageIdentifier object = new LanguageIdentifier(handler.toString());
    System.out.println("Language name :" + object.getLanguage());
}
Language name :lt
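The file used above contains only "111", so there is very little for a language detector to work with, and the "lt" result is probably not meaningful. To illustrate why, here is a toy character-trigram comparison in the spirit of n-gram language identification. It is a sketch under stated assumptions, not Tika's algorithm: the TrigramGuesser class, the scoring rule and the two training sentences are all made up for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy detector: compares character trigrams against small profiles. */
final class TrigramGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    /** Builds a trigram profile for a language from sample text. */
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    /** Counts every run of three consecutive characters. */
    static Map<String, Integer> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the language sharing the most trigrams, or "unknown" if none overlap. */
    String guess(String text) {
        Map<String, Integer> input = trigrams(text);
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> profile : profiles.entrySet()) {
            int score = 0;
            for (Map.Entry<String, Integer> gram : input.entrySet()) {
                if (profile.getValue().containsKey(gram.getKey())) {
                    score += gram.getValue();
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TrigramGuesser guesser = new TrigramGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        guesser.train("de", "der hund und die katze sitzen auf der matte und der fuchs springt");
        System.out.println(guesser.guess("the dog and the cat")); // en
        System.out.println(guesser.guess("111"));                 // unknown
    }
}
```

With a purely numeric input no trigram matches any profile, so this sketch reports "unknown". A detector that always returns its closest profile would instead produce an arbitrary language code for such input.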
public static void test7() throws IOException, TikaException, SAXException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // Parse the document with the PDF parser; the stream is closed automatically.
    PDFParser pdfparser = new PDFParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/4.pdf"))) {
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    }
    // Content of the document
    System.out.println("Contents of the PDF :" + handler.toString());
    // Metadata of the document
    System.out.println("Metadata of the PDF:");
    for (String name : metadata.names()) {
        System.out.println(name + " : " + metadata.get(name));
    }
}

Contents of the PDF : 个人简历 ...............................
Metadata of the PDF:
access_permission:extract_for_accessibility : true
pdf:docinfo:title : 个人简历
meta:save-date : 2018-06-12T07:41:54Z
pdf:docinfo:modified : 2018-06-12T07:41:54Z
dcterms:created : 2018-06-12T07:41:54Z
Author : liqiang qiao
date : 2018-06-12T07:41:54Z
access_permission:can_modify : true
access_permission:modify_annotations : true
creator : liqiang qiao
Creation-Date : 2018-06-12T07:41:54Z
title : 个人简历
meta:author : liqiang qiao
access_permission:fill_in_form : true
created : Tue Jun 12 15:41:54 CST 2018
pdf:docinfo:producer : Microsoft® Word 2013
dc:format : application/pdf; version=1.5
access_permission:can_print : true
pdf:docinfo:created : 2018-06-12T07:41:54Z
xmp:CreatorTool : Microsoft® Word 2013
Last-Save-Date : 2018-06-12T07:41:54Z
dc:title : 个人简历
access_permission:assemble_document : true
dcterms:modified : 2018-06-12T07:41:54Z
meta:creation-date : 2018-06-12T07:41:54Z
pdf:docinfo:creator : liqiang qiao
dc:creator : liqiang qiao
pdf:PDFVersion : 1.5
Last-Modified : 2018-06-12T07:41:54Z
modified : 2018-06-12T07:41:54Z
xmpTPg:NPages : 2
access_permission:can_print_degraded : true
pdf:encrypted : false
access_permission:extract_content : true
producer : Microsoft® Word 2013
pdf:docinfo:creator_tool : Microsoft® Word 2013
Content-Type : application/pdf

Extracting content and metadata from Microsoft Office documents.

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // OOXML parser for .docx and similar formats
    OOXMLParser msofficeparser = new OOXMLParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.docx"))) {
        msofficeparser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Contents of the document: -Xms5200M -Xmx5200M -XX:PermSize=512M -XX:MaxPermSize=512M
http_load使用教程: https://www.cnblogs.com/shijingjing07/p/6539179.html
Metadata of the document:
cp:revision: 19
meta:last-author: liqiang qiao
Last-Author: liqiang qiao
meta:save-date: 2017-12-14T10:25:00Z
Application-Name: Microsoft Office Word
Author: liqiang qiao
dcterms:created: 2017-12-14T09:28:00Z
Application-Version: 15.0000
Character-Count-With-Spaces: 195
date: 2017-12-14T10:25:00Z
Total-Time: 57
extended-properties:Template: Normal.dotm
meta:line-count: 1
creator: liqiang qiao
Word-Count: 29
meta:paragraph-count: 1
Creation-Date: 2017-12-14T09:28:00Z
extended-properties:AppVersion: 15.0000
meta:author: liqiang qiao
Line-Count: 1
extended-properties:Application: Microsoft Office Word
Paragraph-Count: 1
Last-Save-Date: 2017-12-14T10:25:00Z
Revision-Number: 19
dcterms:modified: 2017-12-14T10:25:00Z
meta:creation-date: 2017-12-14T09:28:00Z
Template: Normal.dotm
Page-Count: 1
meta:character-count: 167
dc:creator: liqiang qiao
meta:word-count: 29
Last-Modified: 2017-12-14T10:25:00Z
modified: 2017-12-14T10:25:00Z
xmpTPg:NPages: 1
extended-properties:TotalTime: 57
Character Count: 167
meta:page-count: 1
meta:character-count-with-spaces: 195
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // The same OOXML parser also handles .xlsx workbooks.
    OOXMLParser msofficeparser = new OOXMLParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/user.xlsx"))) {
        msofficeparser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
}
Contents of the document:Sheet1
序号 用户名字 用户电话 用户邮箱 用户账户 用户类型 密码
1 rrrrrr 15888585954 954318308@qq.com root111 管理员 111222
2 001 15898569856 qiao_liqiang@163.com 001 普通用户 111222
3 超级管理员 15898569856 5555@qq.com root8 管理员 111222
4 qqq 15898569856 qiao_liqiang@163.com 1231 普通用户 111222
5 张三 18558458569 33335658@qq.com 333 普通用户 111222
6 李四 15898569856 qiao_liqiang@163.com 4444 普通用户 111222
7 超级管理员 15898569856 5555@qq.com root5 管理员 111222
8 张三 18434391711 qiao_liqiang@163.com root7 管理员 111222
9 张三 18434391711 qiao_liqiang@163.com root3 管理员 111222
10 超管 15898569856 qiao_liqiang@163.com root 管理员 111222
11 8888 15898569856 qiao_liqiang@163.com 8888 普通用户 111222
12 超级管理员 15888585954 954318308@qq.com roo6 管理员 111222
13 张三 18434391711 qiao_liqiang@163.com root4 管理员 111222
public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // Plain-text parser
    TXTParser parser = new TXTParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.txt"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // HTML parser: returns the visible text, not the markup
    HtmlParser parser = new HtmlParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/index.html"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Contents of the document:
Welcome to nginx!
If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.
For online documentation and support please refer to
Commercial support is available at
Thank you for using nginx.
Metadata of the document:
title: Welcome to nginx!
Content-Encoding: ISO-8859-1
Content-Type: text/html; charset=ISO-8859-1
dc:title: Welcome to nginx!
public static void test8() throws IOException {
    // FileUtils returns the raw file content, markup included.
    // UTF-8 is assumed here; pass the file's actual charset.
    String s = FileUtils.readFileToString(new File("G:/tikatest/index.html"), StandardCharsets.UTF_8);
    System.out.println(s);
}

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // Parser for compiled Java .class files
    ClassParser parser = new ClassParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/UUIDUtil.class"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Contents of the document:package cn.xm.jwxt.utils;
public synchronized class UUIDUtil {
public void UUIDUtil();
public static String getUUID();
public static String getUUID2();
Metadata of the document:
title: UUIDUtil
resourceName: UUIDUtil.class
dc:title: UUIDUtil
public static void test8() throws TikaException, SAXException, IOException {
    // Raise the write limit to 10 * 1024 * 1024 characters; a jar can
    // easily exceed the default limit.
    BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // Parser for archive formats such as jar/zip
    PackageParser parser = new PackageParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/t.jar"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
package org.apache.tika.utils;
public synchronized class ServiceLoaderUtils {
public void ServiceLoaderUtils();
public static void sortLoadedClasses(java.util.List);
public static Object newInstance(String);
public static Object newInstance(String, ClassLoader);
package org.apache.tika.utils;
final synchronized class XMLReaderUtils$1 implements org.xml.sax.EntityResolver {
void XMLReaderUtils$1();
public org.xml.sax.InputSource resolveEntity(String, String) throws org.xml.sax.SAXException, java.io.IOException;
package org.apache.tika.utils;
final synchronized class XMLReaderUtils$2 implements javax.xml.stream.XMLResolver {
void XMLReaderUtils$2();
public Object resolveEntity(String, String, String, String) throws javax.xml.stream.XMLStreamException;
package org.apache.tika.utils;
public synchronized class XMLReaderUtils {
private static final java.util.logging.Logger LOG;
private static final org.xml.sax.EntityResolver IGNORING_SAX_ENTITY_RESOLVER;
private static final javax.xml.stream.XMLResolver IGNORING_STAX_ENTITY_RESOLVER;
public void XMLReaderUtils();
public static org.xml.sax.XMLReader getXMLReader() throws org.apache.tika.exception.TikaException;
public static javax.xml.parsers.SAXParser getSAXParser() throws org.apache.tika.exception.TikaException;
public static javax.xml.parsers.SAXParserFactory getSAXParserFactory();
public static javax.xml.parsers.DocumentBuilderFactory getDocumentBuilderFactory();
public static javax.xml.parsers.DocumentBuilder getDocumentBuilder() throws org.apache.tika.exception.TikaException;
public static javax.xml.stream.XMLInputFactory getXMLInputFactory();
private static void trySetSAXFeature(javax.xml.parsers.DocumentBuilderFactory, String, boolean);
private static void tryToSetStaxProperty(javax.xml.stream.XMLInputFactory, String, boolean);
public static javax.xml.transform.Transformer getTransformer() throws org.apache.tika.exception.TikaException;
static void <clinit>();
package org.apache.tika.utils;
abstract interface package-info {
Metadata of the document:
Content-Type: application/zip
Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
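A BodyContentHandler created with the no-argument constructor stops at 100,000 characters, which is what triggers the exception above on large documents. Passing a larger write limit to the constructor avoids it:
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);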
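The write-limit behaviour can be illustrated with a small handler built only on the JDK's SAX classes. LimitedTextHandler below is a hypothetical class, not Tika's implementation; it keeps text up to a limit and then signals an error, which mirrors the "text up to the limit is however available" note in the exception message. In Tika itself, a write limit of -1 passed to BodyContentHandler disables the limit entirely, at the cost of holding the whole text in memory.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that stops accepting text once a character limit is hit. */
final class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int limit; // a negative value means "no limit"

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (limit < 0 || text.length() + length <= limit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then report that the limit was reached.
        text.append(ch, start, limit - text.length());
        throw new SAXException("write limit of " + limit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Feeds one chunk of text and returns whatever the handler kept. */
    static String keep(int limit, String input) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            handler.characters(input.toCharArray(), 0, input.length());
        } catch (SAXException e) {
            // Limit reached: the text collected so far is still usable.
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep(5, "hello world"));  // hello
        System.out.println(keep(-1, "hello world")); // hello world
    }
}
```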
public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // JPEG parser: images have no text body, only metadata
    JpegParser jpegParser = new JpegParser();
    try (FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/5.jpeg"))) {
        jpegParser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Contents of the document:
Metadata of the document:
Number of Tables: 4 Huffman tables
Number of Components: 3
Image Height: 192 pixels
Resolution Units: inch
File Name: apache-tika-7234240523307196989.tmp
Data Precision: 8 bits
File Modified Date: 星期三 十月 17 21:43:39 +08:00 2018
tiff:BitsPerSample: 8
Compression Type: Baseline
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:ImageLength: 192
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
X Resolution: 96 dots
File Size: 9216 bytes
tiff:ImageWidth: 256
Thumbnail Height Pixels: 0
Thumbnail Width Pixels: 0
Image Width: 256 pixels
Y Resolution: 96 dots
public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // MP4 parser: video files yield metadata only
    MP4Parser mp4Parser = new MP4Parser();
    try (FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/test.mp4"))) {
        mp4Parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Contents of the document:
Metadata of the document:
dcterms:modified: 2017-07-20T10:25:23Z
xmpDM:duration: 39.5
meta:creation-date: 2017-07-20T10:25:23Z
meta:save-date: 2017-07-20T10:25:23Z
Last-Modified: 2017-07-20T10:25:23Z
dcterms:created: 2017-07-20T10:25:23Z
xmpDM:audioSampleRate: 10000
date: 2017-07-20T10:25:23Z
tiff:ImageLength: 578
modified: 2017-07-20T10:25:23Z
Creation-Date: 2017-07-20T10:25:23Z
tiff:ImageWidth: 442
Content-Type: video/mp4
Last-Save-Date: 2017-07-20T10:25:23Z