Post archive for July 17, 2013 - 刺猬的温驯

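Java Content Repository

Summary: The Java Content Repository API (JSR-170) aims to establish a standard API for accessing content repositories. If you are unfamiliar with content management systems (CMS), you may well wonder what a content repository is. Think of it as a data storage application for text and binary data (images, Word documents, PDFs, and so on). One notable feature is that you do not need to care where the data is actually stored: in a relational database, on a file system, or in XML. Beyond storing and reading data, most content repositories also provide more advanced features such as access control, search, versioning, and content locking. For some time the market has offered CMS products from various vendors, each built on its own...

posted @ 2013-07-17 19:18 刺猬的温驯 Views (665) Comments (0) Recommendations (0)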

Tutorials (14)

Summary: These tutorials are primarily aimed at developers who want to use one or more parts of Aperture. They should also be read by people who want to create custom implementations of Aperture's APIs. The following topics concern everyone wanting to use one or more Aperture components: General structure...

posted @ 2013-07-17 05:31 刺猬的温驯 Views (79) Comments (0) Recommendations (0)

Aperture in OSGi (13)

Summary: Using Aperture in OSGi requires a particular environment to be set up. You need to include RDF2Go, a triple store implementation with its driver and dependencies, the SLF4J Logging API and some logging implementation. It is all summarized in the diagram below. The Aperture bundles are generated automati...

posted @ 2013-07-17 05:27 刺猬的温驯 Views (182) Comments (0) Recommendations (0)

Full-text and Metadata Storage and Querying (12)

Summary: Crawling and extraction results typically need to be stored somewhere and be made queryable. We are still working on APIs and implementations that handle this aspect for Aperture. For the technically interested: this code will provide a Sesame Sail implementation that combines the use of a standard S...

posted @ 2013-07-17 05:26 刺猬的温驯 Views (151) Comments (0) Recommendations (0)

Extractors (11)

Summary: Extractors extract the full-text and/or metadata of a particular document type (one or more MIME types). They operate on an InputStream, optionally accompanied by a MIME type and/or a Charset to tune the processing, and produce a set of RDF statements describing the full-text and metadata. The extrac...

posted @ 2013-07-17 05:25 刺猬的温驯 Views (354) Comments (0) Recommendations (0)

MIME Type Identification (10)

Summary: One of the core tasks of Aperture is full-text and metadata extraction. To choose the right Extractor for a given document, one must first establish the document's MIME type. Therefore, we have designed a MimeTypeIdentifier API that fulfills this task and developed a general purpose implementati...

posted @ 2013-07-17 05:22 刺猬的温驯 Views (105) Comments (0) Recommendations (0)

SSL Support (9)

Summary: In order to communicate with certain data sources, some sort of authentication and encryption may be needed. SSL is a technology that offers both types of security. It can, for example, be used to secure IMAP and HTTP communication. We provide several classes that can help establish communication over...

posted @ 2013-07-17 05:20 刺猬的温驯 Views (309) Comments (0) Recommendations (0)

LinkExtractors (8)

Summary: The LinkExtractor interface defines a service for extracting hypertext links and other references to external resources from a document. Although it is primarily meant to be used by the WebCrawler class, it may also be of use to other applications and has therefore been kept separate from W...

posted @ 2013-07-17 05:15 刺猬的温驯 Views (148) Comments (0) Recommendations (0)

Crawlers (7)

Summary: A Crawler is responsible for accessing the contents of a DataSource and reporting the individual resources in it as DataObjects. Examples are FileSystemCrawler, WebCrawler, ImapCrawler. Each produced DataObject contains all the metadata that can be provided by that source type, such as a file name,...

posted @ 2013-07-17 05:14 刺猬的温驯 Views (151) Comments (0) Recommendations (0)

DataObjects and DataAccessors (6)

Summary: DataObject: A DataObject represents an individual resource found in a physical data source, such as a file, a web page, a mail or an attachment. It contains the identifier, a reference to the DataSourceObject from which it has been created, and RDF metadata. It may also contain other arbitrary resourc...

posted @ 2013-07-17 05:12 刺猬的温驯 Views (254) Comments (0) Recommendations (0)

DataSources (5)

Summary: One of the central concepts of Aperture is the notion of a DataSource. A DataSource contains all information necessary to locate the individual information resources in a physical source. For example, a FileSystemDataSource holds a root directory, a set of patterns that describe what files to includ...

posted @ 2013-07-17 05:09 刺猬的温驯 Views (141) Comments (0) Recommendations (0)

The Use of RDF2Go in Aperture (4)

Summary: Aperture is built on top of RDF2Go, an abstraction layer that allows Aperture to work easily with popular RDF stores such as Sesame and Jena. For a complete list of supported RDF stores and detailed documentation, see the RDF2Go homepage at http://wiki.ontoworld.org/wiki/RDF2Go. This guide is not intended to...

posted @ 2013-07-17 05:07 刺猬的温驯 Views (149) Comments (0) Recommendations (0)

The Use of RDF in Aperture (3)

Summary: Aperture makes heavy use of RDF graphs to communicate information between components. For example, Extractors return the full-text and metadata they extract as an RDF model and Crawlers do the same for the source-specific content and metadata they obtain through crawling. The rationale for using RDF...

posted @ 2013-07-17 05:05 刺猬的温驯 Views (210) Comments (0) Recommendations (0)

Quickstart (2)

Summary: This guide is dedicated to all those who would simply like to do what Aperture is for, that is, to crawl a filesystem and extract everything there is to be extracted: file metadata and contents. All of this can be accomplished in a single class. This class is available in the src/examples folder. It is...

posted @ 2013-07-17 05:01 刺猬的温驯 Views (157) Comments (0) Recommendations (0)

General Structure of Aperture Components (1)

Summary: Aperture consists of a number of APIs each fulfilling a different type of service, e.g. text and metadata extraction, crawling data in a data source or identifying a file's MIME type. The code involved in implementing and managing such a service is typically organized in a particular way. Once y...

posted @ 2013-07-17 04:57 刺猬的温驯 Views (164) Comments (0) Recommendations (0)

Web Data Mining (16): Aperture Data Extraction (9): Data Sources

Summary: One of the central concepts of Aperture is the notion of a DataSource. A DataSource contains all information necessary to locate the individual inform...

posted @ 2013-07-17 04:24 刺猬的温驯 Views (380) Comments (0) Recommendations (0)

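The gentleman learns broadly and examines himself daily, so that his understanding is clear and his conduct is without fault.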
