This guide is dedicated to all those who would simply like to do what Aperture is for, that is, to crawl a filesystem and extract everything there is to be extracted: file metadata and contents. All of this can be accomplished in a single class, available in the src/examples folder and called TutorialCrawlingExample.

Basically, in order to extract some RDF information from a data source we need a... well, a data source, that is, an instance of a DataSource class. DataSources come in many flavours (see DataSources). The one we'll be interested in is a FileSystemDataSource. The configuration of a data source is stored in an RDFContainer, an interface that makes access to the RDF store easy; it works more or less like a hash map. We can operate on it directly or through a convenience class called ConfigurationUtil (for details see RDF usage in Aperture). We'll choose the second approach; the configuration of a data source then boils down to five lines of code:

1 Model model = RDF2Go.getModelFactory().createModel();
2 RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"));
3 ConfigurationUtil.setRootFolder(rootFile.getAbsolutePath(), configuration);
4 DataSource source = new FileSystemDataSource();
5 source.setConfiguration(configuration);
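
Just for illustration, the direct approach would look roughly like the sketch below. The property URI is a made-up placeholder, not the real one (the actual rootFolder property is defined in Aperture's data source vocabulary, and hiding such details is precisely what ConfigurationUtil is for), and the sketch assumes RDFContainer's hash-map-like put(property, value) accessor:

// hypothetical sketch of bypassing ConfigurationUtil; the property
// URI below is a placeholder, NOT the real Aperture vocabulary term
URI rootFolderProperty = new URIImpl("http://example.org/rootFolderProperty");
configuration.put(rootFolderProperty, rootFile.getAbsolutePath());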

This snippet does the following things:

  1. Instantiate an RDF2Go model (for more details about RDF2Go, see here).
  2. Instantiate the RDFContainer using the available implementation. The given URI is more or less irrelevant in this case, provided it is syntactically correct.
  3. Set the rootFolder option to the absolute path of rootFile (a java.io.File instance).
  4. Instantiate the DataSource itself as a FileSystemDataSource.
  5. Set the configuration of the data source to the one from the provided container.

The second stage, setting up and firing the crawler, is done in another five lines:

1 FileSystemCrawler crawler = new FileSystemCrawler();
2 crawler.setDataSource(source);
3 crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
4 crawler.setCrawlerHandler(new TutorialCrawlerHandler());
5 crawler.crawl();

This piece of code does the following:

  1. Instantiate a crawler.
  2. Set the crawler to crawl this particular DataSource.
  3. Part of Aperture magic: if you're interested, see DataObjects and DataAccessors for details.
  4. Set the object that will be notified of new DataObjects. This is the part we have to provide ourselves, since we are the ones who know best what to do with the data :-). See below.
  5. Fire the crawling. It might as well be done in a separate thread, as sketched right after this list.
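
Running the crawl in a background thread involves no Aperture-specific API, just plain Java. A minimal sketch (the crawler reference is made final so it is visible inside the anonymous Runnable):

// fire the crawl on a background thread so the caller stays responsive
void crawlInBackground(final FileSystemCrawler crawler) {
    Thread crawlThread = new Thread(new Runnable() {
        public void run() {
            crawler.crawl();
        }
    }, "aperture-crawler");
    crawlThread.start();
    // the handler's crawlStopped method is invoked when crawling ends
}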

The crawler handler is actually very simple. Aperture provides a class called CrawlerHandlerBase that encapsulates the default methods. Note that it is not available in the Aperture jar itself; you need the examples jar file to use it. The simplest use case of a crawler needs only five methods to be provided. They are summarized in the snippets below:

 1 private class TutorialCrawlerHandler extends CrawlerHandlerBase {
 2 
 3     // our 'persistent' modelSet
 4     private ModelSet modelSet;

Constructor - initializes the underlying modelSet, the RDF store that will contain all generated RDF statements. In this example we use the default createModelSet() method, which creates a model set backed by an in-memory repository with no inference. We could just as well use a persistent model, whose content is stored in a file or in a relational database.

 6     public TutorialCrawlerHandler() throws ModelException {
 7         modelSet = RDF2Go.getModelFactory().createModelSet();
 8     }

crawlStopped - the method called by the crawler when it has finished the crawling process. At that point the repository will contain all data that has been extracted from the file system, that is, the file metadata (names, sizes, dates of last modification etc.) and contents (extracted from files that have been recognized as being of one of the supported file types; see Extractors for details on this process). Don't forget to close the modelSet after you're done with it (line 18).

10     public void crawlStopped(Crawler crawler, ExitCode exitCode) {
11         try {
12             modelSet.writeTo(System.out, Syntax.Trix);
13         }
14         catch (Exception e) {
15             throw new RuntimeException(e);
16         }
17 
18         modelSet.close();
19     }
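
As noted in the constructor's description, you are not tied to dumping the results to the console: the same writeTo call accepts any output stream. Below is a minimal sketch of a crawlStopped variant that serializes the statements to a file instead (the file name is made up for the example, and java.io.FileOutputStream needs to be imported):

public void crawlStopped(Crawler crawler, ExitCode exitCode) {
    try {
        // serialize all gathered statements to a file instead of System.out
        OutputStream out = new FileOutputStream("crawl-results.trix");
        try {
            modelSet.writeTo(out, Syntax.Trix);
        }
        finally {
            out.close();
        }
    }
    catch (Exception e) {
        throw new RuntimeException(e);
    }
    modelSet.close();
}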

getRDFContainer - every time a new data object (in this case a file) is encountered, the crawler has to store the RDF data in some RDF container, so it asks the handler to provide one. This approach gives us some flexibility. In this particular program we use it to make every container a fresh one, backed by a new, empty in-memory model. As such, the information about different DataObjects stays nicely divided: they won't interfere with each other, and we can decide for ourselves what to do with each DataObject.

21     public RDFContainer getRDFContainer(URI uri) {
22         // we create a new in-memory temporary model for each data source
23         Model model = RDF2Go.getModelFactory().createModel(uri);
24         // note that the model is opened when passed to an rdfcontainer
25         return new RDFContainerImpl(model, uri);
26     }

objectNew - now we see the power of Aperture. This is the most important method in every application that uses Aperture; it is called by the crawler whenever a new data object is found. For applications that don't keep information about previous crawls, and are thus unable to tell whether an object has been encountered before, this will be the only method that really matters. In this example we simply move the metadata from the data object (backed by the in-memory model we created in the getRDFContainer method) to our 'persistent' modelSet. We could just as well do anything we like with the data: analyze it in any way, show it to the user, serialize it to a file, feed it to Lucene for later searching. The sky is the limit.

Note that the processBinary method from CrawlerHandlerBase is used. It tries to find and use an extractor to augment the metadata provided by the crawler itself. Its implementation has been omitted from this example, but the reader is heartily advised to acquaint him- or herself with it, since using extractors is a common task for all Aperture users; a rough sketch of what it does is given after the next snippet.

28     public void objectNew(Crawler crawler, DataObject object) {
29         // first we try to extract the information from the binary file
30         processBinary(object);
31         // then we add this information to our persistent model
32         modelSet.addModel(object.getMetadata().getModel());
33         // don't forget to dispose of the DataObject
34         object.dispose();
35     }
36 
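
For the curious, here is a rough sketch of what processBinary does under the hood. This is not the actual CrawlerHandlerBase source; the names used (DefaultExtractorRegistry, MagicMimeTypeIdentifier, IOUtil.readBytes) come from Aperture's extractor API, but details such as buffering, charset detection and error handling will differ in the real implementation:

// simplified sketch, NOT the real CrawlerHandlerBase implementation
private ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();
private MimeTypeIdentifier mimeTypeIdentifier = new MagicMimeTypeIdentifier();

protected void processBinary(DataObject object) throws Exception {
    if (!(object instanceof FileDataObject)) {
        return; // only file contents can be extracted
    }
    InputStream content = ((FileDataObject) object).getContent();
    if (!content.markSupported()) {
        content = new BufferedInputStream(content);
    }
    // read just enough bytes for magic-number-based MIME type detection
    int minimumLength = mimeTypeIdentifier.getMinArrayLength();
    content.mark(minimumLength + 1);
    byte[] firstBytes = IOUtil.readBytes(content, minimumLength);
    String mimeType = mimeTypeIdentifier.identify(firstBytes, null, object.getID());
    content.reset();
    if (mimeType == null) {
        return; // unknown file type, keep the crawler's metadata as-is
    }
    Set factories = extractorRegistry.get(mimeType);
    if (factories != null && !factories.isEmpty()) {
        // let the first applicable extractor add full-text and other
        // statements to the DataObject's metadata container
        ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
        Extractor extractor = factory.get();
        extractor.extract(object.getID(), content, null, mimeType, object.getMetadata());
    }
}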

objectChanged - this method is a variant of the previous one. It is used by crawlers that keep track of their crawling history and can therefore distinguish objects that have been encountered before from new ones. See Incremental Crawling for details.

37     public void objectChanged(Crawler crawler, DataObject object) {
38         // first we remove old information about the data object
39         modelSet.removeModel(object.getID());
40         // then we try to extract metadata and fulltext from the file
41         processBinary(object);
42         // and then we add the information from the temporary model to our
43         // 'persistent' model
44         modelSet.addModel(object.getMetadata().getModel());
45         // don't forget to dispose of the DataObject
46         object.dispose();
47     }
48 

objectRemoved - lastly, another method called by crawlers that use the incremental crawling facilities. It is invoked whenever the crawler finds out that a data object has been deleted from the data source (e.g. a file was deleted), and it lets us update the 'persistent' RDF store to reflect the deletion.

49     public void objectRemoved(Crawler crawler, URI uri) {
50         // we simply remove the information
51         modelSet.removeModel(uri);
52     }
53 }

If this short demonstration got you interested, see the entire working example in org.semanticdesktop.aperture.examples.TutorialCrawlingExample. There are numerous other examples in that package as well. Apart from the examples, there is still plenty to read in the rest of this documentation. Enjoy Aperture!
