Automatically Scraping RDF

Posted on 2005-02-26 16:36 by idior

Using DC-dot to Generate DC RDF

Much about a document can be derived directly from the document itself. The format, location, subject, author, copyright, and similar details can all be obtained by scraping the HTML of a particular web resource, including its meta tags.
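As a rough illustration of this kind of scraping, here is a minimal Python sketch that collects the title and the name/content pairs of meta tags from a page using only the standard library. It is not how DC-dot itself works; the class name MetaScraper, the function scrape_metadata, and the sample HTML are all hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class MetaScraper(HTMLParser):
    """Collect <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Meta names are case-insensitive, so normalize to lowercase.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_metadata(html):
    """Return a dict of metadata found in an HTML string (hypothetical helper)."""
    parser = MetaScraper()
    parser.feed(html)
    parser.close()
    found = dict(parser.meta)
    if parser.title.strip():
        found["title"] = parser.title.strip()
    return found


# Hypothetical sample page, not the real article markup.
SAMPLE_HTML = """<html><head>
<title>Tale of Two Monsters: Architeuthis Dux</title>
<meta name="author" content="Shelley Powers">
<meta name="description" content="The Giant Squid and its relationship to mythology.">
</head><body></body></html>"""

if __name__ == "__main__":
    for key, value in sorted(scrape_metadata(SAMPLE_HTML).items()):
        print(f"{key}: {value}")
```

A real scraper would also need to fetch the page and look at HTTP headers (for the format and size, for instance); this sketch only covers the part that reads the markup.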

Based on this, an organization going by the abbreviation UKOLN, at the University of Bath in the UK, created the DC-dot generator. This online application will scrape a web resource, pull whatever information it can from it, and then return the result formatted in multiple ways, including RDF, XHTML meta tags, and straight XML.

Access DC-dot at http://www.ukoln.ac.uk/metadata/dcdot/.

I decided to try this with the sample "Tale of Two Monsters" article. In the first page of the application, I entered the URL for the document, and checked both boxes to have the tool attempt to determine publisher and return RDF. The page returned has a first guess at the RDF/XML and provides a form that you can then use to modify the DC elements generated. Figure 6-4 displays the form you can use to modify the results.

Figure 6-4. DC-dot form to modify results

With some modifications, the DC RDF/XML document generated is shown in Example 6-8.

Example 6-8. DC-dot-generated RDF/XML

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF SYSTEM "http://purl.org/dc/schemas/dcmes-xml-20000714.dtd">

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="http://burningbird.net/articles/monsters3.htm">
    <dc:title>
      Tale of Two Monsters: Architeuthis Dux
    </dc:title>
    <dc:creator>
      Shelley Powers
    </dc:creator>
    <dc:subject>
      Internet; Web; Computers; Software; Technology;
      Meteorology; Geology; Oceanography; Astronomy; Math;
      Science; Physics; P2P
    </dc:subject>
    <dc:description>
      The Giant Squid and its relationship to mythology.
    </dc:description>
    <dc:publisher>
      Burningbird
    </dc:publisher>
    <dc:date>
      2002-01-20
    </dc:date>
    <dc:type>
      Text
    </dc:type>
    <dc:format>
      text/html
    </dc:format>
    <dc:format>
      8287 bytes
    </dc:format>
  </rdf:Description>
</rdf:RDF>

The generated RDF/XML validates with the RDF Validator, with one exception: the generator uses an unqualified about attribute on the rdf:Description element (boldfaced in the original example). Unqualified attributes are allowed for existing vocabularies but discouraged in new vocabularies and RDF/XML instances. This is a quick change to make: qualify the attribute as rdf:about.
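If you would rather not make that change by hand, the following Python sketch shows one possible way to do it with the standard xml.etree.ElementTree module. The function name qualify_about and the shortened sample document are hypothetical; the sketch assumes the unqualified attribute appears only on rdf:Description elements, as it does in Example 6-8.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def qualify_about(rdf_xml):
    """Return rdf_xml with unqualified about attributes renamed to rdf:about."""
    # Register the prefixes so the output uses rdf: and dc: rather than ns0:, ns1:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        if "about" in desc.attrib:
            # Move the value from the unqualified name to the qualified one.
            desc.set(f"{{{RDF_NS}}}about", desc.attrib.pop("about"))
    return ET.tostring(root, encoding="unicode")


# Shortened, hypothetical stand-in for Example 6-8.
SAMPLE = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="http://burningbird.net/articles/monsters3.htm">
    <dc:title>Tale of Two Monsters: Architeuthis Dux</dc:title>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    print(qualify_about(SAMPLE))
```

Note that ElementTree does not carry the XML declaration or the DOCTYPE line through to the output, so those would have to be added back if you need them.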

Now that you've had a chance to try out RDF/XML, it's time to try out a few of the many, many tools and utilities and APIs that have been created specifically for processing RDF/XML.
