hbase-writer
http://code.google.com/p/hbase-writer/
What is HBase-Writer?
HBase-Writer is an extension to Heritrix, the open source crawler written by the Internet Archive (http://crawler.archive.org/), that enables it to store crawled content directly in HBase tables (http://hbase.org/) running on the Hadoop Distributed File System (http://hadoop.apache.org/core/). HBase-Writer writes crawled content into a given HBase table as individual records, each identified by a rowkey. Because the content lives in HBase, these tables can be processed directly by the MapReduce framework via HBase and Hadoop. The goal of HBase-Writer is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.
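To make the "one record per rowkey" idea concrete, here is a minimal sketch in Python. It does not use the HBase client or any HBase-Writer code: the table is an in-memory stand-in, and the keying scheme (reversed host name plus path), the column names `content:raw_data` and `curi:url`, and the helper names are all assumptions made for illustration, not the project's actual schema.

```python
from urllib.parse import urlsplit


def url_to_rowkey(url):
    """Hypothetical keying scheme: reverse the host labels so that
    pages from the same domain sort next to each other."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path


class InMemoryTable:
    """Stand-in for an HBase table: maps rowkey -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, rowkey, columns):
        # Merge the new cells into the row, creating the row if needed.
        self.rows.setdefault(rowkey, {}).update(columns)

    def scan(self, prefix=""):
        # Yield rows in sorted rowkey order, optionally filtered by prefix.
        for key in sorted(self.rows):
            if key.startswith(prefix):
                yield key, self.rows[key]


def store_crawled_page(table, url, body):
    """Write one crawled page as one record (column names are assumed)."""
    rowkey = url_to_rowkey(url)
    table.put(rowkey, {
        "content:raw_data": body,
        "curi:url": url.encode("utf-8"),
    })
    return rowkey
```

With this sketch, `store_crawled_page(table, "http://crawler.archive.org/index.html", b"<html>...</html>")` stores the page under the key `org.archive.crawler/index.html`, and `table.scan("org.archive")` returns every stored page from that domain in key order. A sorted, prefix-scannable layout of this kind is what lets a batch job read a whole crawl or a single site from the table.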
News
March 29th, 2010
HBase-Writer 0.9-SNAPSHOT has been released. This version is compatible with both Heritrix 2.X and Heritrix 3.X. Many thanks to Greg Lu for spearheading this effort and sending in the initial patch. Once Heritrix has an official 3.0.0-RELEASE, HBase-Writer will release version 0.9-RELEASE. Thanks again, Greg!