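Using code to inspect the SequenceFile data that Nutch generates after crawling a site
To read a data file, you must use the class that matches the value type stored in it. (For this example, the data file was copied to the root of the D: drive on a local Windows machine.)
Code: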
package cn.summerchill.nutch;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

/**
 * Reads a SequenceFile generated by Nutch.
 * @author Administrator
 */
public class SeFileReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dataPath = new Path("D:\\data");
        FileSystem fs = dataPath.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataPath, conf);
        Text key = new Text();
        // The value object must match the value type stored in the data file.
        // Swap in one of the commented-out alternatives for other data files.
        CrawlDatum value = new CrawlDatum();
        //Content value = new Content();
        //Inlinks value = new Inlinks();
        //ParseText value = new ParseText();
        //ParseData value = new ParseData();
        while (reader.next(key, value)) {
            System.out.println("key->\n" + key);
            // The value is written to stderr, so it can interleave with stdout on the console.
            System.err.println("value->\n" + value);
            try {
                // Pause for one second between records.
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.out.println("=======================================");
        }
        reader.close();
    }
}
Output:
key->
http://bbs.superwu.cn/
value->
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Nov 08 08:31:30 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.6153846
Signature: 22defcd7cb4e7b1dc8a16a0a2f339ecb
Metadata:
    Content-Type=application/xhtml+xml
    _pst_=success(1), lastModified=0
    _rs_=610
=======================================
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.23076925
Signature: null
Metadata:
key->
http://bbs.superwu.cn/archiver/
=======================================
key->
http://bbs.superwu.cn/forum.php
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.15384616
Signature: null
Metadata:
=======================================

In the second record the value appears before its key. The code prints the key to System.out and the value to System.err, and the console does not keep the two streams in order.
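The value class in the reader above has to be chosen to match the data file, so it helps to check which class the file actually declares. A SequenceFile begins with a small header that records the key and value class names. The sketch below reads that header with plain Java I/O and needs no Hadoop jars. The class name SequenceFileHeaderPeek and its methods are hypothetical helpers written for illustration. It assumes the header layout used from SequenceFile version 4 onward (the bytes SEQ, a version byte, then the two class names as length-prefixed UTF-8 strings) and handles only class names up to 127 bytes. When Hadoop is on the classpath, SequenceFile.Reader also exposes getKeyClassName() and getValueClassName(), which is the more direct route.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: prints the key and value class names stored in a SequenceFile header. */
class SequenceFileHeaderPeek {

    /** Returns {keyClassName, valueClassName} read from the start of the stream. */
    static String[] readClassNames(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: missing SEQ magic bytes");
        }
        int version = in.readUnsignedByte();
        // Assumption: only the Text-encoded header (version 4 and later) is handled.
        if (version < 4) {
            throw new IOException("Unsupported SequenceFile version: " + version);
        }
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        return new String[] {keyClass, valueClass};
    }

    /** Reads a string stored as a one-byte length followed by UTF-8 bytes. */
    private static String readShortString(DataInputStream in) throws IOException {
        int length = in.readByte();
        // Assumption: lengths up to 127 fit in a single byte; longer names use a
        // multi-byte length encoding that this sketch does not decode.
        if (length < 0) {
            throw new IOException("Class names longer than 127 bytes are not supported here");
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the same location as the reader above; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "D:\\data";
        try (InputStream in = new FileInputStream(path)) {
            String[] names = readClassNames(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

For the data file read above, the reported value class should be org.apache.nutch.crawl.CrawlDatum, since that is the class that decoded the records successfully. A file holding a different type reports its own class name, which tells you which of the commented-out declarations to enable.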
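Author: SummerChill
Source: http://www.cnblogs.com/DreamDrive/
This blog collects my own notes along with reposts of technical articles I have found online. If you spot any errors in this post, please point them out so that other readers are not misled.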