Nutch 2.1 source code analysis

Nutch 2.1: developing crawls that store each site in its own table (table sharding)

1. How the Nutch source reads the storage.schema.webpage property from nutch-default.xml (default value: webpage):

Class: org.apache.nutch.storage.StorageUtils

@SuppressWarnings("unchecked")
  public static <K, V extends Persistent> DataStore<K, V> createWebStore(Configuration conf,
      Class<K> keyClass, Class<V> persistentClass) throws ClassNotFoundException, GoraException {
    
    String schema = null;
    if (WebPage.class.equals(persistentClass)) {
      schema = conf.get("storage.schema.webpage", "webpage");
    } else if (Host.class.equals(persistentClass)) {
      schema = conf.get("storage.schema.host", "host");
    } else {
      throw new UnsupportedOperationException("Unable to create store for class " + persistentClass);
    }
    
    String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
    
    if (!crawlId.isEmpty()) {
      conf.set("schema.prefix", crawlId + "_");
    } else {
      conf.set("schema.prefix", "");
    }

    Class<? extends DataStore<K, V>> dataStoreClass =
      (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
    return DataStoreFactory.createDataStore(dataStoreClass,
            keyClass, persistentClass, conf, schema);
  }
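The lookup logic above can be tried out without Hadoop or Gora. The following is a minimal sketch, not the Nutch implementation: a plain Map stands in for Hadoop's Configuration, the class SchemaNameSketch and its methods are hypothetical, and the key "storage.crawl.id" is assumed to be the value of Nutch.CRAWL_ID_KEY (check the constant in your Nutch version). How Gora later combines schema.prefix with the schema name is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaNameSketch {
    // Assumed value of Nutch.CRAWL_ID_KEY; verify against your Nutch version.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    /** Mirrors the schema lookup: the configured name, or the built-in default. */
    static String schemaFor(Map<String, String> conf, String kind) {
        if ("webpage".equals(kind)) {
            return conf.getOrDefault("storage.schema.webpage", "webpage");
        } else if ("host".equals(kind)) {
            return conf.getOrDefault("storage.schema.host", "host");
        }
        throw new UnsupportedOperationException("Unable to create store for " + kind);
    }

    /** Mirrors the prefix logic: crawlId + "_" when a crawl id is set, otherwise "". */
    static String prefixFor(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Nothing configured: empty prefix, default schema name.
        System.out.println("[" + prefixFor(conf) + "] " + schemaFor(conf, "webpage"));

        // A crawl id and an overridden schema name.
        conf.put(CRAWL_ID_KEY, "39");
        conf.put("storage.schema.webpage", "webpage_39");
        System.out.println("[" + prefixFor(conf) + "] " + schemaFor(conf, "webpage"));
    }
}
```

Overriding storage.schema.webpage is what makes a per-site table name possible; the crawl id only influences the schema.prefix value.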

2. Restricting the crawl to specific URLs:

Append to the end of regex-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
Append to the end of automaton-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
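To see which URLs such a rule accepts, the pattern can be tried with java.util.regex. This is a small sketch under assumptions: the class UrlFilterSketch is hypothetical, the leading + (which only marks the rule as "accept") is dropped, and the rule is assumed to be applied with find() rather than a full match. As written, the rule requires exactly one subdomain label; the example rule in the Nutch tutorial has a * after the group, which also accepts the bare domain and nested subdomains, so that variant is shown for comparison.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The rule exactly as given in the text (without the leading '+').
    static final Pattern AS_WRITTEN =
        Pattern.compile("^http://([a-z0-9]*\\.)zhangzhou.gov.cn/");

    // Variant with '*' after the group: zero or more subdomain labels.
    static final Pattern WITH_STAR =
        Pattern.compile("^http://([a-z0-9]*\\.)*zhangzhou.gov.cn/");

    /** Returns true if the rule matches somewhere in the URL (find semantics). */
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.zhangzhou.gov.cn/index.html",
            "http://zhangzhou.gov.cn/",
            "http://a.b.zhangzhou.gov.cn/",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(url
                + "  as written: " + accepts(AS_WRITTEN, url)
                + ", with '*': " + accepts(WITH_STAR, url));
        }
    }
}
```

Rules in these filter files are checked in order and the first matching rule decides, so a rule appended after a catch-all line such as +. would never be reached.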

3. Configuration files that must be modified for table sharding:

gora-sql-mapping.xml
nutch-default.xml

4. Table-naming rule for sharding

Name of the storage table + "_" + id of the crawled site (for example, webpage_39).
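The naming rule can be written as a one-line helper. This is only an illustrative sketch; the class ShardTableName and the method tableFor are hypothetical names, and the site id is assumed to be a non-negative integer.

```java
public class ShardTableName {
    /** Builds the per-site table name: base table name + "_" + site id. */
    static String tableFor(String baseTable, int siteId) {
        if (baseTable == null || baseTable.isEmpty()) {
            throw new IllegalArgumentException("baseTable must not be empty");
        }
        if (siteId < 0) {
            throw new IllegalArgumentException("siteId must not be negative");
        }
        return baseTable + "_" + siteId;
    }

    public static void main(String[] args) {
        // Site 39 stored in the "webpage" family of tables.
        System.out.println(tableFor("webpage", 39));
    }
}
```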

5. The crawl may die while running, with the exception message: Output path is null in cleanup

Workaround: wrap the code that throws in org.apache.nutch.indexer.elastic.ElasticWriter in a try/catch, as follows:
  private void processExecute(boolean createNewBulk) {
    // The whole body is wrapped in try/catch so that a failed bulk request
    // is logged instead of killing the job. Note that this also swallows
    // the RuntimeException thrown below for failed bulk items.
    try {
      if (execute != null) {
        // wait for previous to finish
        long beforeWait = System.currentTimeMillis();
        BulkResponse actionGet = execute.actionGet();
        if (actionGet.hasFailures()) {
          for (BulkItemResponse item : actionGet) {
            if (item.failed()) {
              throw new RuntimeException("First failure in bulk: "
                  + item.getFailureMessage());
            }
          }
        }
        long msWaited = System.currentTimeMillis() - beforeWait;
        LOG.info("Previous took in ms " + actionGet.getTookInMillis()
            + ", including wait " + msWaited);
        execute = null;
      }
      if (bulk != null) {
        if (bulkDocs > 0) {
          // start a flush, note that this is an asynchronous call
          execute = bulk.execute();
        }
        bulk = null;
      }
      if (createNewBulk) {
        // Prepare a new bulk request
        bulk = client.prepareBulk();
        bulkDocs = 0;
        bulkLength = 0;
      }
    } catch (Exception e) {
      // Log and continue rather than aborting the crawl.
      LOG.error("Bulk indexing failed", e);
    }
  }

6. Workaround for occasional garbled Chinese characters when saving: catch the exception with try/catch

Class: com.suncco.leadsite.utils.NutchJob
  @Override
  public boolean waitForCompletion(boolean verbose) {
    boolean succeeded = true;
    try {
      succeeded = super.waitForCompletion(verbose);
      if (!succeeded) {
        // check if we want to fail whenever a job fails. (expert setting)
        if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
          // Only warn here, so that a failed job does not abort the crawl.
          Log.warn("job failed: " + "name=" + getJobName()
              + ", jobid=" + getJobID());
        }
      }
    } catch (Exception e) {
      // Swallow the exception so that the crawl keeps running.
      e.printStackTrace();
    }
    return succeeded;
  }

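Writing the trigger:

A trigger must not modify (for example, delete from) the table it is defined on; otherwise an error like the one shown after the trigger code is raised: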

Trigger code:
BEGIN
  SET @isLocalCount = (select count(*) from webpage_39 where id = NEW.id);
  IF (@isLocalCount > 0) THEN
    delete from webpage_39 where id = NEW.id;
  END IF;
  SET @count = (select count(*) from suncco_spider.webpage where id = NEW.id);
  IF (@count = 0 AND NEW.id like '%fj.fj%') THEN
    insert into suncco_spider.webpage
      (id, baseUrl, status, prevFetchTime, fetchTime, fetchInterval,
       retriesSinceFetch, reprUrl, content, typ, protocolStatus,
       modifiedTime, title, text, parseStatus, signature, prevSignature,
       score, headers, inlinks, outlinks, metadata, markers, isDelete)
    values
      (NEW.id, NEW.baseUrl, NEW.status, NEW.prevFetchTime, NEW.fetchTime, NEW.fetchInterval,
       NEW.retriesSinceFetch, NEW.reprUrl, NEW.content, NEW.typ, NEW.protocolStatus,
       NEW.modifiedTime, NEW.title, NEW.text, NEW.parseStatus, NEW.signature, NEW.prevSignature,
       NEW.score, NEW.headers, NEW.inlinks, NEW.outlinks, NEW.metadata, NEW.markers, NEW.isDelete);
  END IF;
END

java.io.IOException: java.sql.BatchUpdateException: Can't update table 'webpage_39' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.


The fix is to remove the statement delete from webpage_39 where id = NEW.id; from the trigger.

posted on 2013-04-22 23:15 by 阳光总在风雨后001
