Nutch 2.1 source code analysis

Nutch 2.1: developing split-table crawling

1. How the Nutch source reads the storage.schema.webpage property from nutch-default.xml (default value: webpage):

Class: org.apache.nutch.storage.StorageUtils.java

@SuppressWarnings("unchecked")
public static <K, V extends Persistent> DataStore<K, V> createWebStore(
    Configuration conf, Class<K> keyClass, Class<V> persistentClass)
    throws ClassNotFoundException, GoraException {
  String schema = null;
  if (WebPage.class.equals(persistentClass)) {
    schema = conf.get("storage.schema.webpage", "webpage");
  } else if (Host.class.equals(persistentClass)) {
    schema = conf.get("storage.schema.host", "host");
  } else {
    throw new UnsupportedOperationException(
        "Unable to create store for class " + persistentClass);
  }

  String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
  if (!crawlId.isEmpty()) {
    conf.set("schema.prefix", crawlId + "_");
  } else {
    conf.set("schema.prefix", "");
  }

  Class<? extends DataStore<K, V>> dataStoreClass =
      (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
  return DataStoreFactory.createDataStore(dataStoreClass, keyClass,
      persistentClass, conf, schema);
}
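To make the name resolution in createWebStore easier to follow, here is a minimal sketch that reproduces only the two lookups, using a plain Map in place of Hadoop's Configuration. The class SchemaNameSketch and its method names are made up for illustration, and the key string "storage.crawl.id" is an assumption about the value of Nutch.CRAWL_ID_KEY; it is not taken from the listing above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the lookups in createWebStore; not Nutch code. */
final class SchemaNameSketch {

    // Assumed value of Nutch.CRAWL_ID_KEY (not shown in the original listing).
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    /** Mirrors conf.get("storage.schema.webpage", "webpage"). */
    static String webpageSchema(Map<String, String> conf) {
        return conf.getOrDefault("storage.schema.webpage", "webpage");
    }

    /** Mirrors the schema.prefix branch: crawlId + "_" if a crawl id is set, else "". */
    static String schemaPrefix(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(webpageSchema(conf));   // falls back to the default name
        conf.put(CRAWL_ID_KEY, "39");
        System.out.println(schemaPrefix(conf));    // prefix derived from the crawl id
    }
}
```

In the stock code the crawl id only produces a prefix (schema.prefix); how the store combines that prefix with the schema name is up to the Gora data store, which is why the split-table setup described in this post needs its own configuration changes.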

2. Restrict the crawl to the target addresses:

Append to the end of regex-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
Append to the end of automaton-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
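It helps to check which URLs this rule really lets through. The sketch below tests the rule body (without the leading "+", which marks an accept rule) with java.util.regex, assuming the regex filter applies the pattern with a find() style match. The class UrlFilterSketch and both method names are hypothetical. Note that the group has no quantifier, so exactly one host label is required before zhangzhou.gov.cn; the RELAXED variant with a trailing * on the group is only shown for comparison, in case the bare domain or deeper subdomains should be crawled as well.

```java
import java.util.regex.Pattern;

/** Illustrative check of the URL rule above; not part of Nutch. */
final class UrlFilterSketch {

    // Rule body exactly as written in the filter files (leading '+' removed).
    static final Pattern RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)zhangzhou.gov.cn/");

    // Comparison variant: '*' on the group allows zero or more host labels.
    static final Pattern RELAXED =
        Pattern.compile("^http://([a-z0-9]*\\.)*zhangzhou.gov.cn/");

    /** True if the rule as written would accept the URL. */
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    /** True if the relaxed comparison variant would accept the URL. */
    static boolean acceptsRelaxed(String url) {
        return RELAXED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.zhangzhou.gov.cn/index.html",
            "http://zhangzhou.gov.cn/",
            "http://a.b.zhangzhou.gov.cn/page",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(url + " rule=" + accepts(url)
                + " relaxed=" + acceptsRelaxed(url));
        }
    }
}
```

With the rule as written, only the first URL is accepted; the bare domain and the two-level subdomain are accepted only by the relaxed variant.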

3. Configuration files that need to be changed for split tables:

gora-sql-mapping.xml
nutch-default.xml

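4. Split-table naming rule

Name of the table that stores the crawled data + "_" + id of the crawl address (for example, webpage_39)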
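The naming rule is simple enough to express as a one-line helper. This is a minimal sketch; the class ShardTableNameSketch and the method tableNameFor are hypothetical names, and the id is assumed to be numeric, as in the webpage_39 table used by the trigger later in this post.

```java
/** Illustrative helper for the split-table naming rule; not part of Nutch. */
final class ShardTableNameSketch {

    /** Builds baseTable + "_" + addressId. */
    static String tableNameFor(String baseTable, long addressId) {
        if (baseTable == null || baseTable.isEmpty()) {
            throw new IllegalArgumentException("baseTable must not be empty");
        }
        return baseTable + "_" + addressId;
    }

    public static void main(String[] args) {
        // Table for the crawl address with id 39.
        System.out.println(tableNameFor("webpage", 39));
    }
}
```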

5. The crawl may die while running, with the exception message: Output path is null in cleanup

Fix: wrap the failing code in class org.apache.nutch.indexer.elastic.ElasticWriter.java in a try/catch, as follows:
  private void processExecute(boolean createNewBulk) {
    try {
      if (execute != null) {
        // wait for previous to finish
        long beforeWait = System.currentTimeMillis();
        BulkResponse actionGet = execute.actionGet();
        if (actionGet.hasFailures()) {
          for (BulkItemResponse item : actionGet) {
            if (item.failed()) {
              throw new RuntimeException("First failure in bulk: "
                  + item.getFailureMessage());
            }
          }
        }
        long msWaited = System.currentTimeMillis() - beforeWait;
        LOG.info("Previous took in ms " + actionGet.getTookInMillis()
            + ", including wait " + msWaited);
        execute = null;
      }
      if (bulk != null) {
        if (bulkDocs > 0) {
          // start a flush, note that this is an asynchronous call
          execute = bulk.execute();
        }
        bulk = null;
      }
      if (createNewBulk) {
        // Prepare a new bulk request
        bulk = client.prepareBulk();
        bulkDocs = 0;
        bulkLength = 0;
      }
    } catch (Exception e) {
      // Swallow the failure (including the RuntimeException thrown above)
      // so the job keeps running; only the stack trace is printed.
      e.printStackTrace();
    }
  }

6. Workaround for occasional garbled Chinese characters when saving: catch the exception with try/catch

Class name: com.suncco.leadsite.utils.NutchJob.java
  @Override
  public boolean waitForCompletion(boolean verbose) {
    // Note: this stays true if an exception is thrown below, so the
    // caller continues as if the job had succeeded.
    boolean succeeded = true;
    try {
      succeeded = super.waitForCompletion(verbose);
      if (!succeeded) {
        // check if we want to fail whenever a job fails. (expert setting)
        if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
          Log.warn("job failed: " + "name=" + getJobName()
              + ", jobid=" + getJobID());
        }
      }
    } catch (Exception e) {
      // Swallow the exception so one bad record does not abort the crawl;
      // only the stack trace is printed.
      e.printStackTrace();
    }
    return succeeded;
  }

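Writing the trigger:

A trigger must not delete from the table it is defined on; otherwise an error is raised, as with the following trigger: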

Trigger code:
BEGIN
    SET @isLocalCount = (select count(*) from webpage_39 where id = NEW.id);
    IF (@isLocalCount > 0) THEN
        delete from webpage_39 where id = NEW.id;
    END IF;
    SET @count = (select count(*) from suncco_spider.webpage where id = NEW.id);
    IF (@count = 0 && NEW.id like '%fj.fj%') THEN
        insert into suncco_spider.webpage
            (id, baseUrl, status, prevFetchTime, fetchTime, fetchInterval,
             retriesSinceFetch, reprUrl, content, typ, protocolStatus,
             modifiedTime, title, text, parseStatus, signature, prevSignature,
             score, headers, inlinks, outlinks, metadata, markers, isDelete)
        values
            (NEW.id, NEW.baseUrl, NEW.status, NEW.prevFetchTime, NEW.fetchTime,
             NEW.fetchInterval, NEW.retriesSinceFetch, NEW.reprUrl, NEW.content,
             NEW.typ, NEW.protocolStatus, NEW.modifiedTime, NEW.title, NEW.text,
             NEW.parseStatus, NEW.signature, NEW.prevSignature, NEW.score,
             NEW.headers, NEW.inlinks, NEW.outlinks, NEW.metadata, NEW.markers,
             NEW.isDelete);
    END IF;
END

java.io.IOException: java.sql.BatchUpdateException: Can't update table 'webpage_39' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.


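The fix is to remove the statement delete from webpage_39 where id = NEW.id; from the trigger. Because MySQL does not permit an IF block with an empty statement list, the surrounding IF ... END IF (and the @isLocalCount lookup that feeds it) has to go with it.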

 

posted on 2013-04-22 23:15  阳光总在风雨后001