Nutch 2.1 source code analysis
Nutch 2.1: developing a crawl with per-site table sharding
1. The Nutch source reads the storage.schema.webpage property from nutch-default.xml (default value: webpage):
Class: org.apache.nutch.storage.StorageUtils (StorageUtils.java)
    @SuppressWarnings("unchecked")
    public static <K, V extends Persistent> DataStore<K, V> createWebStore(
            Configuration conf, Class<K> keyClass, Class<V> persistentClass)
            throws ClassNotFoundException, GoraException {
        // Pick the schema (table) name according to the persistent class.
        String schema = null;
        if (WebPage.class.equals(persistentClass)) {
            schema = conf.get("storage.schema.webpage", "webpage");
        } else if (Host.class.equals(persistentClass)) {
            schema = conf.get("storage.schema.host", "host");
        } else {
            throw new UnsupportedOperationException(
                    "Unable to create store for class " + persistentClass);
        }

        // A non-empty crawl id becomes the schema prefix "<crawlId>_".
        String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
        if (!crawlId.isEmpty()) {
            conf.set("schema.prefix", crawlId + "_");
        } else {
            conf.set("schema.prefix", "");
        }

        Class<? extends DataStore<K, V>> dataStoreClass =
                (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
        return DataStoreFactory.createDataStore(
                dataStoreClass, keyClass, persistentClass, conf, schema);
    }
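To make the lookup in createWebStore easier to follow, here is a minimal sketch of just the two configuration reads. It is not Nutch code: a plain Map stands in for Hadoop's Configuration, the class and method names are hypothetical, and the key "storage.crawl.id" is assumed to be the value behind Nutch.CRAWL_ID_KEY.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the configuration reads in StorageUtils.createWebStore. */
final class SchemaLookupSketch {

    // Assumption: this is the key behind Nutch.CRAWL_ID_KEY.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    /** Returns the configured schema name for web pages, or "webpage" if unset. */
    static String webpageSchema(Map<String, String> conf) {
        return conf.getOrDefault("storage.schema.webpage", "webpage");
    }

    /** Returns crawlId + "_" when a crawl id is set, otherwise an empty prefix. */
    static String schemaPrefix(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("schema=" + webpageSchema(conf) + " prefix='" + schemaPrefix(conf) + "'");

        conf.put(CRAWL_ID_KEY, "site39"); // hypothetical crawl id
        System.out.println("schema=" + webpageSchema(conf) + " prefix='" + schemaPrefix(conf) + "'");
    }
}
```

How the prefix and the schema name are finally combined into a table name is decided by the Gora data store, so the sketch only reports the two values separately.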
2. Restrict the crawl to the target site with URL filter rules:
Append to the end of regex-urlfilter.txt:
    +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
Append to the end of automaton-urlfilter.txt:
    +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
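To see which URLs such a rule lets through, the regex can be tried out with java.util.regex. This is only a rough sketch: it assumes the regex filter applies the pattern with "find" semantics, it does not model the automaton filter (which uses a different regex engine), and the class name, the starred variant, and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

/** Tries the accept rule from the text against a few sample URLs. */
final class UrlFilterRuleSketch {

    // Regex part of the rule as written in the text; the leading '+' only means "accept".
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)zhangzhou.gov.cn/");

    // Hypothetical variant with '*' after the group: zero or more subdomain labels.
    static final Pattern RULE_ANY_DEPTH =
            Pattern.compile("^http://([a-z0-9]*\\.)*zhangzhou.gov.cn/");

    /** Returns true if the rule matches somewhere in the URL (anchored by '^'). */
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.zhangzhou.gov.cn/index.html",
            "http://zhangzhou.gov.cn/",
            "http://news.www.zhangzhou.gov.cn/",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(url
                    + " as written: " + accepts(RULE, url)
                    + ", starred variant: " + accepts(RULE_ANY_DEPTH, url));
        }
    }
}
```

Under these assumptions the rule as written accepts exactly one subdomain label before zhangzhou.gov.cn, so the bare domain and deeper subdomains are rejected; the starred variant accepts them as well.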
3. Configuration files that must be modified to enable table sharding in Nutch:
gora-sql-mapping.xml
nutch-default.xml
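4. Nutch table sharding rule:
Shard table name = name of the storage table + "_" + id of the crawl address (for example webpage_39, the table used in the trigger code).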
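The naming rule can be sketched as a small helper. The class and method names are hypothetical, and the id is assumed to be numeric, as in webpage_39; this is not taken from the modified Nutch source.

```java
/** Builds shard table names following the rule: base table name + "_" + crawl address id. */
final class ShardTableNameSketch {

    /** Returns e.g. "webpage_39" for base table "webpage" and crawl address id 39. */
    static String shardTableName(String baseTable, long crawlAddressId) {
        if (baseTable == null || baseTable.isEmpty()) {
            throw new IllegalArgumentException("baseTable must not be empty");
        }
        return baseTable + "_" + crawlAddressId;
    }

    public static void main(String[] args) {
        System.out.println(shardTableName("webpage", 39));
    }
}
```

Keeping the rule in one place means the crawler and any SQL that refers to a shard table (such as a trigger) derive the same name from the same id.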
5. The crawl may crash while running, with the exception message: Output path is null in cleanup
Fix: wrap the failing section of org.apache.nutch.indexer.elastic.ElasticWriter (ElasticWriter.java) in a try/catch, as follows:
    private void processExecute(boolean createNewBulk) {
        try {
            if (execute != null) {
                // wait for previous to finish
                long beforeWait = System.currentTimeMillis();
                BulkResponse actionGet = execute.actionGet();
                if (actionGet.hasFailures()) {
                    for (BulkItemResponse item : actionGet) {
                        if (item.failed()) {
                            throw new RuntimeException("First failure in bulk: "
                                    + item.getFailureMessage());
                        }
                    }
                }
                long msWaited = System.currentTimeMillis() - beforeWait;
                LOG.info("Previous took in ms " + actionGet.getTookInMillis()
                        + ", including wait " + msWaited);
                execute = null;
            }
            if (bulk != null) {
                if (bulkDocs > 0) {
                    // start a flush, note that this is an asynchronous call
                    execute = bulk.execute();
                }
                bulk = null;
            }
            if (createNewBulk) {
                // Prepare a new bulk request
                bulk = client.prepareBulk();
                bulkDocs = 0;
                bulkLength = 0;
            }
        } catch (Exception e) {
            // TODO: handle exception
            // Workaround: the failure is only printed, so the job keeps running.
            e.printStackTrace();
        }
    }
6. Fix for occasional garbled Chinese characters when saving: catch the exception with try/catch
Class: com.suncco.leadsite.utils.NutchJob (NutchJob.java)
    @Override
    public boolean waitForCompletion(boolean verbose) {
        boolean succeeded = true;
        try {
            succeeded = super.waitForCompletion(verbose);
            if (!succeeded) {
                // check if we want to fail whenever a job fails. (expert setting)
                if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
                    Log.warn("job failed: " + "name=" + getJobName()
                            + ", jobid=" + getJobID());
                }
            }
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
        }
        return succeeded;
    }
Writing the trigger:
A trigger must not delete rows from the table it is defined on; doing so raises an error. For example:
Trigger code:
    BEGIN
        SET @isLocalCount = (select count(*) from webpage_39 where id = NEW.id);
        IF (@isLocalCount > 0)
        THEN
            delete from webpage_39 where id = NEW.id;
        END IF;
        SET @count = (select count(*) from suncco_spider.webpage where id = NEW.id);
        IF (@count = 0 && NEW.id like '%fj.fj%')
        THEN
            insert into suncco_spider.webpage
                (id, baseUrl, status, prevFetchTime, fetchTime, fetchInterval,
                 retriesSinceFetch, reprUrl, content, typ, protocolStatus,
                 modifiedTime, title, text, parseStatus, signature, prevSignature,
                 score, headers, inlinks, outlinks, metadata, markers, isDelete)
            values
                (NEW.id, NEW.baseUrl, NEW.status, NEW.prevFetchTime, NEW.fetchTime, NEW.fetchInterval,
                 NEW.retriesSinceFetch, NEW.reprUrl, NEW.content, NEW.typ, NEW.protocolStatus,
                 NEW.modifiedTime, NEW.title, NEW.text, NEW.parseStatus, NEW.signature, NEW.prevSignature,
                 NEW.score, NEW.headers, NEW.inlinks, NEW.outlinks, NEW.metadata, NEW.markers, NEW.isDelete);
        END IF;
    END
Running the crawl with this trigger fails with:
java.io.IOException: java.sql.BatchUpdateException: Can't update table 'webpage_39' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
The fix is to remove the statement delete from webpage_39 where id = NEW.id; from the trigger.
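With the delete removed, what remains of the trigger is a single decision: copy the new row into suncco_spider.webpage only if its id is not there yet and the id contains fj.fj. The sketch below models that decision in memory. It is not database code: the class and method names and the sample ids are hypothetical, a Set stands in for the central table, and String.contains is case-sensitive whereas SQL LIKE may not be, depending on collation.

```java
import java.util.HashSet;
import java.util.Set;

/** In-memory model of the corrected trigger's copy decision. */
final class TriggerLogicSketch {

    /** Mirrors: IF (@count = 0 && NEW.id like '%fj.fj%'). */
    static boolean shouldCopy(Set<String> centralIds, String newId) {
        return !centralIds.contains(newId) && newId.contains("fj.fj");
    }

    /** Models one trigger firing; returns true if the row was copied to the central table. */
    static boolean onRowWritten(Set<String> centralIds, String newId) {
        if (shouldCopy(centralIds, newId)) {
            centralIds.add(newId); // stands in for the insert into suncco_spider.webpage
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> centralIds = new HashSet<>();
        String[] ids = {"id-fj.fj-1", "id-fj.fj-1", "id-other-2"}; // hypothetical row ids
        for (String id : ids) {
            System.out.println(id + " copied: " + onRowWritten(centralIds, id));
        }
    }
}
```

The second firing for the same id is skipped because the id is already present, which is the duplicate check the trigger performs with count(*).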
posted on 2013-04-22 23:15 阳光总在风雨后001