NiFi Deduplication Design (Part 1): Deduplication Within a Single Queue
I could not find a deduplication component among the official processors, even though the scenario is fairly common.
This is the first of two posts on deduplicating FlowFiles within a NiFi queue. Neither approach is perfect, but both are good enough for day-to-day use.
Suppose each FlowFile represents a task. As a scenario most engineers can easily relate to, consider a web crawler.
A FlowFile consists of two parts, attributes and content, much like a local file's metadata (name, permissions, size, modification time, etc.) and its content (text or binary).
In the crawler scenario, a FlowFile is one URL to download: the URL is stored in an attribute, and the FlowFile has no content at all.
The processor's job is to download that URL; the FlowFiles in the upstream queue carry nothing but URL information.
Normally a processor reads one FlowFile at a time from the queue, strictly 1:1, so identical URLs sitting in the queue end up processed multiple times; a minimal sketch of this 1:1 read follows.
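For illustration, a downloader written in this 1:1 style looks roughly like the sketch below; the attribute name "url" is an assumption for the example, not something the flow prescribes.

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    // one FlowFile per trigger: the strictly 1:1 pattern described above
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    // the URL lives in an attribute (assumed name "url"); there is no content to read
    String url = flowFile.getAttribute("url");
    // ... download the url, then route the FlowFile onward
    session.transfer(flowFile, REL_SUCCESS);
}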
So we implement a lightweight processor ourselves.
It builds a unique key from the FlowFile's attributes, deduplicates on that key, keeps only one FlowFile per key (either the first or the last), and forwards the survivors to the downstream queue.
Its applicable scope is narrow: it only deduplicates FlowFiles that pass through the same queue within a small time window. It is an aid rather than a complete solution; fully solving deduplication requires support from external storage. The main purpose of this method is to reduce the I/O pressure on that external storage.
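To make the division of labor concrete, here is a hypothetical sketch of the external-storage check that a complete solution would add; SeenKeyStore is an interface invented for this example, standing in for Redis, a database, or a distributed map cache. Every FlowFile that survives the in-queue dedup still costs one store round trip, which is exactly the I/O this processor cuts down.

// SeenKeyStore is invented for illustration; any shared store with an
// atomic put-if-absent works (Redis SETNX, a unique index, a map cache)
interface SeenKeyStore {
    // returns true if the key was newly recorded, false if already seen
    boolean putIfAbsent(String key);
}

void handle(SeenKeyStore store, String key) {
    // one round trip per surviving FlowFile; duplicates collapsed by the
    // in-queue processor never reach this point
    if (!store.putIfAbsent(key)) {
        return; // seen before, skip the download
    }
    // ... proceed with the download
}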
The main code is on GitHub. The structure is very simple, and it makes a good practice project for learning NiFi's custom processor development conventions.
https://github.com/cclient/nifi-unique-processor
Nifi Unique Processor
<custom_id:1,custom_value:123>                              <custom_id:1,custom_value:123>
<custom_id:1,custom_value:456>  -> unique by ${custom_id} ->
<custom_id:2,custom_value:789>                              <custom_id:2,custom_value:789>
nifi queued distinct/unique by 'custom key'
deploy
1 compile
mvn package
2 upload to one of
nifi.nar.library.directory=./lib
nifi.nar.library.directory.custom=./lib_custom
nifi.nar.library.autoload.directory=./extensions
nifi.nar.working.directory=./work/nar/
cp nifi-unique-nar/target/nifi-unique-nar-0.1.nar nifi/lib_custom/
3 restart nifi if needed (not necessary when using the autoload directory)
nifi/bin/nifi.sh restart
@Override
public void onTrigger(ProcessContext processContext, ProcessSession session) throws ProcessException {
    int bulkSize = processContext.getProperty(BULK_SIZE).asInteger();
    if (bulkSize == 0) {
        // 0 means "unlimited": take everything the session currently offers
        bulkSize = Integer.MAX_VALUE;
    }
    List<FlowFile> originalList = session.get(bulkSize);
    if (originalList == null || originalList.isEmpty()) {
        return;
    }
    boolean retainFirst = processContext.getProperty(RETAIN_FIRST).asBoolean();
    Map<String, FlowFile> map = new HashMap<>(originalList.size());
    List<FlowFile> needRemoveFlowFiles = new ArrayList<>(originalList.size());
    List<FlowFile> errorFlowFiles = new ArrayList<>(originalList.size());
    List<FlowFile> needNextFlowFiles = new ArrayList<>(originalList.size());
    originalList.forEach(flowFile -> {
        // evaluate the UNIQUE_KEY expression (e.g. ${custom_id}) against
        // this FlowFile's attributes to build the dedup key
        String key = processContext.getProperty(UNIQUE_KEY).evaluateAttributeExpressions(flowFile).getValue();
        if (key == null || key.isEmpty()) {
            errorFlowFiles.add(flowFile);
            return;
        }
        if (map.containsKey(key)) {
            if (retainFirst) {
                // keep the first FlowFile seen for this key, drop the newcomer
                needRemoveFlowFiles.add(flowFile);
            } else {
                // keep the last FlowFile seen: swap out the previous holder
                FlowFile oldSame = map.get(key);
                needRemoveFlowFiles.add(oldSame);
                needNextFlowFiles.remove(oldSame);
                needNextFlowFiles.add(flowFile);
                // keep the map pointing at the retained FlowFile so a third
                // duplicate compares against it rather than a stale entry
                map.put(key, flowFile);
            }
        } else {
            needNextFlowFiles.add(flowFile);
            map.put(key, flowFile);
        }
    });
    logger.info("distinct original size: {}, retain size: {}, remove size: {}, error size: {}",
            originalList.size(), needNextFlowFiles.size(), needRemoveFlowFiles.size(), errorFlowFiles.size());
    session.transfer(needNextFlowFiles, REL_SUCCESS);
    session.transfer(errorFlowFiles, REL_FAILURE);
    session.remove(needRemoveFlowFiles);
}
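A quick way to see the behavior is NiFi's nifi-mock test harness. In the sketch below, the class name UniqueProcessor and the property name "Unique Key" are assumptions; check the repository for the actual identifiers and for the BULK_SIZE/RETAIN_FIRST defaults.

import java.util.Collections;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.jupiter.api.Test;

public class UniqueProcessorTest {
    @Test
    public void dedupsByCustomId() {
        // class and property names are assumptions about the repo
        TestRunner runner = TestRunners.newTestRunner(UniqueProcessor.class);
        runner.setProperty("Unique Key", "${custom_id}");
        runner.enqueue(new byte[0], Collections.singletonMap("custom_id", "1"));
        runner.enqueue(new byte[0], Collections.singletonMap("custom_id", "1")); // duplicate key
        runner.enqueue(new byte[0], Collections.singletonMap("custom_id", "2"));
        runner.run();
        // one FlowFile per custom_id should reach the success relationship
        runner.assertTransferCount("success", 2);
    }
}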