Recently I worked on an ETL task: consuming data from Kafka into a DB. Doing a connect->insert->close round trip for every single row is not a good option.

First, setParallelism(n) can reduce the total wall-clock time, but the number of insert operations stays the same.

Parallelism can be set at every spot: the data source, each operator, and the data sink.

DataStream<String> text = env.socketTextStream("127.0.0.1", 9000).setParallelism(1);
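For example, a minimal sketch of setting it at an operator and at a sink as well (the uppercase map and the print sink are just placeholders I made up):

DataStream<String> upper = text.map(new MapFunction<String, String>() {
    @Override
    public String map(String s) throws Exception {
        return s.toUpperCase();
    }
}).setParallelism(4); // operator-level parallelism

upper.print().setParallelism(1); // sink-level parallelism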

 

Secondly, use a time window or count window to collect records, convert the window contents into a List<String>, and then do one batch insert. (Open questions: the difference between window types, the role of keys, and process() versus apply(); see the sketch after the code below.)

text.timeWindowAll(Time.seconds(5)).process(new ProcessWindow()).addSink(new RichSinkFunction<List<String>>() {
    private Connection connection;
    private PreparedStatement stmt;
    private String SQL = ""; // fill in the INSERT statement with one placeholder

    @Override
    public void open(Configuration parameters) throws Exception {
        String url = "jdbc:mysql://localhost:3306/test";
        String user = "root";
        String password = "root";
        String driverClass = "com.mysql.jdbc.Driver";
        try {
            Class.forName(driverClass);
            Properties info = new Properties();
            info.setProperty("user", user);
            info.setProperty("password", password);
            this.connection = DriverManager.getConnection(url, info);
        } catch (ClassNotFoundException e) {
            log.error("Driver class not found, loading driver failed.");
        } catch (SQLException e) {
            log.error("create db connection error,", e);
        }
    }

    @Override
    public void invoke(List<String> value, Context context) throws Exception {
        // One transaction per window: add every element to the batch, then execute once.
        this.connection.setAutoCommit(false);
        this.stmt = this.connection.prepareStatement(SQL);
        for (String s : value) {
            stmt.setString(1, s);
            stmt.addBatch();
        }
        int[] result = stmt.executeBatch();
        this.connection.commit();
        log.info("inserted:" + result.length);
    }

    @Override
    public void close() throws Exception {
        super.close();
        // Actually close the statement and connection instead of just nulling them out.
        if (this.stmt != null) {
            this.stmt.close();
        }
        if (this.connection != null) {
            this.connection.close();
        }
    }
}).setParallelism(1);
// execute program

env.execute("Java WordCount from SocketTextStream Example");
}

private static class ProcessWindow extends ProcessAllWindowFunction<String, List<String>, TimeWindow> {

    @Override
    public void process(Context context, Iterable<String> iterable, Collector<List<String>> collector) throws Exception {
        // Collect all elements of one 5-second window into a single list and emit it downstream.
        List<String> batch = new ArrayList<String>();
        for (String s : iterable) {
            batch.add(s);
        }
        collector.collect(batch);
    }
}
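On my own questions above, a sketch of the alternatives (the count of 100 and keying by the line itself are arbitrary choices for illustration): a count window fires on element count instead of time, which also changes the window type from TimeWindow to GlobalWindow; and on a keyed stream, apply() is the older interface that only receives the key, window, and elements, while process() additionally exposes a Context with access to state and window metadata.

// Count window variant: emits a batch every 100 elements instead of every 5 seconds.
text.countWindowAll(100)
    .process(new ProcessAllWindowFunction<String, List<String>, GlobalWindow>() {
        @Override
        public void process(Context context, Iterable<String> elements, Collector<List<String>> out) {
            List<String> batch = new ArrayList<String>();
            for (String s : elements) {
                batch.add(s);
            }
            out.collect(batch);
        }
    });

// Keyed time window with apply(): each window holds the elements of one key only.
text.keyBy(new KeySelector<String, String>() {
        @Override
        public String getKey(String value) {
            return value; // arbitrary: key by the line itself
        }
    })
    .timeWindow(Time.seconds(5))
    .apply(new WindowFunction<String, List<String>, String, TimeWindow>() {
        @Override
        public void apply(String key, TimeWindow window, Iterable<String> input, Collector<List<String>> out) {
            List<String> batch = new ArrayList<String>();
            for (String s : input) {
                batch.add(s);
            }
            out.collect(batch);
        }
    });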

Thirdly, convert the DataStream into a Table, then use Table/SQL output, e.g. JDBCOutputFormat (batched via setBatchInterval) or JDBCAppendTableSink (an AppendStreamTableSink, batched via setBatchSize), to make the inserts batched.

 

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

DataStream<String> text = env.socketTextStream("127.0.0.1", 9000).setParallelism(1);

DataStream<WordCount> counts = text.flatMap(new FlatMapFunction<String, WordCount>() {
    @Override
    public void flatMap(String s, Collector<WordCount> collector) throws Exception {
        for (String item : s.split(" ")) {
            collector.collect(new WordCount(item, 1));
        }
    }
}).setParallelism(1);

// Register the flatMap result (not the raw text stream) as a table,
// assuming WordCount is a POJO with fields word and frequency.
tableEnv.registerDataStream("firsttable", counts, "word, frequency");
Table table = tableEnv.sqlQuery("select word from firsttable");

// The builder also requires a query and parameter types; the target table "words" is a placeholder.
String[] fields = {"word"};
JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
    .setDrivername("com.mysql.jdbc.Driver")
    .setDBUrl("jdbc:mysql://localhost:3306/test")
    .setUsername("root")
    .setPassword("root")
    .setQuery("insert into words (word) values (?)")
    .setParameterTypes(Types.STRING)
    .setBatchSize(3) // flush one batched insert every 3 rows
    .build();
tableEnv.registerTableSink("test", fields, new TypeInformation[]{Types.STRING}, sink);
table.insertInto("test"); // without this the query result never reaches the sink

env.execute("ss");
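JDBCOutputFormat, mentioned above, can achieve the same thing without registering a table sink; a sketch, assuming a hypothetical table words with one VARCHAR column (setBatchInterval controls how many rows are buffered before one batched execute):

JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
    .setDrivername("com.mysql.jdbc.Driver")
    .setDBUrl("jdbc:mysql://localhost:3306/test")
    .setUsername("root")
    .setPassword("root")
    .setQuery("insert into words (word) values (?)") // hypothetical target table
    .setBatchInterval(100)
    .finish();

// Convert the Table result back into a DataStream of Rows and write it through the output format.
DataStream<Row> rows = tableEnv.toAppendStream(table, Row.class);
rows.writeUsingOutputFormat(jdbcOutput);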

Lastly, FlinkKafkaConsumer may support batched consumption with a time interval.

To be researched (a starting point is sketched below).
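A sketch of wiring up the consumer (broker address, group id, and topic are placeholders). The consumer itself emits records one by one; Kafka's own fetch.min.bytes / fetch.max.wait.ms properties control how much data each fetch pulls from the broker, which may be what "batch consumer with a time interval" maps to, while the row-to-batch conversion would still happen downstream in a window or in the sink:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
props.setProperty("group.id", "etl-test");                // placeholder group id

FlinkKafkaConsumer<String> consumer =
    new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props); // placeholder topic

DataStream<String> stream = env.addSource(consumer).setParallelism(1);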

posted on 2020-02-22 17:30 connie313