1. How do you enable checkpointing?
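// get the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 300 seconds (the interval is given in milliseconds)
env.enableCheckpointing(300 * 1000);
// leave at least 300 seconds between the end of one checkpoint and the start of the next
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(300 * 1000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
// (setters as in the Flink 1.10-era CheckpointConfig API; newer releases deprecate or rename some of them)
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);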
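With the interval and the minimum pause both set to 300 seconds, it can help to work out when the next checkpoint may actually start. The sketch below is a simplified model, not Flink's scheduler code: it assumes the next checkpoint starts no earlier than one interval after the previous start, and no earlier than the minimum pause after the previous end. The class and method names are made up for illustration.

```java
// Simplified model of checkpoint spacing (an assumption, not Flink's actual scheduler).
final class CheckpointSpacing {

    /**
     * Earliest start time (ms) of the next checkpoint under the simplified model:
     * both the interval (measured from the previous start) and the minimum pause
     * (measured from the previous end) must have elapsed.
     */
    static long nextStartMillis(long prevStart, long prevEnd, long interval, long minPause) {
        long byInterval = prevStart + interval;
        long byPause = prevEnd + minPause;
        return Math.max(byInterval, byPause);
    }

    public static void main(String[] args) {
        long interval = 300 * 1000L;  // same value as enableCheckpointing(...)
        long minPause = 300 * 1000L;  // same value as setMinPauseBetweenCheckpoints(...)
        // Hypothetical previous checkpoint: started at t = 0 and took 40 seconds.
        long next = nextStartMillis(0L, 40_000L, interval, minPause);
        System.out.println("next checkpoint starts no earlier than t=" + next / 1000 + " s");
        // prints: next checkpoint starts no earlier than t=340 s
    }
}
```

Under this model, setting the pause equal to the interval means the pause always wins, so checkpoints end up spaced 300 seconds plus their own duration apart rather than exactly 300 seconds.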
2. How do you restore from a checkpoint?
Difference to Savepoints
Checkpoints differ from savepoints in two ways: they use a state-backend-specific (low-level) data format that may be incremental, and they do not support Flink-specific features such as rescaling.
Resuming from a retained checkpoint
A job can be resumed from a retained checkpoint just as from a savepoint, by passing the checkpoint's metadata file instead (see the savepoint restore guide). If the metadata file is not self-contained, the JobManager needs access to the data files it refers to (see the Directory Structure section of the Flink checkpoint documentation).
$ bin/flink run -s :checkpointMetaDataPath [:runArgs]
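When several checkpoints are retained, each sits in its own numbered chk-<n> directory, and you usually want the newest one. Here is a minimal sketch, assuming you already have the directory names from a listing of the job's checkpoint directory. The listing in main and the class name LatestCheckpoint are hypothetical. It compares the numbers numerically, since a plain string sort would put chk-9 after chk-248.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pick the newest chk-<n> directory name from a listing.
final class LatestCheckpoint {

    // Matches names such as "chk-225"; the digit cap keeps Long.parseLong from overflowing.
    private static final Pattern CHK_DIR = Pattern.compile("chk-(\\d{1,18})");

    /** Returns the chk-<n> name with the largest n, ignoring entries such as "shared" or "taskowned". */
    static Optional<String> latest(List<String> dirNames) {
        String best = null;
        long bestId = -1L;
        for (String name : dirNames) {
            Matcher m = CHK_DIR.matcher(name);
            if (m.matches()) {
                long id = Long.parseLong(m.group(1));
                if (id > bestId) {
                    bestId = id;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Hypothetical listing of <checkpoint-dir>/<job-id>/
        List<String> listing = List.of("chk-9", "chk-225", "chk-248", "shared", "taskowned");
        System.out.println(latest(listing).orElse("no checkpoint found"));
        // prints: chk-248
    }
}
```

The selected directory, appended to the job's checkpoint directory, is what goes after -s in the command above.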
Restore a savepoint
./bin/flink run -s <savepointPath> ...
The run command has a savepoint flag to submit a job, which restores its state from a savepoint. The savepoint path is returned by the savepoint trigger command.
By default, Flink tries to match all savepoint state to the job being submitted. To skip savepoint state that cannot be restored with the new job, set the allowNonRestoredState flag. You need this if you removed an operator that was part of the program when the savepoint was triggered and you still want to use the savepoint.
./bin/flink run -s <savepointPath> -n ...
This is useful if your program dropped an operator that was part of the savepoint.
-n,--allowNonRestoredState            Allow to skip savepoint state that
                                      cannot be restored. You need to allow
                                      this if you removed an operator from
                                      your program that was part of the
                                      program when the savepoint was
                                      triggered.
-s,--fromSavepoint <savepointPath>    Path to a savepoint to restore the job
                                      from.
For example:
bin/flink run -s hdfs://your-node/application/flink/slankka/checkpoint/37736d4edffd6150c97ff24d6a48bbf4/chk-225 -n ...other arguments
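If you restart jobs from a script or a supervisor process, it is easy to get the flag order wrong or to forget the run subcommand. A small sketch of assembling the restore command as an argument list; the class name RestoreCommand and the jar name my-job.jar are hypothetical, while the flags are the -s and -n options described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assemble "flink run -s <path> [-n] <runArgs...>" as an argument list.
final class RestoreCommand {

    static List<String> build(String checkpointPath, boolean allowNonRestoredState, List<String> runArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/flink");
        cmd.add("run");               // the subcommand must come before the options
        cmd.add("-s");
        cmd.add(checkpointPath);      // retained checkpoint (or savepoint) to restore from
        if (allowNonRestoredState) {
            cmd.add("-n");            // skip state that no longer maps to an operator
        }
        cmd.addAll(runArgs);          // jar file and program arguments go last
        return cmd;
    }

    public static void main(String[] args) {
        String path = "hdfs://your-node/application/flink/slankka/checkpoint/"
                + "37736d4edffd6150c97ff24d6a48bbf4/chk-225";
        // "my-job.jar" is a placeholder for the real jar and its arguments.
        List<String> cmd = build(path, true, List.of("my-job.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the command as a list rather than one string means it can be handed to java.lang.ProcessBuilder without any shell quoting.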
3. How do you collect Flink checkpoint information?
Besides looking it up in the Flink UI, you can also fetch it through Flink's REST API, for example via the YARN proxy.
// For example, request this URL through the YARN proxy: http://yarn-node.slankka.com:8088/proxy/application_1595593091318_0082/jobs/37736d4edffd6150c97ff24d6a48bbf4/metrics?get=lastCheckpointExternalPath
// The response is:
[
  {
    "id": "lastCheckpointExternalPath",
    "value": "hdfs://your-node/application/flink/slankka/checkpoint/37736d4edffd6150c97ff24d6a48bbf4/chk-248"
  }
]
Collecting Flink metrics (especially non-numeric metrics such as lastCheckpointExternalPath)
Flink Metrics
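To use the lastCheckpointExternalPath value from the REST response in a script, you have to pull the string out of the JSON body. A rough sketch using only the JDK, assuming the body has already been fetched (for instance with java.net.http.HttpClient) and that each entry lists "id" before "value" as in the response shown earlier. The class name MetricValue is hypothetical, and a real JSON parser would be the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extract one string-valued metric from a Flink metrics response body.
final class MetricValue {

    /** Finds {"id": "<metricId>", "value": "<...>"} and returns the value, if present. */
    static Optional<String> extract(String json, String metricId) {
        Pattern p = Pattern.compile(
                "\"id\"\\s*:\\s*\"" + Pattern.quote(metricId) + "\"\\s*,\\s*\"value\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample body copied from the response shown in this post.
        String body = "[{\"id\": \"lastCheckpointExternalPath\", \"value\": "
                + "\"hdfs://your-node/application/flink/slankka/checkpoint/"
                + "37736d4edffd6150c97ff24d6a48bbf4/chk-248\"}]";
        System.out.println(extract(body, "lastCheckpointExternalPath").orElse("metric not found"));
    }
}
```

The extracted path can be passed straight to bin/flink run -s when restarting the job.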