FLIP-85 Support Cluster Deploy Mode


Authors: Peter Huang, Yang Wang, Rong Rong, Zili Chen, Shuyi Chen
Last updated: 2019-11-02

Motivation

Apache Flink supports job deployment to Yarn, Mesos, and Kubernetes after the big effort of FLIP-6. A single application user who wants to deploy a Flink job to Yarn can use either per-job cluster mode or session cluster mode. Session cluster mode, which has preallocated resources, is good for short interactive/batch jobs. For platform users who manage tens or hundreds of streaming pipelines for a whole org or company, per-job clusters are more convenient for long-running jobs because of resource and failure isolation.

In per-job cluster mode, the job graph is generated on the client side and shipped as a local resource of the Yarn application. When the application master starts, the cluster entry point loads the job graph file from the local file system of the Node Manager. Generating the job graph on the client side means that, in our case, it happens in the deployer service: we need to download the job jar to the deployer service in order to generate the job graph, and launch the Flink CLI in a separate process to prevent dependency conflicts. With this deployment scenario, the centralized deployment service will be overwhelmed with deployment requests for, in the worst case, all production jobs in a short period of time. The issues with the centralized service in the existing solution include:

  • The network is a bottleneck when downloading thousands of job jars (~100 MB per job) into the service.

  • To prevent dependency conflicts, a separate process is needed for job graph generation of each job. This brings a large memory footprint and CPU usage in a short period, thus reducing the throughput of a single service instance.

  • If we overprovision the service for disaster scenarios, it is a big waste of resources, as the QPS of the service is probably below 1 most of the time.

Thus, it is a challenge for a streaming platform that needs to help all customers handle Flink version upgrades, transient failures, cluster maintenance, and disaster situations with minimal downtime.



Review of Deployment Model

In FLIP-6 [1], a unified deployment process was introduced for Yarn, Mesos, and Kubernetes. A job can be deployed into a "per-job cluster" or a "session cluster".

Session Cluster (Standalone/Yarn/Kubernetes/Mesos)

All of these share the same submission process: generate the job graph on the client side, then upload it to the JobManager via the REST client. Once the job has been submitted successfully, the JobSubmitHandler receives the request and submits the job to the Dispatcher. Then the job master is spawned.

Per-job cluster

Standalone per-job cluster

The user jars already exist on the JobManager side (manually distributed or via docker image). So when the JobManager is launched, ClasspathJobGraphRetriever can be used to get the job graph and recover the Flink job on startup.

Yarn per-job cluster

When we deploy a per-job cluster on Yarn, a local jar is needed and will be used to generate the job graph. Then the user jars and the job graph will be shipped via Yarn local resources. When the JobManager container is launched, all the local resources (user jars, job graph) are ready for use. So YarnJobClusterEntrypoint uses FileJobGraphRetriever to get the job graph and recover (not submit) the Flink job on startup.

Kubernetes per-job cluster 

Kubernetes does not have a default distributed storage, nor a public API to ship files like Yarn local resources. So we cannot ship the user jars and files from the client side to the JobManager and TaskManager, and doing so is not a common practice on Kubernetes anyway. Instead, users usually build their jars and files into the docker image, so when the JobManager and TaskManager are launched, the user jars already exist.

Even if some users do not want to build the jars into the image, they can use an initContainer to download the jars from storage (HTTP/S3/etc.).

All in all, the Kubernetes per-job cluster will only support cluster deploy-mode.

Proposed Change

In this proposal, we discuss a solution that improves deployment scalability for platform users by providing a new deployment model for per-job clusters, which we call cluster mode. The biggest change is that the job graph will no longer be generated on the client side; it will be generated on the JobManager side after it is launched. This is also useful for per-job cluster support on Kubernetes and Mesos, where customers already put user code in the image.

In this solution, the job graph is generated in the job entry point, so the load is distributed to each JobManager, which is naturally spread evenly across all nodes of the cluster (i.e., a few hundred worker nodes on tens of racks). This is more scalable and resource efficient, without causing any extra latency in the deployment process. To achieve the goal, four changes are needed in the Yarn/Kubernetes per-job cluster deployment process:

  • Add CLI Option for deploy-mode, client or cluster

  • Program Metadata Based Job Submission

  • JobGraph Generation in Cluster EntryPoint

  • JobGraph Recovery in Failure 

CLI Option for deploy mode

The separation of deployment and submission is exactly what we need. After the separation, we can choose where to run the user program. For example, we add a new config option "execution.deploy-mode", which can be "client" or "cluster". In "client" mode, the behavior is the same as today.

public static final Option DEPLOY_MODE = new Option(
    "dm", "deployMode", true, // hasArg: the option takes a value, "client" or "cluster"
    "Distinguishes where the user program runs. In \"cluster\" mode, the user program is executed inside the cluster. In \"client\" mode, the user program is executed on the client side.");
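As a minimal, self-contained sketch of how the mode could be resolved (a plain Map stands in for Flink's Configuration class, and the key name "execution.deploy-mode" is the one proposed above), defaulting to "client" to preserve today's behavior:

```java
import java.util.Map;

public class DeployModeResolver {

    // Proposed config key from this FLIP; the ConfigOption wiring is omitted here.
    static final String DEPLOY_MODE_KEY = "execution.deploy-mode";

    /** Returns "client" or "cluster"; defaults to "client" to keep current behavior. */
    public static String resolve(Map<String, String> conf) {
        String mode = conf.getOrDefault(DEPLOY_MODE_KEY, "client");
        if (!mode.equals("client") && !mode.equals("cluster")) {
            throw new IllegalArgumentException("Unknown deploy mode: " + mode);
        }
        return mode;
    }
}
```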

User Jar Schema

Flink clients do not always need a local jar to start a Flink per-job cluster. We could support two types of schema:

Local Schema

file:///path/of/my.jar

When the deploy-mode is client, it means a jar located on the client side.

When the deploy-mode is cluster, it means a jar located on the JobManager side. For example, users deploy onto a K8s or Mesos cluster using images, and the user jar is already in the image.

Remote Schema

hdfs://myhdfs/user/myname/flink/my.jar means a jar located on a remote HDFS

http://myhdfs/user/myname/flink/my.jar means a jar served by an HTTP server

s3://myhdfs/user/myname/flink/my.jar means a jar located on remote S3 storage
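A client could tell the two schema types apart by inspecting the URI scheme. A small sketch of this (the helper name is ours for illustration, not an actual Flink API):

```java
import java.net.URI;

public class JarSchemeClassifier {

    /** Returns true if the jar URI points to a remote location (hdfs/http/s3/...). */
    public static boolean isRemote(String jarUri) {
        String scheme = URI.create(jarUri).getScheme();
        // No scheme or "file" means a path on the client (or JobManager) file system.
        return scheme != null && !scheme.equals("file");
    }
}
```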

Per Job Cluster <Cluster Deployment> Mode

Program Metadata Based Job Submission

In the current implementation of Flink, the job graph is generated from the PackagedProgram in CliFrontend, then deployed to the cluster through the ClusterDescriptor API. Compared to job graph generation on the client side, the delayed job graph generation solution passes the program metadata to the application master through the Configuration.

public class ProgramMetadata {

    private final String[] args;

    private final String mainClassName;

    private SavepointRestoreSettings savepointRestoreSettings;

    private int parallelism;

    private JobID jobID;

    ...
}

To achieve the goal, we can use ExecutionConfigAccessor to get this metadata from ProgramOptions and put it into the Flink configuration (the job ID is assigned on the client side so that it can later be used to query the job status).
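To illustrate the idea, a flattened key-value encoding of the metadata might look as follows. The key names and argument delimiter are our assumptions for the sketch, not the FLIP's final ConfigOptions, and a plain Map stands in for Flink's Configuration:

```java
import java.util.HashMap;
import java.util.Map;

public class ProgramMetadataWriter {

    /** Encodes the program metadata into flat config entries shipped to the entry point. */
    public static Map<String, String> toConfiguration(
            String mainClassName, String[] args, int parallelism, String jobId) {
        Map<String, String> conf = new HashMap<>();
        conf.put("program.main-class", mainClassName);
        // Join args with a delimiter unlikely to appear in user arguments.
        conf.put("program.args", String.join("\u0001", args));
        conf.put("program.parallelism", Integer.toString(parallelism));
        // The job ID is assigned on the client so the job status can be queried later.
        conf.put("program.job-id", jobId);
        return conf;
    }
}
```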

The YarnClusterDescriptor will write the configuration with the ProgramMetadata to local disk and ship it to Yarn as a local resource.

For Kubernetes, the KubernetesClusterDescriptor will construct the ProgramMetadata and store it in a ConfigMap, which will be mounted into the JobManager pod and used to generate the job graph.

JobGraph Generation in Cluster EntryPoint

Run user program in cluster mode

After considering the implementation complexity, we want to use the class JobGraphRetrieveDelegator to generate the job graph and recover the job when the dispatcher of the per-job cluster is started.


Generate job graph

With the ProgramMetadata passed through the Flink configuration, we can easily construct the PackagedProgram in the (Kubernetes/Yarn)JobClusterEntrypoint. The problem is that job graph generation depends on several important modules, such as flink-optimizer and flink-table (if the job uses the Table API / SQL), which are not needed at runtime. As the flink-clients and flink-optimizer modules already depend on flink-runtime for operator chaining, flink-runtime cannot depend on them in turn without creating a circular dependency. Thus the implementation relies on reflection. Basically, we define a JobGraphRetrieveDelegator interface in flink-runtime, and create the concrete implementation ProgramJobGraphRetrieveDelegator within flink-clients.

public interface JobGraphRetrieveDelegator {

    JobGraph retrieveJobGraph(Configuration configuration, ProgramMetadata metadata) throws FlinkException;
}

In the implementation, the flink-clients, flink-optimizer, and flink-table libs need to be on the classpath of the Flink application master, so that the classes needed for job graph generation can be loaded. These jars can either be shipped as Yarn local resources by the CLI, or preinstalled into accessible storage and used as remote resources (public cloud providers charge a lot more for "external" bandwidth than internal, e.g. within an EC2 region). For containerized deployment (Kubernetes), they have to be built into the image; otherwise, job graph generation within the JobClusterEntryPoint will fail.

public class ProgramJobGraphRetrieveDelegator implements JobGraphRetrieveDelegator {

    @Override
    public JobGraph retrieveJobGraph(
            Configuration configuration,
            ProgramMetadata metadata) throws FlinkException {
        final PackagedProgram packagedProgram = createPackagedProgram(metadata);
        final int defaultParallelism = metadata.getParallelism();
        try {
            final JobGraph jobGraph = PackagedProgramUtils.createJobGraph(
                packagedProgram,
                configuration,
                defaultParallelism,
                metadata.getJobID());
            jobGraph.setAllowQueuedScheduling(true);
            jobGraph.setSavepointRestoreSettings(metadata.getSavepointRestoreSettings());
            return jobGraph;
        } catch (Exception e) {
            throw new FlinkException("Could not create the JobGraph from the provided user code jar.", e);
        }
    }

    private PackagedProgram createPackagedProgram(ProgramMetadata metadata) throws FlinkException {
        final String entryClass = metadata.getMainClassName();
        try {
            final Class<?> mainClass = getClass().getClassLoader().loadClass(entryClass);
            return new PackagedProgram(mainClass, metadata.getArgs());
        } catch (ClassNotFoundException | ProgramInvocationException e) {
            throw new FlinkException("Could not load the provided entrypoint class.", e);
        }
    }
}
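The reflection pattern described above can be sketched in isolation as follows. The simplified interface, method signature, and class names here are stand-ins for illustration, not the actual Flink types:

```java
public class DelegatorLoader {

    // Stand-in for the JobGraphRetrieveDelegator interface defined in flink-runtime.
    public interface Delegator {
        String retrieveJobGraph(String jobName); // simplified signature for the sketch
    }

    // flink-runtime only knows the interface; the concrete class name is resolved
    // at runtime from the classpath (e.g. from the flink-clients jar).
    public static Delegator load(String className) throws ReflectiveOperationException {
        Class<?> clazz = Class.forName(className);
        return (Delegator) clazz.getDeclaredConstructor().newInstance();
    }

    // Stand-in for ProgramJobGraphRetrieveDelegator living in flink-clients.
    public static class FakeDelegator implements Delegator {
        @Override
        public String retrieveJobGraph(String jobName) {
            return "JobGraph(" + jobName + ")";
        }
    }
}
```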

JobGraph Failure Recovery

In per-job client mode, JobGraph.bin is currently shipped together with the log configs, the job jar, and flink.jar as local resources. Upon application master failure, the Yarn RM can still reschedule another application master with the same local resources, so no state is lost in failure scenarios. In the cluster deploy mode, the program metadata (including the job ID) is stored in flink-conf.yaml and shipped as a local resource. Thus, the JobClusterEntryPoint can likewise regenerate the job graph from this metadata.

In session mode, multiple jobs can be submitted to the job graph store within the job master through the RestClusterClient. To make the metadata fault tolerant, job graphs are stored in ZooKeeper in HA mode. We can leverage the same mechanism for delayed job graph generation. The benefits are:

  • The job graph stays static even after failure.

  • No job graph regeneration is needed, so failure recovery time is shorter in HA mode.

Session Cluster <Cluster Deploy> Mode

For existing session cluster deployment, the user jar has to be local. The deployment scenario is still to use the local jar to generate the job graph, upload the user jars and distributed cache files to the blob service in the application manager, and then start the job with the job graph.

Similar to the per-job cluster <cluster deploy> mode, the user jar can be provided either locally or remotely in the session cluster <cluster deploy> mode.

Jar File Upload

In the session cluster <cluster deploy> mode, the job graph is not generated on the client side, so it is impossible to upload the distributed cache files used in the user code from the client side to the application manager.


However, the user jars will still be uploaded to the application manager through the RestClusterClient in the AbstractSessionClusterExecutor.

Start job with configuration

Compared to submitting a job with a job graph in a per-job cluster, session jobs will be submitted through the RestClusterClient to the JarRunHandler. The handler accepts the jar path and program metadata as request parameters. The job graph is built within the handler, uploaded to the blob store, and submitted to the dispatcher gateway.


New or Changed Public Interfaces

To align with FLIP-73 [2], we will still use the executor model to enable the cluster deploy mode. In this mode, the user jar does not need to be used in client-side deployment scenarios, so CliFrontend just needs to skip the PackagedProgram build step and use the right executor from the service provider to start the submission process. After discussion with the community, we agreed on using a different interface for cluster mode job submission; both the per-job cluster and the session cluster will be different implementations of this interface. Similar to executors, we will use SPI to discover the concrete deployer implementation according to the configuration. But unlike the executor, which runs in the execution environment, the new deploy function will be used directly in CliFrontend for submitting cluster deploy mode jobs. Note: user jars are passed through PipelineOptions.JARS.

@Internal
public interface ClusterModeDeployer {

    CompletableFuture<JobClient> deploy(final Configuration configuration) throws Exception;
}

For the session cluster, we need to upload the user jars first and then call the JarRunHandler to start a job from a particular jar path. Thus, we need to add a new method to ClusterClient as below.

public interface ClusterClient<T> extends AutoCloseable {

    /**
     * Submit a job to the cluster via configuration (program metadata) and user JARs.
     *
     * @param configuration configuration with program metadata
     * @return {@link JobID} of the submitted job
     */
    CompletableFuture<JobID> submitJob(final Configuration configuration);
}

Migration Plan and Compatibility

This FLIP proposes an additional deployment option for Flink users; there is no impact on existing Flink users and no breaking change to existing functionality. Thus, no migration is needed.

User Experience Changes

After we finish this feature, the job graph will be compiled in the application master, which means that users cannot easily get exception messages synchronously in the job client if there are problems during job graph compilation (especially for platform users), such as an incorrect resource path or a problem in the user program itself. This is very different from client mode.


How to improve the user experience?


Indeed, if the deploy-mode is cluster, it may be less convenient for user debugging. Different clusters offer different ways to debug, for example using 'yarn logs' or 'kubectl logs' to get the JobManager logs. We could also consider surfacing the exception to the client via REST, though we are not sure whether this is achievable. Compared to the client deploy-mode, this is admittedly a regression in user experience. We will try to add more description about it in the documentation.


Implementation Plan

Phase I (Cluster model for per job cluster)

The implementation steps of this proposal include

  • Change the Executor interface by adding a new interface for cluster mode

  • Change cluster mode handling in CliFrontend 

  • Change Yarn Cluster Descriptor to handle the cluster model cleanly

  • Add Kubernetes per job deployment solution on top of new executor api.

Phase II (Cluster model for session cluster)

  • Implement the new execute function in AbstractSessionClusterExecutor

  • Enable the cluster deploy mode for session cluster in CliFrontEnd

Prioritized Alternatives


During the discussion on the mailing list and in the community, Tison proposed using a client to submit the job to the "local" cluster, as the client and dispatcher are in the same process. Below is our consideration.


The MiniDispatcher is required for the per-job cluster by the design of FLIP-6, and it needs a job graph in its constructor, so we think using ClasspathJobGraphRetriever is more straightforward. Compared to option 2, the local client submission solution needs to start a dispatcher, then run the user program and use the JobClient in the execution environment to submit a job graph. It would need more engineering effort and code structure changes.

   

To reduce the implementation overhead and narrow down the scope of this FLIP, we start with the class first, which is easier for the community to accept. We could also start a discussion in a separate thread for option 2.

References

[1] FLIP-6: Flink Deployment and Process Model - Standalone, Yarn, Mesos, Kubernetes, etc.

[2] FLIP-73: Introducing Executors for job submission

 
