The FlinkProcessor does feature ETL using Flink as the compute engine. In the following sections we describe the deployment modes supported by FlinkProcessor and the configuration keys accepted by each mode.
Supported Flink versions:
- Flink 1.16
The Flink processor runs the Flink job in one of the following deployment modes:
- Command-Line mode. This mode should be used if you want to run the FeatHub program using Flink's command-line interface.
- Session mode. This mode should be used if you want to run the FeatHub program in the Flink session mode. This mode runs Flink jobs in an already running Flink cluster.
- Kubernetes Application mode. This mode should be used if you want to run the FeatHub program in the Flink application mode on a Kubernetes cluster. This mode creates a dedicated Flink cluster and runs Flink jobs in this cluster.
Session mode is the default mode. Users can specify the deployment mode with the `deployment_mode` configuration.
When running in CLI mode, FeatHub itself will not submit the Flink job to a Flink cluster. Users need to manually submit the FeatHub job as a Flink job using the Flink CLI tool.
A quickstart of how to submit a simple FeatHub job with CLI mode to a standalone Flink cluster can be found in this document.
Session mode assumes that a Flink cluster is already running and submits the Flink job to that cluster. Users need to set the `master` configuration to the address and port where the JobManager runs.
As a special case, if `master` is set to "local", FlinkProcessor sets up a Flink MiniCluster to execute the submitted job. This is useful for development and proofs of concept.
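As a rough illustration, the two ways of setting `master` might look like the configuration dicts below. Only the `deployment_mode` and `master` keys come from this document; the helper `session_config` is hypothetical and not part of FeatHub, and `localhost:8081` is a placeholder address.

```python
from typing import Dict


def session_config(master: str) -> Dict[str, str]:
    """Build a session-mode config dict for FlinkProcessor (hypothetical helper)."""
    return {
        "deployment_mode": "session",  # session is also the default when omitted
        "master": master,
    }


# Connect to an already running Flink cluster; host and port are placeholders.
remote_config = session_config("localhost:8081")

# Ask FlinkProcessor to start a Flink MiniCluster for development.
local_config = session_config("local")
```

Keeping the two variants side by side makes it easy to switch between a local MiniCluster and a shared cluster by changing a single value.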
The advantage of session mode is that you do not pay the overhead of spinning up a Flink cluster for every submitted job. Users can also convert features to a Pandas DataFrame and use libraries from the pandas ecosystem, which makes session mode convenient for local testing and prototyping. The disadvantage is that session mode does not isolate resources between jobs, so one misbehaving job can affect other jobs. You can refer to the Flink Docs for an explanation of session mode.
A quickstart of how to submit a simple FeatHub job with session mode to a standalone Flink cluster can be found in this document.
When running in Kubernetes Application mode, a dedicated Flink cluster is created in the Kubernetes cluster for each Flink job. This provides better resource isolation than session mode. You can refer to the Flink Docs for an explanation of application mode.
Note: `Table#to_pandas` is not supported in Kubernetes Application mode.
You can refer to the Flink Docs for more explanation of Kubernetes Application mode.
To run the Flink job in Kubernetes Application mode, a Docker image is required.
FeatHub provides a base Docker image for running the Flink job that computes the FeatHub features. Users can modify the Dockerfile to customize the image further. See here to learn how to customize the image. Then use the following command to build the image:
$ bash tools/build-feathub-flink-image.sh
The script builds the image with the tag "feathub:latest". You can rename the tag of the image. After that, push the image to a Docker image registry from which your Kubernetes cluster can pull it, and specify the image with the `kubernetes.image` configuration.
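A minimal sketch of how the pushed image could be referenced in the processor configuration is shown below. The registry name `registry.example.com/my-team` and the helper `pushed_image_name` are hypothetical. The `deployment_mode` string used for Kubernetes Application mode is an assumption, since this section does not spell out the accepted value.

```python
# Hypothetical registry path; replace it with your own registry.
REGISTRY = "registry.example.com/my-team"


def pushed_image_name(registry: str, tag: str = "feathub:latest") -> str:
    """Return the full image reference after re-tagging the image for a registry."""
    return f"{registry.rstrip('/')}/{tag}"


k8s_config = {
    # Assumed value: the exact string for this mode is not given in this section.
    "deployment_mode": "kubernetes-application",
    # Image that the Kubernetes cluster pulls to run the Flink job.
    "kubernetes.image": pushed_image_name(REGISTRY),
}
```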
In the following we describe the configuration keys accepted by the configuration dict passed to the FlinkProcessor. Note that the accepted configuration keys depend on the deployment_mode of the FlinkProcessor.
These are the configuration keys accepted by all deployment modes.
Key | Required | Default | Type | Description |
---|---|---|---|---|
deployment_mode | optional | session | String | The Flink job deployment mode. Accepted values include "cli" and "session". |
native.* | optional | (none) | String | Any key with the "native" prefix will be forwarded to the Flink job config after the "native" prefix is removed. For example, if the processor config has an entry "native.parallelism.default: 2", then the Flink job config will have an entry "parallelism.default: 2". |
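The prefix handling described for `native.*` keys can be sketched as follows. This is an illustration of the documented behavior, not FeatHub's actual implementation; the function name `extract_native_config` is hypothetical.

```python
from typing import Any, Dict

NATIVE_PREFIX = "native."


def extract_native_config(processor_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the entries that would be forwarded to the Flink job config."""
    return {
        key[len(NATIVE_PREFIX):]: value  # drop the "native." prefix
        for key, value in processor_config.items()
        if key.startswith(NATIVE_PREFIX)
    }


processor_config = {
    "deployment_mode": "session",
    "master": "local",
    "native.parallelism.default": 2,
}
flink_job_config = extract_native_config(processor_config)
print(flink_job_config)  # {'parallelism.default': 2}
```

Keys without the prefix, such as `master`, stay in the processor config and are not forwarded.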
These are the extra configuration keys accepted when deployment_mode = "session":
Key | Required | Default | Type | Description |
---|---|---|---|---|
master | required | (none) | String | The Flink JobManager URL to connect to. |
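Because `master` is required in session mode, it can help to check a configuration dict before passing it to the processor. The sketch below is one way to do that under stated assumptions: the helper `missing_required_keys` is hypothetical, and it only covers session mode, the one mode whose extra keys are listed in this section.

```python
from typing import Any, Dict, List

# Required keys per deployment mode, taken from the table above.
# Only session mode is covered by this sketch.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "session": ["master"],
}


def missing_required_keys(config: Dict[str, Any]) -> List[str]:
    """List the required keys that are absent from a session-mode config."""
    mode = config.get("deployment_mode", "session")  # session is the default
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"deployment mode not covered by this sketch: {mode}")
    return [key for key in REQUIRED_KEYS[mode] if key not in config]


print(missing_required_keys({}))                   # ['master']
print(missing_required_keys({"master": "local"}))  # []
```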