Data Exchange between runtime native components #2048
akchinSTC started this conversation in Pipeline Editor / Kubeflow Pipelines + Apache Airflow
Motivation
Currently, Elyra supports a handful of sample components from Apache Airflow and Kubeflow Pipelines. These components demonstrate Elyra's ability to use native concepts from each orchestrator; however, a key portion of their functionality is missing, notably the ability to pass data and/or parameters from one component/operator to another via inputs and outputs.
Considerations
We want to limit the scope of this issue to just the exchange of data between runtime-native components. That is, for the time being, support data exchange from Airflow operators to Airflow operators and from KFP components to KFP components.
We support both Apache Airflow and Kubeflow Pipelines, but the two runtimes have very different ways of defining inputs and outputs.
Apache Airflow
Apache Airflow uses the concept of XComs, or cross-communications. XComs are small amounts of data shared between tasks (nodes). The data is represented as a key-value pair, with the key being a string and the value anything that is JSON-serializable or picklable (via pickle). XComs can be pushed and pulled between tasks and are, by default, scoped to the DAG run (pipeline run).
XComs are built into the Airflow BaseOperator, so all operators inherit them; they are accessed via the task_instance (ti) object and the xcom_push and xcom_pull helper methods.
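As a minimal sketch of the push/pull flow described above (assuming Airflow 2.x; the DAG id, task ids, and the `row_count` key are all illustrative, not part of any Elyra design):

```python
# Hypothetical DAG illustrating XCom push/pull between two tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def producer(ti):
    # Push a small, JSON-serializable value under an explicit key.
    ti.xcom_push(key="row_count", value=42)


def consumer(ti):
    # Pull the value pushed by the upstream "produce" task.
    row_count = ti.xcom_pull(task_ids="produce", key="row_count")
    print(f"received {row_count} rows")


with DAG("xcom_demo", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    produce = PythonOperator(task_id="produce", python_callable=producer)
    consume = PythonOperator(task_id="consume", python_callable=consumer)
    produce >> consume  # consumer runs after producer within the same DAG run
```

Because XComs are scoped to the DAG run, the pull in `consumer` only sees values pushed during the same pipeline run.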
Limitations:
Note that there are size limitations on the amount of data that can be passed via XComs. Best practices suggest that objects up to a few MB are fine to pass via XComs, but anything larger should be handled by file-path reference (volumes, S3).
Resources:
A good guide: https://marclamberti.com/blog/airflow-xcom/
Kubeflow Pipelines
Elyra uses KFP component definitions to determine how it handles inputs and outputs and how data is shared. Inputs and outputs are specified in the component definition under their respective names and are then referenced in the implementation section with placeholders (inputPath, inputValue, outputPath) that describe how each argument should be processed: either by reference (*Path) or by value (*Value).
Limitations:
Best practices indicate that users should limit the amount of data passed by value to 200KB per pipeline run.
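To make the placeholder distinction concrete, here is a minimal, hypothetical component definition (the component name, image, and command are illustrative only):

```yaml
# Hypothetical KFP component definition showing the three placeholder kinds.
name: Word count
inputs:
- {name: text_path, type: String}   # consumed by reference (*Path)
- {name: delimiter, type: String}   # consumed by value (*Value)
outputs:
- {name: count_path, type: String}  # produced by reference (*Path)
implementation:
  container:
    image: python:3.9
    command:
    - sh
    - -c
    - |
      mkdir -p "$(dirname "$2")"
      tr "$1" '\n' < "$0" | wc -l > "$2"
    - {inputPath: text_path}   # materialized as a file path on disk
    - {inputValue: delimiter}  # substituted inline as a literal value
    - {outputPath: count_path} # path where the component must write its output
```

The inputValue placeholder is what the 200 KB per-run limit applies to; larger artifacts should flow through inputPath/outputPath so the pipeline backend moves them as files.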
Envisioned workflow