Data Exchange between runtime native components #2048
akchinSTC started this conversation in Pipeline Editor / Kubeflow Pipelines + Apache Airflow
Motivation
Currently, Elyra supports a handful of sample components from Apache Airflow and Kubeflow Pipelines. These components demonstrate Elyra's ability to use native concepts from each orchestrator; however, a key portion of their functionality is missing, notably the ability to pass data and/or parameters from one component/operator to another via inputs and outputs.
Considerations
We want to limit the scope of this issue to just the exchange of data between runtime-native components. That is, for the time being, support data exchange from Airflow operators to Airflow operators and from KFP components to KFP components.
We support both Apache Airflow and Kubeflow Pipelines, but the two runtimes have very different ways of defining inputs and outputs.
Apache Airflow
Apache Airflow uses the concept of XComs, or cross-communications. XComs are small amounts of data shared between tasks (nodes). The data is represented as a key-value pair, with the key being a string and the value anything that is JSON-serializable or picklable (via pickle). XComs can be pushed and pulled between tasks and are, by default, scoped to the DAG run (pipeline run).
XComs are built into the Airflow BaseOperator, so all operators inherit them; they are accessed via the task_instance (ti) object and the xcom_push and xcom_pull helper methods.
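As a minimal sketch of the push/pull flow described above (assuming Airflow 2.x; the DAG id, task ids, and the `row_count` key are all illustrative, not part of any Elyra design):

```python
# Hypothetical DAG illustrating XCom push/pull between two tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def producer(ti):
    # Push a small, JSON-serializable value under an explicit key.
    ti.xcom_push(key="row_count", value=42)


def consumer(ti):
    # Pull the value pushed by the upstream "produce" task.
    row_count = ti.xcom_pull(task_ids="produce", key="row_count")
    print(f"received {row_count} rows")


with DAG("xcom_demo", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    produce = PythonOperator(task_id="produce", python_callable=producer)
    consume = PythonOperator(task_id="consume", python_callable=consumer)
    produce >> consume  # consumer runs after producer within the same DAG run
```

Because XComs are scoped to the DAG run, the pull in `consumer` only sees values pushed during the same pipeline run.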
Limitations:
Note that there are size limitations on the amount of data that can be passed via XComs. Best practices suggest that objects up to a few MB are fine to pass via XComs, but anything larger should be handled by file-path reference (volumes, S3).
Resources:
A good guide: https://marclamberti.com/blog/airflow-xcom/
Kubeflow Pipelines
Elyra uses KFP component definitions to determine how it handles inputs and outputs and how data is shared. Inputs and outputs are specified in the component definition under their respective names and are then referenced in the implementation section with placeholders (inputPath, inputValue, outputPath) that describe how each argument should be processed: either by reference (*Path) or by value (*Value).
Limitations:
Best practices indicate that users should limit the amount of data passed by value to 200KB per pipeline run.
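To make the placeholder distinction concrete, here is a minimal, hypothetical component definition (the component name, image, and command are illustrative only):

```yaml
# Hypothetical KFP component definition showing the three placeholder kinds.
name: Word count
inputs:
- {name: text_path, type: String}   # consumed by reference (*Path)
- {name: delimiter, type: String}   # consumed by value (*Value)
outputs:
- {name: count_path, type: String}  # produced by reference (*Path)
implementation:
  container:
    image: python:3.9
    command:
    - sh
    - -c
    - |
      mkdir -p "$(dirname "$2")"
      tr "$1" '\n' < "$0" | wc -l > "$2"
    - {inputPath: text_path}   # materialized as a file path on disk
    - {inputValue: delimiter}  # substituted inline as a literal value
    - {outputPath: count_path} # path where the component must write its output
```

The inputValue placeholder is what the 200 KB per-run limit applies to; larger artifacts should flow through inputPath/outputPath so the pipeline backend moves them as files.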
Envisioned workflow