kedro-airflow: revise Airflow deployment manual #605

Closed
DimedS opened this issue Mar 12, 2024 · 3 comments · Fixed by kedro-org/kedro#3792
@DimedS
Member

DimedS commented Mar 12, 2024

Description

The Airflow Deployment Manual appears to require updates:

  1. The manual should be fully operational, with all outdated information removed to prevent confusion.
  2. Future of the kedro-airflow-k8s plugin: We need to decide whether to keep recommending the kedro-airflow-k8s plugin, especially since it is noted to be compatible only with Kedro versions below 0.18.
  3. My suggestion is to move away from recommending the astro-airflow-iris starter, considering it's outdated and not specifically required for running Kedro projects on Airflow with Astronomer. It may be clearer for users to start with the standard spaceflights-pandas starter available through the kedro new command with options [1-5], ensuring a more streamlined and up-to-date starting point.
  4. Current Strategy:

The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an Airflow task while the whole pipeline is converted into a DAG for orchestration purpose. This approach mirrors the principles of running Kedro in a distributed environment.

should be discussed and confirmed, as it has several drawbacks regarding the approach of running a new Kedro session for each node:

  • it can be time-consuming
  • it does not accommodate MemoryDatasets, so every intermediate dataset has to be declared and persisted in the DataCatalog. This limitation should be stated explicitly in the docs so that it is clear to users.

If the strategy is confirmed, it might be advantageous to enhance the kedro-airflow plugin so that it not only generates the DAG but also creates an Airflow configuration folder, including a tailored version of config.yml that incorporates all MemoryDatasets.
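
To illustrate the MemoryDataset limitation, here is a minimal sketch in plain Python of how implicit in-memory datasets could be found and given persistent catalog entries. It is not the kedro-airflow implementation. The node names, dataset names, target folder, and the choice of `pickle.PickleDataset` are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical pipeline: node name -> the datasets it reads and writes.
NODES: Dict[str, Dict[str, List[str]]] = {
    "clean": {"inputs": ["raw"], "outputs": ["cleaned"]},
    "featurize": {"inputs": ["cleaned"], "outputs": ["features"]},
    "train": {"inputs": ["features"], "outputs": ["model"]},
}

# Hypothetical set of datasets already declared in the catalog.
CATALOG: Set[str] = {"raw", "model"}


def find_implicit_memory_datasets(
    nodes: Dict[str, Dict[str, List[str]]], catalog: Set[str]
) -> List[str]:
    """Return datasets passed between nodes that have no catalog entry.

    When every node runs in its own session, these datasets are lost
    between tasks unless they are persisted somewhere.
    """
    produced = {name for node in nodes.values() for name in node["outputs"]}
    consumed = {name for node in nodes.values() for name in node["inputs"]}
    return sorted((produced & consumed) - catalog)


def persistent_entries(
    names: List[str], base_dir: str = "data/02_intermediate"
) -> Dict[str, Dict[str, str]]:
    """Build catalog-style entries for the given dataset names.

    The dataset type and the folder are assumptions for illustration.
    """
    return {
        name: {"type": "pickle.PickleDataset", "filepath": f"{base_dir}/{name}.pkl"}
        for name in names
    }


missing = find_implicit_memory_datasets(NODES, CATALOG)
entries = persistent_entries(missing)
for name, entry in entries.items():
    print(name, entry)
```

In this example `cleaned` and `features` are reported, because they are produced by one node, consumed by another, and absent from the catalog.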

  5. Given the known issues with default logging via the Rich library, the manual should include a section on switching from Rich to console logging, with detailed instructions for making the change so that the DAG runs correctly.
  6. Guidance on automatically transferring files from the Airflow container back to the user's local folder may benefit users.
  7. The manual should include a section outlining the steps for deploying a Kedro Airflow project to cloud services such as AWS, Azure, Google Cloud, and AstroCloud.
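
For the point about switching from Rich to console logging, a minimal sketch of a console-only configuration using the standard library could look like the following. The handler and formatter names are illustrative. It is an assumption here that an equivalent YAML file would be supplied to Kedro as a custom logging configuration, for example through the `KEDRO_LOGGING_CONFIG` environment variable.

```python
import logging
import logging.config

# Hypothetical console-only logging configuration. It uses the standard
# library StreamHandler and contains no Rich handler.
CONSOLE_LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
    },
    "loggers": {"kedro": {"level": "INFO"}},
    "root": {"handlers": ["console"]},
}

# Apply the configuration and emit one message through the console handler.
logging.config.dictConfig(CONSOLE_LOGGING)
logging.getLogger("kedro").info("Logging through the plain console handler")
```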
@noklam
Contributor

noklam commented Mar 25, 2024

The outcome of this ticket is to fix 1, 2, 3, 5, and 6. The idea is to improve the existing documentation and make it run properly. The other points mentioned here will be tackled in separate tickets.

@DimedS
Member Author

DimedS commented Apr 15, 2024

Comments 1, 3, 5, and 6 have been addressed in kedro-org/kedro#3792. Follow-up tickets have been created for comments 2 and 7 at #652 and #651 respectively. Regarding the changes in deployment strategy proposed by comment 4, I currently do not have any ideas for modifications. However, we should discuss the deployment strategy during a Technical Design session to share knowledge, explore possibilities, and confirm the current strategy.

@astrojuanlu
Member

astrojuanlu commented Apr 15, 2024

Ticket for 4 probably kedro-org/kedro#2058?
