
Required packages should be available with amphi directly #227

Open
simonaubertbd opened this issue Jan 20, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@simonaubertbd
Contributor

Hello

This follows a conversation with @tgourdel. As of today, when you run a pipeline that needs libraries not packaged with Python, such as duckdb, Amphi downloads them automatically.

From my perspective, this leads to serious issues:

1/ It requires an internet connection at runtime, which is not always possible (imagine an isolated, locked-down server).
2/ Execution speed is affected: we lose a few seconds each time.
3/ Version control: we don't control which library version gets installed, so a new release can break the workflow.

(However, we can still keep the ability to download on the fly for custom components, tests, etc.)
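To illustrate point 3, here is a minimal sketch of how installed versions could be compared against expected ones before a pipeline runs. It only uses the standard library (`importlib.metadata`, Python 3.8+). The function name `check_pins` and the example pins are hypothetical and are not part of Amphi.

```python
from importlib import metadata


def check_pins(pins):
    """Return {package: (expected, installed)} for every package whose
    installed version differs from the expected one (None if not installed)."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    # Hypothetical pins, for illustration only; not versions Amphi requires.
    example_pins = {"duckdb": "1.1.3", "sqlalchemy": "2.0.36"}
    for pkg, (expected, installed) in check_pins(example_pins).items():
        print(f"{pkg}: expected {expected}, found {installed}")
```

An empty result means every listed package is installed at exactly the expected version.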

Best regards,

Simon

@tgourdel
Contributor

Hi @simonaubertbd! Thanks for bringing this here.

Being able to run Amphi in an environment without internet access is something that should be possible and relatively easy.

A few notes to continue the conversation:

  • It is already possible to set up Amphi so that it runs without internet access. Today this is a manual process, but not an especially complex one. There are multiple ways to achieve it; the basic approach is to create a Python virtual environment containing the amphi-etl package plus the packages needed to run all the components, such as DuckDB and a few others like sqlalchemy. Building a Docker image is another solution.
  • What's missing today is probably an automated way to get the list of packages that are downloaded on the fly. This needs to be documented.
  • On execution speed: the download happens only once, the first time you execute a component whose package is not yet installed. After that, no download is triggered.
  • We have the option to add "mandatory" packages that are installed along with amphi-etl. This of course increases both the installation time and the footprint of Amphi on the machine. We should probably find a balance between the core libraries that are needed and the ones that are rarely used and not essential. DuckDB could be added to the list. Today I decided to require only pandas (essential for the core components to work), as you can see here:
    install_requires=[
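
Until such a list is documented, a preflight check along the following lines could be run before moving an environment offline. This is only a sketch: the module list is an assumption built from the packages named in this thread, not the actual list of what Amphi downloads on the fly, and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Assumed list, based only on the packages mentioned in this thread.
# It is NOT the authoritative list of what Amphi installs on the fly.
REQUIRED_MODULES = ["pandas", "duckdb", "sqlalchemy"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Install before going offline:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```

Run inside the virtual environment or Docker image, it reports anything that would otherwise trigger a download at runtime.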

@tgourdel added the enhancement (New feature or request) label on Jan 20, 2025