Support excluding packages from the build #95

Open
neilmehta24 opened this issue Dec 2, 2024 · 4 comments
Labels
Category: Enhancement New feature or request

Comments

@neilmehta24
Member

Python packages can transitively pull in many packages that are not strictly required to run an application. Including these packages in the published layer means that our users have to download large packages they don't necessarily need. Layers should be able to specify in their venvstacks.toml configuration which packages should not be installed as part of the build or publish step. Ideally, excluding a package would also leave out any packages that are required only by the excluded package.

A package that is marked as excluded should not affect environment solving; it should only affect the output directory. This ensures that an application layer doesn't install a package that was excluded by the framework layer: the package still belongs to the framework layer, but is not emitted anywhere in the build directory.

The overall goal of this is to reduce the size of layers deployed to our users' machines.

@neilmehta24 neilmehta24 added the Category: Enhancement New feature or request label Dec 2, 2024
@ncoghlan
Collaborator

ncoghlan commented Dec 3, 2024

That makes sense. Longer term, we could even do something similar to what treeshaker does: https://pypi.org/project/treeshaker/ (treeshaker itself wouldn't be the right solution, but something along those lines should be feasible, with application layer shaking based on the launch module contents, and framework and runtime layer shaking based on the full set of defined application layers that depend on them).
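
Whatever form application layer shaking takes, it would need at least the set of top-level names that the launch module imports. A minimal sketch of that first step using the standard `ast` module follows. The function name `top_level_imports` is hypothetical, this makes no claim about how treeshaker itself works, and dynamic imports or imports made by the imported modules are not followed:

```python
import ast


def top_level_imports(source: str) -> set[str]:
    """Return the top-level module names imported by the given source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" contributes the top-level name "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Absolute "from a.b import c" contributes "a";
            # relative imports (level > 0) are skipped
            names.add(node.module.split(".")[0])
    return names
```

Mapping these import names back to distribution packages is a separate step, and it is subject to the same import-name-vs-distribution-name ambiguity as manual exclusions.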

For manual exclusions, pack_venv already filters out some files during the export process with shutil.ignore_patterns.
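
As a rough illustration of that mechanism (not the actual pack_venv code), `shutil.ignore_patterns` builds a callable that `shutil.copytree` consults for each directory it copies. The function name `export_tree` and any patterns passed to it are assumptions chosen for the example:

```python
import shutil
from pathlib import Path


def export_tree(src: Path, dest: Path, *patterns: str) -> None:
    """Copy src to dest, skipping entries whose names match any glob pattern."""
    # ignore_patterns matches bare entry names at every directory level,
    # not paths relative to src
    shutil.copytree(src, dest, ignore=shutil.ignore_patterns(*patterns))
```

Because the patterns match bare names at any depth, `export_tree(env, out, "foo")` would also drop a nested `bar/foo` directory, which is one reason a dedicated filter may be preferable for exclusions.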

Due to the import-package-vs-dist-package ambiguity in Python, as well as the import-module-vs-import-package situation, I'm leaning towards making this two separate settings on the layer:

  • exclude_import_name: rather than using the result of ignore_patterns directly, env exports will have a dedicated copytree filtering function that excludes any directory or file in the site-packages folder whose name (ignoring file extensions) matches the given name. Distributions will still claim to be installed in the deployed environment, but some of their files will be missing.
  • exclude_dist_package: this would run importlib.metadata.files in the build environment for each of the given distribution names, and use that to get a full list of files to be excluded from the export process. This would also exclude the installation metadata, so the distribution won't even claim to be installed in the deployed environment.
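
The two settings could be sketched as `copytree` ignore callables along these lines. Everything here is hypothetical: the function names, passing the build environment's site-packages path directly, and the exact matching rules are assumptions for illustration, not venvstacks code. The second function uses `importlib.metadata.distributions(path=...)` rather than `importlib.metadata.files` so that it inspects the given site-packages directory instead of the running interpreter's environment:

```python
import os
import shutil
from importlib.metadata import distributions
from pathlib import Path
from typing import Callable, Iterable

IgnoreFn = Callable[[str, list[str]], set[str]]


def ignore_import_names(site_packages: Path, names: Iterable[str]) -> IgnoreFn:
    """Skip top-level site-packages entries whose import name is listed."""
    root = os.path.abspath(site_packages)
    wanted = set(names)

    def _ignore(directory: str, entries: list[str]) -> set[str]:
        if os.path.abspath(directory) != root:
            return set()  # only filter at the top level of site-packages
        skipped = set()
        for entry in entries:
            is_dir = os.path.isdir(os.path.join(directory, entry))
            # Directories match on their full name; files match on the text
            # before the first dot, so "foo.py" and "foo.abi3.so" map to "foo"
            import_name = entry if is_dir else entry.split(".", 1)[0]
            if import_name in wanted:
                skipped.add(entry)
        return skipped

    return _ignore


def ignore_dist_packages(site_packages: Path, dist_names: Iterable[str]) -> IgnoreFn:
    """Skip every file recorded as installed by the listed distributions."""
    root = os.path.abspath(site_packages)
    excluded: set[str] = set()
    for dist_name in dist_names:
        for dist in distributions(name=dist_name, path=[root]):
            # dist.files is None when no file listing (RECORD) is available
            for package_path in dist.files or []:
                excluded.add(os.path.normpath(os.path.join(root, str(package_path))))

    def _ignore(directory: str, entries: list[str]) -> set[str]:
        base = os.path.abspath(directory)
        return {entry for entry in entries if os.path.join(base, entry) in excluded}

    return _ignore


def export_site_packages(site_packages: Path, dest: Path, ignore: IgnoreFn) -> None:
    """Copy site-packages to dest, applying one of the filters above."""
    shutil.copytree(site_packages, dest, ignore=ignore)
```

Note that `ignore_dist_packages` in this sketch filters files only, so an excluded distribution's directories are still created (empty) in the destination. Files recorded outside site-packages, such as entry point scripts, would need separate handling.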

@neilmehta24
Member Author

exclude_dist_package seems like the one most relevant to us right now.

@neilmehta24
Member Author

exclude_dist_package gets us 90% of what we need. But how difficult is it to determine whether other dependencies can be transitively excluded? For example, if we say exclude_dist_package=["X"] and package "Y" was only installed because package "X" requires it, can we exclude package "Y" too?

@ncoghlan ncoghlan changed the title from "The ability to exclude packages from the build" to "Support excluding packages from the build" on Dec 6, 2024
@ncoghlan
Collaborator

Excluding individual packages is potentially feasible with the current locking design, since that doesn't affect the lock resolution process; it just intentionally makes the shipped archives not quite match the contents claimed in their lock files. There's potential fragility in actually doing that, but I can see the value, since some projects are really liberal with what they classify as a "requirement" (if someone is pulling in the full Jupyter notebook runtime as a dependency because some of their examples are Jupyter notebooks, that doesn't make sense when people are just trying to use their Python library in an application).

Excluding "X and its dependencies, unless something else depends on them", on the other hand, would need to be a resolver level operation, since you need a resolver to answer both "What does X depend on?" and "Does anything other than X depend on those packages?"

Even with a resolver, it's still not an easy problem, since the presence or absence of X may make more changes to the dependency graph than just adding or removing packages outright - it may change the exact version pinned, it may make the dependency platform or Python version dependent, and more.
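
Setting those complications aside, the core question ("is Y needed by anything other than X?") is a reachability check on the resolved dependency graph. A minimal sketch, assuming the graph is already available as a plain mapping from each package name to its direct dependencies. The function name `only_needed_by` and the input format are hypothetical, and environment markers, extras, and version pins are not modelled:

```python
def only_needed_by(
    graph: dict[str, set[str]], roots: set[str], excluded: set[str]
) -> set[str]:
    """Return the packages that become unreachable once `excluded` is removed.

    The result includes the excluded packages themselves, provided they were
    reachable from the roots in the first place.
    """

    def reachable(blocked: set[str]) -> set[str]:
        seen: set[str] = set()
        stack = [name for name in roots if name not in blocked]
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            # Follow direct dependencies, never passing through blocked packages
            stack.extend(dep for dep in graph.get(name, ()) if dep not in blocked)
        return seen

    return reachable(set()) - reachable(excluded)


graph = {"app": {"X", "Z"}, "X": {"Y", "Z"}, "Y": set(), "Z": set()}
print(sorted(only_needed_by(graph, {"app"}, {"X"})))  # prints ['X', 'Y']
```

Here "Z" is kept because "app" also depends on it directly, while "Y" goes away with "X".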

There's an open proposal (https://peps.python.org/pep-0771/) that will allow projects to specify default extras, which may help clean up some of these messy dependency trees at their source.
