[rfc] Homebrew packaging for python resources #157500

Closed
chenrui333 opened this issue Dec 15, 2023 · 37 comments
Labels
maintainer feedback Additional maintainers' opinions may be needed python Python use is a significant feature of the PR or issue python-vendoring Part of the Python resource vendoring project

Comments

@chenrui333
Member

chenrui333 commented Dec 15, 2023

EDIT: See #157500 (comment) for final decision.


Homebrew packaging for python resources

Context

In light of recent discussions, there's a growing concern regarding the separation of Python resources into distinct Homebrew formulae. A key consideration is avoiding the duplication of efforts, particularly for packages readily installable via pip install. Additionally, there's a potential issue of brew link errors, especially with Python 3.11 formulae (py3.10 support was removed recently). This document aims to collate various ideas and lay a foundation for the management of Python formulae in Homebrew.

Problem Statement

The current approach to handling Python resources in Homebrew has led to several challenges:

  • Extended Build Times: many Python formulae (like dvc, dstack, semgrep) have extensive resource lists, leading to prolonged build times.

  • Security Vulnerabilities: While there are critical CVE issues, we also leverage pip-audit to monitor and update resources for security vulnerabilities, necessitating frequent revision bumps.

  • Resource Sharing: Several resources are extensively shared across homebrew-core. Isolating these resources into individual formulae could reduce duplications and enhance caching efficiency (like cffi, six, pygments, python-cryptography)

  • Rust build dependencies: Some formulae, like python-cryptography, have a Rust build dependency. Moving them to separate formulae would remove the need for Rust as a build dependency of every formula that uses them.

  • Default Python Version Update: Homebrew aligns with the annual release cycle of Python, updating bottles to support new major versions each year. Once the migration to the latest major Python version is complete, the default Python version in Homebrew will be updated to reflect the most recent Python 3 release.

Challenges

Our goal is to optimize Python package management in Homebrew without replicating the functionality of existing Python package repositories. Key challenges include:

Naming Conventions

Historically, Homebrew used a language-agnostic naming approach for Python libraries (like cffi, six). However, Homebrew Python libraries are not necessarily intended for end users, so it might be better to use a `python-*` prefix instead of language-agnostic names. It is also worth noting that many Python libraries ship CLI tools as well; for those, we added caveats to clarify the Python dependencies.

Proposed Solutions and Discussion

This section will invite suggestions and discuss potential strategies to address the outlined challenges, aiming for efficient, secure, and user-friendly management of Python resources in Homebrew.

Popular formula

This category is trying to include the formulae which are considered popular within the homebrew-core project and keep them for build efficiencies. 

  • python-cryptography
  • python-requests

Formulae for python builds

  • python-build
  • python-flit-core
  • python-setuptools
  • python-hatchling
  • meson-python

Maybe combine common extensions into these formulae? For example, the python-setuptools formula could include both the “setuptools” and “setuptools-scm” Python modules, and the python-hatchling formula could include “hatchling”, “hatch-vcs”, and “hatch-fancy-pypi-readme” Python modules.

Decision

Revert some existing python-* formulae

For formulae that do not require much build efforts and are only used by few formulae, we are going to deprecate/remove them from the homebrew-core repo.

@chenrui333 chenrui333 added bug Reproducible Homebrew/homebrew-core bug and removed bug Reproducible Homebrew/homebrew-core bug labels Dec 15, 2023
@fxcoudert fxcoudert added the maintainer feedback Additional maintainers' opinions may be needed label Dec 16, 2023
@fxcoudert
Member

Thanks @chenrui333 for opening this issue. I have been thinking about this topic in the past, and my opinion is currently:

  • There is no question that some formulas need to have their own formula: those with very long compile times, those that are not installable via pip alone, or those that contain more than just the Python part. We've had several in the past, and it's fine: numpy, scipy, pillow, pytorch, stuff.
  • Some build tools are probably common enough to be in that category, but I don't think “build tools” as a category needs to be treated separately.
  • There are downsides to having duplicate resources in several formulas, but they are mostly technical (build time, disk size, resource update) and in my mind, somewhat limited. The issue of resource update/vulnerability, for example, can be dealt with through technical means.
  • On the other hand, if we have many python formulas, we increase the risk of dependency hell and versioning hell. This is bad for us, because we will start to see version requirements, have to ship several versions, or switch between vendored-in and formula dependencies, and it's a lot of work.
  • For users, having some mix of python packages as formulas and some not available (only through pip) leads to confusion. We should come up with a policy that is clear and can be followed in the future.

I can think of two clear policies:

  1. the one we currently have (and we can possibly amend), which basically says we should limit the number of python formulas
  2. allowing more, which basically means we would duplicate python distribution mechanisms

@MikeMcQuaid
Member

  • Extended Build Times: many Python formulae (like dvc, dstack, semgrep) have extensive resource lists, leading to prolonged build times.

It'd be good to measure how times change before/after moving to separate formulae. Unless there are non-trivial, measurable improvements, things should not be extracted for this reason 👎🏻

  • Resource Sharing: Several resources are extensively shared across homebrew-core. Isolating these resources into individual formulae could reduce duplications and enhance caching efficiency (like cffi, six, pygments, python-cryptography)

Reducing duplication only really makes sense if there's a non-trivial process to install a resource which usually does not seem to be the case. Caching is already efficient/shared when multiple formulae or resources share the same URL so I don't think caching makes sense 👎🏻.

  • Rust build dependencies: Some formulae, like python-cryptography, have a Rust build dependency. Moving them to separate formulae would remove the need for Rust as a build dependency of every formula that uses them.

This, and other non-trivial native compilation, seems like a very good fit for separate formulae 👍🏻

  • Default Python Version Update: Homebrew aligns with the annual release cycle of Python, updating bottles to support new major versions each year. Once the migration to the latest major Python version is complete, the default Python version in Homebrew will be updated to reflect the most recent Python 3 release.

I don't understand what this means for resources, can you elaborate?

  • Avoiding redundancy with pip install for easily installable packages.

This is not a concern for applications but is a large concern for libraries.

  • Ensuring version compatibility across different formulae.

Can you elaborate on this? Not sure I understand.

This is probably a good argument for all Python libraries, except bindings, to be always keg-only.

so it might be better to use python-* prefix instead of language agnostic namings

I would rather we didn't do this because it would be a large departure from how we've done things and do things for every other language, really.

  • python-cryptography

This has a build-time rust dependency so 👍🏻 to keep it as a formula.

  • python-requests

This has no non-Python build dependencies and the installation is very simple so 👎🏻, it should not be a formula.

Revert some existing python-* formulae

For formulae that do not require much build efforts and are only used by few formulae, we are going to deprecate/remove them from the homebrew-core repo.

I think we should figure out the criteria and then revert all Python formulae that do not fit them.

  • There is no question that some formulas need to have their own formula: those with very long compile times, those that are not installable via pip alone, or those that contain more than just the Python part. We've had several in the past, and it's fine: numpy, scipy, pillow, pytorch, stuff.

Agreed 👍🏻

  1. the one we currently have (and we can possibly amend), which basically says we should limit the number of python formulas

This has been amended recently but we can amend it further if needed.

@singingwolfboy
Member

Reducing duplication only really makes sense if there's a non-trivial process to install a resource which usually does not seem to be the case. Caching is already efficient/shared when multiple formulae or resources share the same URL so I don't think caching makes sense 👎🏻.

According to this reasoning, we should also remove the six formula. This formula has been around since 2021 (added in #74909), which is much longer than the other python-* formulae; I'm not sure if it deserves special consideration or not.

@cho-m
Member

cho-m commented Dec 24, 2023

According to this reasoning, we should also remove the six formula. This formula has been around since 2021 (added in #74909), which is much longer than the other python-* formulae; I'm not sure if it deserves special consideration or not.

It may be up for removal once we finalize the criteria of valid python formulae.

Some notes on six:

  • In theory, the amount of six usage should decrease as projects finally remove remnants of Python 2. We also don't have a great way of handling the situation of outdated depends_on "six" while resource "six" can be auto-removed.
  • For single file package like six, de-duplication may only be worth it if there is a significant disk size impact:
    • The six bottle is 162.8KB installed across 3 different Pythons. Call it ~50KB per copy, since the bottle probably carries more metadata than a pip install would.
    • There are ~200 formulae that depend on six so maybe +10MB if someone installs all of them with resource "six".
    • Though someone can claim pycache creation should be accounted for which adds ~45KB so getting closer to 100KB per six usage. Overall, on most systems it shouldn't be noticeable.
  • On side note, adding six resulted in our first set of link issues, e.g.
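Returning to the disk-size estimate in the notes on six above, the arithmetic can be checked with a short sketch. All figures are the approximate ones quoted in the comment; the function names are made up for illustration, and 1 MB is taken as 1000 KB.

```python
# Back-of-the-envelope check of the six size figures quoted in the comment.
BOTTLE_KB = 162.8        # installed size of the six bottle
PYTHON_VERSIONS = 3      # Pythons covered by that bottle
DEPENDENTS = 200         # approximate number of formulae depending on six
VENDORED_KB = 50         # rounded per-copy size used in the comment
PYCACHE_KB = 45          # approximate __pycache__ overhead per copy


def per_python_kb(bottle_kb: float, versions: int) -> float:
    """Size attributable to one Python version inside the bottle."""
    return bottle_kb / versions


def total_mb(copies: int, kb_per_copy: float) -> float:
    """Total size in MB (1 MB = 1000 KB) if every dependent vendors a copy."""
    return copies * kb_per_copy / 1000


if __name__ == "__main__":
    print(f"per Python: {per_python_kb(BOTTLE_KB, PYTHON_VERSIONS):.1f} KB")
    print(f"vendored everywhere: {total_mb(DEPENDENTS, VENDORED_KB):.0f} MB")
    print(f"with __pycache__: {total_mb(DEPENDENTS, VENDORED_KB + PYCACHE_KB):.0f} MB")
```

Even the pessimistic case only applies to someone who installs every dependent formula, which supports the "shouldn't be noticeable" conclusion.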

@MikeMcQuaid
Member

According to this reasoning, we should also remove the six formula. This formula has been around since 2021 (added in #74909), which is much longer than the other python-* formulae; I'm not sure if it deserves special consideration or not.

Yes, sounds like it's worth removal.

@dtrodrigues
Member

depends_on "six" causes CI to install all 3 of Python 3.10-3.12 at build-time, even though typically only one Python version is required for a formula, which is another reason to remove six as a dependent formula.

@MikeMcQuaid
Member

We should start removing six and other formulae. Those present in https://formulae.brew.sh/analytics/install-on-request/365d/ should be deprecated. Those not present can be removed immediately. We should remove our usage of these formulae immediately.
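The rule in this comment (deprecate what shows up in the analytics, remove the rest) can be sketched as a small function. This is only an illustration: `removal_plan` and the formula names are hypothetical, and how the install-on-request list is fetched from the analytics page is deliberately left out.

```python
def removal_plan(formulae, install_on_request):
    """Split formulae into those to deprecate first and those to remove now.

    `install_on_request` is the set of names that appear in the 365-day
    install-on-request analytics; fetching that set is out of scope here.
    """
    plan = {"deprecate": [], "remove": []}
    for name in formulae:
        action = "deprecate" if name in install_on_request else "remove"
        plan[action].append(name)
    return plan


# Hypothetical inputs, for illustration only (not real analytics data).
candidates = ["python-foo", "python-bar", "python-baz"]
seen_in_analytics = {"python-foo", "python-bar"}
print(removal_plan(candidates, seen_in_analytics))
```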

@SMillerDev
Member

In the case of six, I wonder if our resource updater fails if resources that aren't needed are excluded, because without that we would never know if it was no longer required.

@iMichka
Member

iMichka commented Jan 7, 2024

I think there are many questions to answer here, I won't be able to respond to everything in one message.

Regarding six: it's used by around 200 formulae, I think we should check if we can start dropping it (or at least ask upstream to think about dropping Python 2 / six support). That's work we can start, and it does not depend on any decision we take here. I don't want to pollute the discussion with the six topic, as the issue we face here is more general.

Regarding the brew link issue: this is always going to happen with a shared /usr/local/lib/pythonX.Y/site-packages folder.
So I agree that reducing the number of packages provided by brew in there should be a goal. But maybe there is another solution.

I am wondering if we should introduce a second brew-specific site-packages folder? One that is only used for our own purposes? We would then keep /usr/local/lib/pythonX.Y/site-packages empty, leaving it for our users to populate with pip.
How hard would it be to use that solution (I did not test it, this needs some testing)?
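For reference, the standard library already has a mechanism for adding an extra site directory at runtime, `site.addsitedir`. The sketch below only demonstrates that mechanism with a temporary directory standing in for the hypothetical brew-only folder; the directory name and the `brewed_demo` module are invented, and how Homebrew formulae would opt in while the user's plain `python3` does not is exactly the open question raised here.

```python
import importlib
import site
import tempfile
from pathlib import Path


def add_private_site_dir(path: str) -> None:
    """Append `path` to sys.path and process any .pth files inside it."""
    site.addsitedir(path)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a brew-only directory; the real location is undecided.
    private_dir = Path(tmp) / "brew-site-packages"
    private_dir.mkdir()
    (private_dir / "brewed_demo.py").write_text("VALUE = 42\n")

    add_private_site_dir(str(private_dir))
    importlib.invalidate_caches()
    brewed_demo = importlib.import_module("brewed_demo")
    print(brewed_demo.VALUE)

    # Confirm the module really came from the private directory.
    loaded_from_private_dir = (
        Path(brewed_demo.__file__).resolve().parent == private_dir.resolve()
    )
```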

If I understand the problem well, there should be no issue for our users if they had used virtualenvs, with virtualenv --no-site-packages, to not include our site-packages folder?

@SMillerDev
Member

If I understand the problem well, there should be no issue for our users if they had used virtualenvs, with virtualenv --no-site-packages, to not include our site-packages folder?

That might be a solution, but in that case what are we packaging them for if not for users.

@iMichka
Member

iMichka commented Jan 8, 2024

what are we packaging them for if not for users.

Only for our internal use.

Most formulae are command line tools, and the user should not care if it's written in Python or with something else.
But we might still need to ship a brewed numpy or whatnot to make our life easier: maybe that's not a reason to pollute the site-packages folder.

My approach is of course a little bit more drastic compared to the rules we had in place before.

And maybe this change will imply getting rid of a few formulae too (through a deprecation cycle) to comply with those new rules. It's unclear though if we want to go in that direction, but I feel that mixing pip stuff with brewed stuff is going to bring endless odd bugs and issues.

Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale No recent activity label Jan 30, 2024
@woodruffw
Member

Not stale, I believe.

@github-actions github-actions bot removed the stale No recent activity label Jan 30, 2024
@MikeMcQuaid
Member

I think this kinda is stale. We're at the point now that at the AGM this should be decided and then turned into docs/PRs.

@woodruffw
Member

@iMichka @chenrui333

A summary of one of the AGM ideas we discussed: instead of separating things into new formulae like python-cryptography, we could do our own wheel builds. Then, during CI bottle build, we could access the wheel build instead of re-building from the source distribution referenced in the resource block.

@iMichka
Member

iMichka commented Feb 5, 2024

@iMichka @chenrui333

A summary of one of the AGM ideas we discussed: instead of separating things into new formulae like python-cryptography, we could do our own wheel builds. Then, during CI bottle build, we could access the wheel build instead of re-building from the source distribution referenced in the resource block.

This was a fun idea and I really liked it. I'm keeping this in a corner of my head just in case we need to revive the idea. But given the discussion we had together at the AGM, and the different scenarios we explored together: we are not going to do this. It's too complex and needs too many changes in brew.

I'll write down what we "decided" in the next message so we can review it.

@iMichka
Member

iMichka commented Feb 6, 2024

After a small workshop at the AGM, here is what we came up with, together with @woodruffw and @cho-m (and others):

  • PEP 668 will prevent pip vs brew conflicts from Python 3.12 on: we explicitly do not allow pip to install outside of a separate virtualenv. This will prevent users from installing pip-managed packages inside brew's site-packages, thus avoiding conflicts. We discussed the migration scenario from Python 3.11 to 3.12: all should be fine as these are separate formulae with separate site-packages, so making 3.12 the main Python should not be an issue.

  • We should revert back to vendoring for most of python-xxx formulae as we feel this is a lot safer regarding dependency resolution

  • We would like to keep a few packages as separate formulae: everything that depends on rust or is complex / long to compile (> 10 min?). This would be for example the case for python-cryptography, numpy, pillow ...

  • We are aware that a user might run python3 -c "import numpy" outside of a virtualenv, thus importing brewed packages: this is an edge case we have no solution for (unless filtering this out in sitecustomize.py, but that's too hacky). We do not think this should be an issue anymore because PEP 668 tells you to work inside a virtualenv.

  • Old Python formulae like numpy, scipy, pillow can stay as-is: they can be imported with python3 -c "import numpy", but again, due to PEP 668 you will probably have to set up a virtualenv, and use numpy from pip.

  • If someone really wants to break things, they can use --break-system-packages: we won't support that use case.

One last thing: we did not discuss renaming packages to add consistency (python-cryptography vs numpy for example).

Cc @woodruffw @cho-m @chenrui333

@MikeMcQuaid
Member

  • We should revert back to vendoring for most of python-xxx formulae as we feel this is a lot safer regarding dependency resolution
  • We would like to keep a few packages as separate formulae: everything that depends on rust or is complex / long to compile (> 10 min?). This would be for example the case for python-cryptography, numpy, pillow ...
  • If someone really wants to break things, they can use --break-system-packages: we won't support that use case.

👍🏻 to all of these in particular. Let's move forward with this. Thanks for shepherding @iMichka!

@alebcay
Member

alebcay commented Feb 7, 2024

We would like to keep a few packages as separate formulae: everything that depends on rust or is complex / long to compile (> 10 min?). This would be for example the case for python-cryptography, numpy, pillow ...

We added a patched/modified python-certifi formula to allow any Python-based formulae to leverage the brewed CA certs (see https://github.com/orgs/Homebrew/discussions/4691). I would like to suggest we keep it, even though building is fast.

The alternative would be to add certifi back as a resource to all Python-based formulae that need it, and also add the patch/modification pointing to brewed CA certs to all such occurrences. I am of the opinion that this would be more cumbersome than maintaining a formula for it, keeping all the brew-specific integration there.

@MikeMcQuaid
Member

We added a patched/modified python-certifi formula to allow any Python-based formulae to leverage the brewed CA certs (see https://github.com/orgs/Homebrew/discussions/4691). I would like to suggest we keep it, even though building is fast.

Yes, this type of usage seems reasonable 👍🏻
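Putting the agreed criteria together (Rust dependency, long or complex build, or brew-specific integration such as the certifi patch), a rough decision helper might look like the sketch below. The class, field names, and example attribute values are assumptions made for illustration, and the 10-minute threshold was only floated tentatively ("> 10 min?") in the thread.

```python
from dataclasses import dataclass


@dataclass
class PythonPackage:
    name: str
    needs_rust: bool = False
    build_minutes: float = 0.0
    has_brew_specific_patch: bool = False


# The "> 10 min?" threshold was floated tentatively in the discussion.
LONG_BUILD_MINUTES = 10


def keep_as_formula(pkg: PythonPackage) -> bool:
    """Return True if the package fits the criteria for a separate formula."""
    return (
        pkg.needs_rust
        or pkg.build_minutes > LONG_BUILD_MINUTES
        or pkg.has_brew_specific_patch
    )


# Attribute values below are illustrative guesses, not measurements.
examples = [
    PythonPackage("python-cryptography", needs_rust=True),
    PythonPackage("python-certifi", has_brew_specific_patch=True),
    PythonPackage("python-requests"),
]
for pkg in examples:
    verdict = "separate formula" if keep_as_formula(pkg) else "vendor as resource"
    print(f"{pkg.name}: {verdict}")
```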

@clason
Contributor

clason commented Mar 10, 2024

Something I'm missing from this discussion (and, in fact, missing period) is how this (quite breaking!) transition is communicated to the user. I've been hit several times by a package I rely on simply vanishing without explanation and having to dig through the repo here.

While I understand the rationale and accept the change, I would have expected a brew upgrade to print a prominent warning linking to a FAQ page on how to handle Python packages from now on before things go poof. For example, in $dayjob I rely on Numpy, Scipy, Matplotlib and SymPy being available in a single (default) environment. (I also enjoy the convenience of isympy.) It's not at all clear to me how to achieve this with the new "let's distribute some packages but don't allow installing other packages" strategy.

@clason
Contributor

clason commented Mar 18, 2024

...anything? 🦗

@MikeMcQuaid
Member

@clason I'm not sure what response you were expecting here, can you elaborate a bit more specifically what you're asking for?

If you're suggesting Homebrew change something: the best way of making that happen is to open a pull request. This document should help and we're happy to walk you through anything else.

Thanks!

@clason
Contributor

clason commented Mar 18, 2024

I am asking for explanation and public documentation of the already decided and in-progress Great Python Yeetening. This is project management level and not something a random contributor (who was not involved in this decision) can or should do.

@SMillerDev
Member

Homebrew employs exactly 0 project managers, so I think it'll be up to the people who want some documentation to add some documentation.

@MikeMcQuaid
Member

I am asking for explanation and public documentation of the already decided and in-progress Great Python Yeetening.

This issue is the closest thing we have to documentation right now.

and not something a random contributor (who was not involved in this decision) can or should do

It is definitely something a random contributor can do. If you don't want to do it: that's ok, though it may mean it does not get done.

@clason
Contributor

clason commented Mar 18, 2024

Homebrew employs exactly 0 project managers

Whether you call them "project managers" or "maintainers" or "core members" or "AGM" (whatever that is) doesn't really matter; someone has decided to go through with this and merge the corresponding PRs.

It is definitely something a random contributor can do. If you don't want to do it: that's ok, though it may mean it does not get done.

It is not, since if I knew, I wouldn't be asking. What are people supposed(!) to do now if they, e.g., want to have sympy in their general environment and isympy working? Will matplotlib be going next?

The PR description still has a lot of question marks and todos.

Basically, I am wondering why this process is going on while this is still marked as RFC and open.

@woodruffw
Member

Basically, I am wondering why this process is going on while this is still marked as RFC and open.

#157500 (comment) captures the overall scope of the changes here, and effectively "closes" the RFC part of the issue. AFAIK, the reason the issue is still open is so that we can continue to discuss any special cases, e.g. with setuptools.

Re: default environments: could you give us a sense of the packages you're expecting to be Homebrew-managed vs. pip managed? The problem we're currently running into is figuring out an appropriate set of "leaf" Python packages for Homebrew to provide, given PEP 668; I'm not even remotely familiar with the scientific Python ecosystem so context here would be invaluable.

@clason
Contributor

clason commented Mar 18, 2024

Any list will be personal, I'm afraid, so unlikely to give a good general case. (If you ask me, the "Big Four" are Numpy, Scipy, Matplotlib, and Sympy, and I would expect being able to install these at the very least. But others would insist on pandas being there, or others.) (Sympy integrates with ipython which is still shipped, so that would be another argument for keeping it as a leaf package.)

But my general point remains:

  1. The comment you linked does not explain what a user should do for the one (inevitable) package they want to add to their homebrew-managed global environment. (I don't mind it being dropped from homebrew, but locking the global env at the same time is just adding insult to injury.)
  2. I had to search for a while (and still did not find full information) about what's going on after noticing that more and more of my python packages vanished; I would have expected brew upgrade to print a warning message linking to a FAQ page (or at least that comment), similar to what you've done for other breaking changes. (And this is very much a breaking change.)

@woodruffw
Member

Any list will be personal, I'm afraid, so unlikely to give a good general case. (If you ask me, the "Big Four" are Numpy, Scipy, Matplotlib, and Sympy, and I would expect being able to install these at the very least. But others would insist on pandas being there, or others.) (Sympy integrates with ipython which is still shipped, so that would be another argument for keeping it as a leaf package.)

It's okay for it to be a personal list, but thanks -- this helps. We're currently discussing a better qualification system for this sort of thing, since we don't want to break the "entirely Homebrew-managed scientific workbench" use case.

I will look into improving the documentation here when I have a free moment, although I'm not sure if it'll be straightforward to print a brew upgrade message in this case -- it may have to live on brew.sh and get linked from this issue. The TL;DR variant is that you should either use pipx to manage Python tooling, virtual environments for big workbenches of things that include non-Homebrew-packaged deps, or --break-system-packages if you understand that doing so may interfere with the Homebrewed Python in ways that we don't support (I believe the EXTERNALLY-MANAGED message now states this).
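For the virtual-environment route mentioned in the TL;DR, here is a minimal sketch using the standard library `venv` module. The `create_workbench` helper is a made-up name. Whether a workbench should also see Homebrew-installed packages (the `system_site_packages` option, equivalent to `python3 -m venv --system-site-packages`) is a choice this thread does not settle, so treat that flag as an assumption rather than a recommendation.

```python
import tempfile
import venv
from pathlib import Path


def create_workbench(path: Path, reuse_system_packages: bool = False) -> dict:
    """Create a virtual environment and return its parsed pyvenv.cfg settings."""
    builder = venv.EnvBuilder(
        system_site_packages=reuse_system_packages,
        with_pip=False,  # skip the pip bootstrap to keep the sketch fast
    )
    builder.create(path)
    settings = {}
    for line in (path / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


with tempfile.TemporaryDirectory() as tmp:
    isolated = create_workbench(Path(tmp) / "isolated")
    shared = create_workbench(Path(tmp) / "shared", reuse_system_packages=True)
    print(isolated["include-system-site-packages"])
    print(shared["include-system-site-packages"])
```

In real use the environment would live somewhere permanent and be created with pip available, so that packages Homebrew does not ship can be installed into it without touching the Homebrew-managed site-packages.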

@cho-m
Member

cho-m commented Apr 6, 2024

Basically, I am wondering why this process is going on while this is still marked as RFC and open.

#157500 (comment) captures the overall scope of the changes here, and effectively "closes" the RFC part of the issue. AFAIK, the reason the issue is still open is so that we can continue to discuss any special cases, e.g. with setuptools.

I added a note at the top of the issue pointing to the corresponding comment.

I've pinned and closed the issue to keep it visible while also aligning its status.

We can continue discussing details here. I also have a tracking issue, #167905, if anyone wants to see any per-formula decisions.

@cho-m cho-m closed this as completed Apr 6, 2024
@cho-m cho-m added python Python use is a significant feature of the PR or issue and removed in progress Stale bot should stay away labels Apr 6, 2024

14 participants