Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nix: fix auto-allocate-uids #1335

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andre4ik3
Copy link

@andre4ik3 andre4ik3 commented Feb 13, 2025

A few days ago, there was a change 1 that removed the nix.configureBuildUsers option, and made it so that the build users and group was always managed. Unfortunately this broke the auto-allocate-uids option:

  1. configureBuildUsers (internal variable) is set to false if auto-allocate-uids is set to true. Link

  2. The users and groups are configured when configureBuildUsers is true (so auto-allocate-uids is false)... Link

  3. ...but the users and groups are added to knownUsers and knownGroups regardless... Link

  4. ...which leads to the assertions on Line 798 always being false, and also leads to nix-darwin attempt to delete the nixbld group. Link

The error shown when rebuilding with the problematic change and auto-allocate-uids enabled is this:

error:
Failed assertions:
- refusing to delete group nixbld in users.knownGroups, this would break nix
- refusing to delete user _nixbld1 in users.knownUsers, this would break nix

This PR fixes both of these issues (failed assertion and attempt to delete nixbld group, which is still necessary for auto-allocate-uids despite no users being in the group), by only adding the user assertions when configureBuildUsers is true, and updating the users.knownUsers to also only be set in that case. Additionally, the nixbld group is now always created.

Footnotes

  1. Commit adc989f

Copy link
Collaborator

@emilazy emilazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I tried to accommodate auto-allocate-uids when making these changes but missed this detail. I’m surprised it has users since my experience was that it was pretty broken, but I definitely don’t want to make it harder for people to try out experimental features.

I didn’t realize that the nixbld group still has to exist for auto-allocate-uids; I guess for ownership of store paths being built or something?

@andre4ik3 andre4ik3 force-pushed the fix-auto-allocate-uids branch from c3fe32d to 7f6a969 Compare February 13, 2025 18:44
@andre4ik3
Copy link
Author

andre4ik3 commented Feb 13, 2025

This turned out to be a bit more involved than I originally anticipated. Whilst the original version of the PR did evaluate without errors, if it activated it would have probably broken everything, and by sheer coincidence I was on slow Wi-Fi and it needed to download some big updated packages, which just didn't work. So it never activated, which would have probably totally broken my computer and deleted the build group or something.

Now, I tested both options on my system, toggling between auto-allocate-uids on and off, and watching it delete and recreate the users properly. It turns out there was an activation check that would warn about pre-Sequoia build users, but it would break if no nixbld users existed. So I fixed it by adding a default of 0 as a fallback (let me know if there is a more "proper"/better way to do this) and skipping the check if no users exist, or if auto-allocate-uids is true (which aren't necessarily the same thing -- e.g. coming off of auto-allocate-uids and trying to get back to build users when none exist).

And also, in modules/users/default.nix, it would refuse to delete users with UID below 501, so if you turned on auto-allocate-uids it would leave a bunch of unused users, saying existing user '_nixbldXX' has unexpected uid XXX, skipping.... This was fixed by removing the check for UIDs below 501 (however I'm not sure why that check was there? Is there something else it was designed to be a safeguard against?), so now the users delete properly.

auto-allocate-uids.mp4

@andre4ik3 andre4ik3 requested a review from emilazy February 13, 2025 19:07
YorikSar added a commit to YorikSar/dotfiles that referenced this pull request Feb 16, 2025
Hold off updating nix-darwin past LnL7/nix-darwin#1313
until LnL7/nix-darwin#1335 is resolved, because
of auto-allocate-uids.
Copy link
Collaborator

@emilazy emilazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being so slow at getting back to this one.

Comment on lines 320 to 321
echo "deleting user ${name}..." >&2
dscl . -delete ${escapeShellArg "/Users/${name}"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this some more I’m wondering if we should just keep the warning here for now to unblock this without risking clobbering system users.

cc @Enzime – do you think it’s safe to remove this conditional here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think I’d be comfortable with the conservative change of allowing 350 < UID ≤ (350 + maxBuildUsers) as well as UID > 501 here. I think the idea is to avoid deleting system users.

@andre4ik3 andre4ik3 force-pushed the fix-auto-allocate-uids branch from 7f6a969 to 494257c Compare March 1, 2025 11:42
@andre4ik3
Copy link
Author

Ok, I updated it so now the group is properly emptied when auto-allocate-uids is set to true. This was done by implementing the fix you suggested, by making a known set of nrBuildUsers between 1-128 so they are always considered "managed". (And also the other changes were implemented, apart from the condition change and the user deletion, which I will wait a bit on for any further discussion).

I didn't cap nrBuildUsers in the config though, because I think it can be considered a breaking API change, which would need a new state version and everything. So, a potential error can arise if nrBuildUsers > 128, if auto-allocate-uids is then turned on, it will only delete the first 128 users, which could be seen as problematic. It wouldn't break anything, since none of the users are used anyway, but it's not nice to have those users around not doing anything and polluting the system. For now I think they can be deleted manually, and it shouldn't be too much of a big issue since nrBuildUsers is mostly set to the number of CPU cores, and no Mac has ever had more than 128 CPU cores (at least I don't think it has...? Unless maybe Hackintosh or VM...). But in the end I did it this way to avoid a breaking change.

This also has the side effect of fixing another bug I ran across whilst working on this: if you raise nrBuildUsers, then rebuild, it will make the new users. But then lowering nrBuildUsers back down and rebuilding won't delete them since they're not in knownUsers anymore 🙃 (but again, with the above caveat of only deleting the first 128 build users, and either way it wouldn't work in current nix-darwin as it doesn't delete UIDs < 501)

Copy link
Collaborator

@emilazy emilazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the fancier approach! 💜 And yeah, fixing nrBuildUsers adjustment was my original motivation for wanting to pre‐fill knownUsers that happened to dovetail nicely with the problems here.

I didn't cap nrBuildUsers in the config though, because I think it can be considered a breaking API change, which would need a new state version and everything. So, a potential error can arise if nrBuildUsers > 128, if auto-allocate-uids is then turned on, it will only delete the first 128 users, which could be seen as problematic. It wouldn't break anything, since none of the users are used anyway, but it's not nice to have those users around not doing anything and polluting the system. For now I think they can be deleted manually, and it shouldn't be too much of a big issue since nrBuildUsers is mostly set to the number of CPU cores, and no Mac has ever had more than 128 CPU cores (at least I don't think it has...? Unless maybe Hackintosh or VM...). But in the end I did it this way to avoid a breaking change.

As far as I’m aware, the most parallel Mac ever was the Mac Pro with a Xeon W-3275M, with 28 cores and 56 threads. There are rumours of a 32‐core Apple Silicon chip on the horizon, but that wouldn’t have SMT. I have no idea if max-jobs = auto uses cores or threads, though. In practice I think 32 jobs should be more than enough for almost everybody, because with that many cores you also probably want to have those builds using parallelism themselves.

Evaluation‐time errors are a relatively innocuous kind of backwards incompatibility and I think it would be okay here. The state version is mostly for things where there’s a backwards‐incompatible change in state managed by the system; say, defaulting to a newer database version with an incompatible format, or otherwise changing a default or internal format in a way that could break an existing system at activation.

Tightening up evaluation checks can’t really brick someone’s system in that way, so I think constraining the range of nrBuildUsers to avoid any mishaps is a good idea. Since build UIDs start at 351 these days, and going above 400 has proven dodgy in the past, we probably want the default cap to be < 50. Apple already uses GID 395, so I’d feel hesitant going above 44. Perhaps we could bound the maximum at a nice round 40 or even 36 and wait to see if that’s a problem for anyone? It’s easy for us to expand the maximum range later, especially if we move away from the knownUsers model, but shrinking it is tough, as you’ve found, so I’d err on the side of conservatism.

(People who have a custom ids.uids.nixbld set would be overly constrained by any of these values. But if they know what they’re doing then they can override the type, and if there’s a legitimate use case for high values I’d rather get a bug report about it now than only find out about it when something else breaks.)

# between `_nixbld1` and `_nixbld128` is considered "managed" by
# nix-darwin, and will be created/deleted as `nrBuildUsers` increases
# and decreases (or is set to 0 in case of `auto-allocate-uids`.
nixbldUserNames = map (x: "_nixbld${toString x}") (range 1 128);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nixbldUserNames = map (x: "_nixbld${toString x}") (range 1 128);
nixbldUserNames = map (x: "_nixbld${toString x}") (range 1 maxBuildUsers);

and use that for a lib.types.ints.between 1 maxBuildUsers type in nrBuildUsers.

Comment on lines 320 to 321
echo "deleting user ${name}..." >&2
dscl . -delete ${escapeShellArg "/Users/${name}"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think I’d be comfortable with the conservative change of allowing 350 < UID ≤ (350 + maxBuildUsers) as well as UID > 501 here. I think the idea is to avoid deleting system users.

@emilazy
Copy link
Collaborator

emilazy commented Mar 2, 2025

After thinking about it a little more:

Since the installers don’t really cap the value and someone who installs with 64 users and then gets the default value of 32 would have the users from 32 to maxBuildUsers removed but not the ones beyond that, maybe it actually makes sense to decouple the maximum value we’ll let you set nrBuildUsers to, and the maximum _nixbld* we’ll try to manage. it may make sense to continue managing up to _nixbld128 even if we only allow nrBuildUsers to be set as high as 40.

I’ll sleep on it, but let me know what you think. One concern I have is whether checking the existence of all of those users might meaningfully slow down activation…

@andre4ik3 andre4ik3 force-pushed the fix-auto-allocate-uids branch 3 times, most recently from 37bd408 to 554455d Compare March 2, 2025 10:35
@andre4ik3
Copy link
Author

andre4ik3 commented Mar 2, 2025

Ok, I set the deletion check to be like before, checking if UID > 501, but now also if it's between ids.uids.nixbld and ids.uids.nixbld + nix.maxBuildUsers. And also nix.nrBuildUsers is now limited to be between 1 and nix.maxBuildUsers, which is a new internal setting I added so the maxBuildUsers can be shared between the two modules.

One scenario which I can think of in which this can break is if someone decides that they want to move their UID range to something else, say they are currently on base UID of 350, but want to go to 30000 or something. The new ids.uids.nixbld won't match the old one, and the old UID 350 users won't be deleted in the check. But I guess they can first switch auto-allocate-uids on, then change the UID base, and then turn it back off as a workaround. Or just delete the users manually. And also it's not an issue when going back from 30000 to 350, since the 30000+ UIDs are above 501 :)

Since the installers don’t really cap the value and someone who installs with 64 users and then gets the default value of 32 would have the users from 32 to maxBuildUsers removed but not the ones beyond that, maybe it actually makes sense to decouple the maximum value we’ll let you set nrBuildUsers to, and the maximum _nixbld* we’ll try to manage. it may make sense to continue managing up to _nixbld128 even if we only allow nrBuildUsers to be set as high as 40.

Makes sense, so basically cleaning up the installer's leftover unused users. Although I wonder what the upper limit for this should really be? I set it as 128 for now, since it seemed like a good number, but ideally there wouldn't be any number at all and any user of the build group would be considered "managed".

One concern I have is whether checking the existence of all of those users might meaningfully slow down activation…

Here's a totally unscientific benchmark of 0, 40, 64, and 128 users, based purely off of screen recording my terminal (measured first frame of "setting up user launchd services" to drop to shell + 5 frames). I think 128 users could be considered a bit too slow, but if you think so, I can just set it to 40 or 64 users, which are a bit faster. Of course 0 users is fastest, but currently most people will have 32 in knownUsers.

Oh also in all of these recordings 0 users actually exist (auto-allocate-uids is on), so that might affect things as well.

activation.mp4

Also now manually setting nrBuildUsers will override the default that depends on auto-allocate-uids, which matches NixOS behavior (I think). So if nrBuildUsers is set, no matter if auto-allocate-uids is on or off, it will create that number of users. Just that having auto-allocate-uids being turned on removes the safety check of 0 build users.

@andre4ik3 andre4ik3 force-pushed the fix-auto-allocate-uids branch from 554455d to bf27f60 Compare March 3, 2025 01:37
Fixes auto-allocate-uids not working due to errors during activation
checks. Also fixes a bug where raising `nrBuildUsers` would create new
`_nixbld` users, but lowering it wouldn't then remove those users.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants