-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mount: add support for optionally skipping loopback device #361
Comments
The complexity of the mount code is growing a lot:
So this storing the The thing about opening vs. getting an fd from the user is pretty boring and is well handled in the two separate entrypoint APIs for each of those cases, but then we need to go on a path:
plus all the cleanups of optionally closing fds if we were the ones that opened it. I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one. Basically, the handling of trying to make sure we rmdir() and close() stuff as appropriate is getting a bit too complicated... particularly when mixed with various early error exits... |
This is definitely cool, and will help tons to avoid loads of loopback devices e.g. in the container image case. I agree that the mount code complexity is growing, in particular since we need to support legacy mount APIs, etc. But, this need only be included in the modern API mount codepath, and I believe the complexity is worth it. |
@alexlarsson what's the story behind needing to support the legacy mount APIs? |
@allisonkarlitskaya Its needed on e.g. rhel9 at least I belive. |
As I understand things today, if we go with the current design in #332 then we don't need the memfd path, right? Skipping the loopback though would be quite valuable down the line, but is IMO not critical right now again versus just fleshing out the status quo.
That seems fine to me if it simplifies the code though. |
Ya, I feel like the memfd thing was an interesting idea that got us on the right path but maybe not all that useful in the end... |
ps: skipping the loopback device still doesn't let us mount in containers, unfortunately. |
A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel. I wonder if we could convince erofs (and vfs) folks to trust filesystem images that come from non-mutable sources, eg:
|
This subthread is I think covered by containers/storage#2099 (also xref https://github.com/cgwalters/composefs-oci-experimental?tab=readme-ov-file#mounting-composefs-as-non-root )
That's I think nowadays covered by https://lwn.net/Articles/941764/ - but no the real problem remains (mostly for more complex filesystems) that offline crafted input can result in memory corruption - the XFS maintainer mentioned things like having file blocks point to metadata as one that would start to be quite expensive to check. That's mainly a concern for writable filesystems - erofs is regularly fuzzed and is I think fairly robust but... Anyways I think the "composefs as a service" is the best fix here. |
Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen. I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files. |
As I've said in containers/storage#2099 (comment), personally it won't be hard to enable EROFS with FS_USERNS_MOUNT (e.g. disable compression). |
Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage. Does this help? |
I think that won't help with rootless idmapped mounts. We either need FUSE passthrough, or overlay in a user namespace needs to honor the |
This is cool: https://lore.kernel.org/lkml/CAMuHMdVqa2Mjqtqv0q=uuhBY1EfTaa+X6WkG7E2tEnKXJbTkNg@mail.gmail.com/T/
The text was updated successfully, but these errors were encountered: