better loopback handling (hiding it) #144
Comments
BTW, I think we could use daemonless fscache interfaces to avoid the loopback-mount approach as well, but loopback mounts are a more compatible approach for old kernels (~5.15). |
Also, directly reading files rather than block devices could be done in principle for erofs, as the previous kcomposefs did (and it would be useful for all disk fses, since they all have use cases for loopback mounts; I could even duplicate a whole fscache-like caching framework to make it more flexible), but the problem is that few in-kernel filesystems actually work this way, and it would duplicate the iomap interface. If we'd like to avoid loopback devices, I'd suggest supporting a file-based backend in addition to block devices in iomap, and making fscache work with iomap, so I could clean up the current erofs codebase as well. |
Yeah, having generic direct-file access via the VFS for all filesystems that support iomap would be great. |
(I will try to talk about this with Darrick later..) I will do my best to get a better user experience out of loopback devices; anyway, an EROFS-built-in file+caching framework would be controversial, which I'd like to avoid... |
So to me, the new "composefs" is about putting together things that already exist (overlayfs, fs-verity, erofs for metadata etc.). There's actually precedent for efficient in-kernel-only access to a file: and that's swap files. The more I think about it, the stronger this alignment is:
Am I missing something here? Basically ISTM we could either create a generic kernel shim layer that makes swap files look like a block device in-kernel and point erofs at it, or just directly hardcode erofs to do the same stuff that the swapfile code does. This alignment seems so strong that I feel like I must be missing something... |
Swap file stuff might be another messy part that needs to be resolved in the kernel codebase, if I remember correctly, since it records physically pinned extents and all I/O bypasses the filesystem... I don't remember where I heard this, but I guess that is not the way we'd like to proceed (assuming erofs data is pinned and bypasses the filesystem). Actually the simple way is just to use direct I/O to access the underlying files: replace the BIO interfaces in iomap with direct I/O so data can be read with direct I/O into the page cache, much like what fscache currently does. I think it works, but I need to discuss it with the related maintainers first.... |
OK, there is one thing we need in the stack here beyond what swapfiles do today - we still want to verify the fs-verity signature on the erofs metadata in the signed case, which does need to involve the backing-filesystem code. What's the problem with assuming the erofs is pinned? I can't see a problem with that - it's only a userspace flexibility problem, right? And from userspace that constraint seems perfectly fine; while we're running a host system or app (a mounted composefs), we can't move or delete its metadata, which seems perfectly reasonable. The "bypassing filesystem" part of swapfiles though is definitely relevant for the metadata fs-verity path, as I mentioned. But otherwise - what parts of the backing filesystem code could we care about here? Like, let's think about the composefs-on-erofs case. It seems perfectly fine to me to say we will not support fancy features from the lower filesystem (lower erofs) like compressing the "upper erofs metadata file" used by composefs. We just want raw access to the bits again, except we do want fs-verity. |
There are some log-structured filesystems like f2fs that can do GC in the background to reduce fragmentation, which can cause data movement. And apart from swap files, few files are actually pinned (erofs might become another special case then). Currently there are some kernel_read() users, but I tend to avoid that for generic filesystem I/O; anyway, let me try asking the iomap folks first. I also think it could be done with daemonless fscache and multiple cache directories as well, if such an interface existed. I will first ask Darrick about this. |
I'll also try to Cc @brauner here; not sure if he could give some opinions on this as well. |
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/f2fs/f2fs.h#n4457 - f2fs already supports swapfiles, so it must already handle this case. (And actually it looks like f2fs has other special things like "atomic files" that are in this space too)
In the general case, I'm not saying this should be an erofs feature - I'm saying effectively that I think many use cases for loopback mounts could be replaced with something swapfile-like, of which composefs would be one example. |
I once worked on f2fs, so I assume I know a bit more about this. Yes, you could pin this, but log-structured filesystems usually have more fragmentation and need to do GC.
I understand it could work, but this approach really needs to be discussed with the whole fs community. Even if I agree, the worst outcome could be that this approach eventually reaches Linus and I get a not-so-good response :(. |
I'm on linux-fsdevel@ if you prefer to discuss there. But I think we could at least gather some baseline consensus on approaches here, from the useful-for-composefs perspective, and then take a more fully-formed set of proposals to a thread there to decide on. One thing we didn't touch on yet, and that I'm interested in both your and @brauner's thoughts on, is the userspace API side of setting this up (in the original comment): that we basically pass a file descriptor. Whether the kernel then uses swapfile-like machinery (or "internal loopback devices", or generic direct-file iomap, or something using the fscache code - I didn't quite understand that one) would all be an in-kernel implementation detail that could actually be changed later. |
I think it is a great idea to be able to just give either a path, an fd, or a dirfd+path to the mount and have the filesystem read from a file directly using the vfs rather than having to fake a block device for it. However that is just the API. I don't really care about how it would work on the kernel side. That completely depends on what the best approach for the implementation is, which I honestly don't know, and is best hashed out on linux-fsdevel. |
One cool part of using iomap is that it could efficiently expose sparse files to the filesystem. |
Let me try to talk to the iomap maintainer first, then let's talk about this on the -fsdevel mailing list. |
Fwiw, I proposed that years ago and I'm working on this in the context of my diskseq changes, which extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free. I also talked about this plan at LSFMM. But it requires porting all filesystems over to the new mount API and some other possible block-level changes. |
I had to do a bit of digging for this, looks like this is: https://lore.kernel.org/linux-block/[email protected]/T/#rff86f0d3635d7fcb080495920c6fb4fd805cc81a
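For concreteness, here is a rough, hedged fragment of how that might surface through fsconfig(): the "source-diskseq" property is the unmerged proposal from that thread, while fsopen()/fsconfig() themselves exist today (glibc ≥ 2.36 for the wrappers); the diskseq value is assumed to have been read from the device elsewhere, e.g. via the BLKGETDISKSEQ ioctl or sysfs.

```c
/* Hedged fragment, not runnable against today's kernels: pin the mount source
 * to a specific disk sequence number so that racy reuse of the loop device is
 * detected.  "source-diskseq" is the proposed (unmerged) property discussed
 * above; the caller is assumed to have obtained diskseq_str from the device. */
#define _GNU_SOURCE
#include <sys/mount.h>

int setup_erofs_context(const char *loopdev, const char *diskseq_str)
{
    int fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
    if (fsfd < 0)
        return -1;

    fsconfig(fsfd, FSCONFIG_SET_STRING, "source", loopdev, 0);
    fsconfig(fsfd, FSCONFIG_SET_STRING, "source-diskseq", diskseq_str, 0); /* proposed */
    fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    return fsfd;   /* fsmount()/move_mount() would follow as usual */
}
```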
Hmm, yes, making loopback devices less racy sounds nice but I don't see the value in exposing loopback devices to userspace at all for the composefs case. I'm arguing that at least erofs should support directly taking a file as a mount source and doing whatever it wants internally to make that work. |
There are multiple aspects here. The first one is being able to provide an fd as a source property generally. The second one is loopback device allocation through the fsconfig interface. Both are fundamentally related because the latter operates on the source property. I would need to think about how I would like an API for this to look. |
Yeah, much appreciated. It's not something I can help with myself but a generic FS topic instead. (Update: I've talked with Darrick; there was no conclusion on this [since loopback devices are the generic way for all disk fses to access backing files, and duplicating another path causes churn]. If brauner would like to follow the original idea in this issue, I'm very glad!) |
From a userspace perspective, the problem with loopback devices is that they are a globally visible resource exposing part of the internals of a particular mount. Basically you have an operation that you want to do (mount file $path on $mntpoint) which results in one object you care about, which is the mountpoint. But, as part of setup you need this intermediate object, the loopback device, which is left around as a system-wide resource that is visible to admins and other programs, and is even mutable after the mount is done. I guess if you are a sysadmin trying to ad-hoc debug some filesystem image this setup is very useful. However, if you're a program using loopback as an internal implementation detail this gets in the way. Over time there have been some things added to make this saner, like loopback auto-cleanup and LOOP_CTL_GET_FREE. However, its history as a sysadmin thing still shines through. What we want is the ability to just specify a file as the source of the mount, and for the kernel to do whatever is needed internally to achieve this, and not expose any details to the user. For example, it should not be visible in e.g. losetup, or require access to global loopback devices that may allow you to write to other apps' loopbacked files. As a userspace person, I don't actually care what happens internally. It may be that we still create a loopback device internally for the mount and use that. However, it has to be anonymous, immutable, inaccessible to others and tied to the lifetime of the mount. |
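For reference, a minimal sketch (image path and mountpoint are placeholders, error handling omitted) of the loopback dance a program has to do today via the existing /dev/loop-control interface; the /dev/loopN node it allocates is exactly the globally visible intermediate object described above:

```c
/* Minimal sketch of today's userspace loopback setup.  Note the side effect
 * this thread complains about: a world-visible /dev/loopN node exists before
 * we ever get to mount(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <linux/loop.h>

int main(void)
{
    int ctl = open("/dev/loop-control", O_RDWR | O_CLOEXEC);
    int devnr = ioctl(ctl, LOOP_CTL_GET_FREE);           /* allocate a free loopN */

    char devname[64];
    snprintf(devname, sizeof(devname), "/dev/loop%d", devnr);

    int loopfd = open(devname, O_RDWR | O_CLOEXEC);
    int imgfd  = open("image.erofs", O_RDONLY | O_CLOEXEC);

    struct loop_config cfg = { .fd = imgfd };
    cfg.info.lo_flags = LO_FLAGS_READ_ONLY | LO_FLAGS_AUTOCLEAR;
    ioctl(loopfd, LOOP_CONFIGURE, &cfg);                  /* bind imgfd to loopN */

    /* Only now can we mount; the loop device stays visible system-wide. */
    return mount(devname, "/mnt", "erofs", MS_RDONLY, NULL);
}
```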
On Mon, Jun 12, 2023 at 12:34:44AM -0700, Alexander Larsson wrote:
> From a userspace perspective, the problem with loopback devices is
> that they are a globally visible resource exposing part of the
> internals of a particular mount. [...]
> What we want is the ability to just specify a file as the source of
> the mount, and for the kernel to do whatever is needed internally to
> achieve this, and not expose any details to the user. For example, it
> should not be visible in e.g. losetup, or require access to global
> loopback devices that may allow you to write to other apps' loopbacked
> files.
Yes, I just talked to Christoph about this and it's something that I
really want to tackle.
> As a userspace person, I don't actually care what happens internally.
> It may be that we still create a loopback device internally for the
> mount and use that. However, it has to be anonymous, immutable,
> inaccessible to others and tied to the lifetime of the mount.
Yes. I think if we can make this almost completely invisible to
userspace (no devtmpfs entry, no sysfs entries etc.) that would be ideal
and would allow us to sidestep the whole namespacing question.
|
A file as the source of the mount is good, but actually almost all disk fses use the BIO interface for all I/O. I tend to make EROFS work with both bdev-backed and file-backed approaches, but I think another duplicated file-based path for the same on-disk format is unnecessary (I could do it, but that is the worst option for me). |
On the kernel side I agree with this. But as a userspace consumer I don't care, and should not have to. If we get the API right, the kernel should be able to migrate from "just use loop internally" to a better implementation at a later time, without affecting userspace. |
In the long term (apart from hiding these entries), I wonder if there could be a more flexible file-backed way compared with the current loopback devices, with a single local I/O interface (I think sticking to the BIO approach is possible). |
It would probably also scale/perform better, with fewer weird device-change events and udev handler invocations.
If "fanotify: add pre-content hooks" lands upstream, I think we could do on-demand fetching and file access in a new and flexible way.
On-demand fetching could be useful for composefs objects for sure, but that seems like a totally distinct topic from hiding the erofs mount? Am I missing the connection? |
Hi Colin, thanks for the reply. I'd like to keep only one file-backed interface for simplicity. When the fanotify work lands (or is close to landing) upstream, I'd like to replace the fscache support with pure file-backed interfaces (so that both on-demand fetching and avoiding loop devices can be supported). |
OK, I think I see the connection. If fscache can write to a file, that would require a way to treat a file like a block device in a kernel-internal way, and that same mechanism could be used for our (simpler) readonly case. However the fscache use case here requiring writes would seem to raise a lot of the same points I was originally making here around swap files. Actually though, why would one want fscache to write to a file instead of a directory? Eh, maybe it's not important for me to know 😄 Are we still roughly in agreement that the relevant kernel interfaces would support being passed a file descriptor and it would just do whatever it wants internally? |
Also I just connected this with https://lwn.net/Articles/982887/ which seems like it had some similar points |
But anyway, let's see how the fanotify work goes... (Personally I also like fanotify, and ostree+composefs can use this too.) (BTW, fscache for on-demand fetching originates from the Incremental FS discussion in 2019. We once put development resources into developing and using it. If fanotify support lands, I think it'd be easy to switch to that approach, and more scenarios can use this new way.)
I think you could just use mount -t erofs <meta-only file> <mntpoint> to mount Composefs at that time, since I will replace the fscache backend with much simpler interfaces. |
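For illustration, a minimal sketch of that from C, assuming a kernel with EROFS file-backed mount support; it is just the syscall form of `mount -t erofs -o ro <file> <mntpoint>`, and the image path and mountpoint are placeholders:

```c
/* Sketch: mounting an EROFS image directly from a regular file.  Requires a
 * kernel with EROFS file-backed mount support; paths are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/var/lib/composefs/image.erofs", "/mnt/cfs",
              "erofs", MS_RDONLY, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```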
The upstream commit that closes this issue landed with the following message:

It actually has been around for years: For containers and other sandbox use cases, there will be thousands (and even more) of authenticated (sub)images running on the same host, unlike OS images.

Of course, all scenarios can use the same EROFS on-disk format, but bdev-backed mounts just work well for OS images since golden data is dumped into real block devices. However, it's somewhat hard for container runtimes to manage and isolate so many unnecessary virtual block devices safely and efficiently [1]: they just look like a burden to orchestrators and file-backed mounts are preferred indeed. There were already enough attempts such as Incremental FS, the original ComposeFS and PuzzleFS acting in the same way for immutable fses. As for current EROFS users, ComposeFS, containerd and Android APEXs will be directly benefited from it.

On the other hand, previous experimental feature "erofs over fscache" was once also intended to provide a similar solution (inspired by Incremental FS discussion [2]), but the following facts show file-backed mounts will be a better approach:
- Fscache infrastructure has recently been moved into new Netfslib which is an unexpected dependency to EROFS really, although it originally claims "it could be used for caching other things such as ISO9660 filesystems too." [3]
- It takes an unexpectedly long time to upstream Fscache/Cachefiles enhancements. For example, the failover feature took more than one year, and the daemonless feature is still far behind now;
- Ongoing HSM "fanotify pre-content hooks" [4] together with this will perfectly supersede "erofs over fscache" in a simpler way since developers (mainly containerd folks) could leverage their existing caching mechanism entirely in userspace instead of strictly following the predefined in-kernel caching tree hierarchy.

After "fanotify pre-content hooks" lands upstream to provide the same functionality, "erofs over fscache" will be removed then (as an EROFS internal improvement and EROFS will not have to bother with on-demand fetching and/or caching improvements anymore.)

[1] containers/storage#2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/[email protected]

Closes: containers/composefs#144
Reviewed-by: Sandeep Dhavale <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
So, this will be in 6.12 then. |
Got this quote from someone doing perf testing of backing containers w/composefs at scale:
I need to dig in here a bit... one thing I notice is that we're not following the blockdev locking rules, and it may help here to ensure our loopback device is locked so udev doesn't scan it, at least?
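A minimal sketch of that locking convention, assuming the systemd/udev block-device locking scheme (udev takes a shared BSD lock on the device node while probing and backs off if someone else holds it exclusively); the loop device path is a placeholder and error handling is minimal:

```c
/* Sketch: take an exclusive BSD lock on the loop device node while we
 * configure and mount it, so udev's probing (which takes a shared lock)
 * backs off.  "/dev/loop7" stands in for the device we allocated. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>

int main(void)
{
    int loopfd = open("/dev/loop7", O_RDWR | O_CLOEXEC);
    if (loopfd < 0) {
        perror("open");
        return 1;
    }

    if (flock(loopfd, LOCK_EX) < 0) {   /* held until close() or exit */
        perror("flock");
        return 1;
    }

    /* ... LOOP_CONFIGURE + mount() would go here while the lock is held ... */
    return 0;
}
```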
The file-backed mount support is great, though given we want composefs to be deployable even on older OSes/distros, it's probably worth a brief investigation into whether we can mitigate some of the overhead of loopback. One specific reporter dug in a bit and said udev launching "nfsrahead" is a cause of overhead:
which... (source code here) looks like it's scanning every block device to see if it's mounted somewhere via nfs? (wait what?????) |
After a lot of debate, it seems like we will be focusing on the "erofs+overlayfs" flow. There are positives and negatives to this.
This issue is about one of the negative things we lose with this combination, which is that we need to make a loopback device.
In our usage, the loopback device is an implementation detail of "composefs". However, its existence leaks out to all of the rest of the system: e.g. it shows up in lsblk, there are objects in sysfs for it, etc.
One thing I'd bikeshed here is that perhaps using the new mount API we could add something like the sketch below. So instead of passing the /dev/loopX pathname, we just give an open fd to the kernel (to erofs) and internally it creates the loopback setup. But the key here is that this block device would be exclusively owned by the erofs instance; it wouldn't be visible to userspace.
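A hedged sketch of what that could look like with the new mount API: FSCONFIG_SET_FD and the fsopen()/fsconfig()/fsmount()/move_mount() calls exist today (glibc ≥ 2.36 for the wrappers), but erofs accepting an open fd as its "source" this way is an assumption for illustration, not an existing interface; paths are placeholders and error handling is mostly omitted.

```c
/* Hypothetical sketch: handing erofs an open fd as the mount source via the
 * new mount API, so any loopback-like setup stays internal to the kernel.
 * FSCONFIG_SET_FD is a real fsconfig() command, but erofs taking its source
 * this way is an assumption for illustration; paths are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    int imgfd = open("/var/lib/composefs/image.erofs", O_RDONLY | O_CLOEXEC);

    int fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
    fsconfig(fsfd, FSCONFIG_SET_FD, "source", NULL, imgfd);   /* hypothetical use */
    fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
    fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

    int mfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
    if (move_mount(mfd, "", AT_FDCWD, "/mnt/cfs", MOVE_MOUNT_F_EMPTY_PATH) < 0) {
        perror("move_mount");
        return 1;
    }
    return 0;
}
```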