better loopback handling (hiding it) #144

Open
cgwalters opened this issue Jun 8, 2023 · 36 comments
Labels
enhancement New feature or request

Comments

@cgwalters
Contributor

After a lot of debate, it seems like we will be focusing on the "erofs+overlayfs" flow. There are positives and negatives to this.

This issue is about one of the negatives of that combination: we need to create a loopback device.

In our usage, the loopback device is an implementation detail of "composefs". However, its existence leaks out to the rest of the system: it shows up in lsblk, there are objects under /sys for it, etc.

One thing I'd bikeshed here is that, using the new mount API, we could perhaps add something like this

diff --git a/libcomposefs/lcfs-mount.c b/libcomposefs/lcfs-mount.c
index ea2c2e9..b9d608d 100644
--- a/libcomposefs/lcfs-mount.c
+++ b/libcomposefs/lcfs-mount.c
@@ -393,7 +393,7 @@ static int lcfs_mount_erofs(const char *source, const char *target,
                return -errno;
        }
 
-       res = syscall_fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", source, 0);
+       res = syscall_fsconfig(fd_fs, FSCONFIG_SET_FD, "loop-file", src_fd, 0);
        if (res < 0)
                return -errno;

So instead of passing the /dev/loopX pathname, we just give an open fd to the kernel (to erofs) and internally it creates the loopback setup. But the key here is that this block device would be exclusively owned by the erofs instance; it wouldn't be visible to userspace.
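For concreteness, here is a minimal userspace sketch of how that flow might look with the new mount API. The syscalls (fsopen/fsconfig/fsmount/move_mount) and the FSCONFIG_SET_FD value type already exist; the "loop-file" key is the hypothetical knob from the diff above, not something any kernel accepts today, and error-path cleanup is omitted.

```c
/* Hypothetical: mount an erofs image by handing the kernel an open fd
 * instead of a /dev/loopX path. "loop-file" is the proposed key from the
 * diff above, not an existing kernel interface. */
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

static int mount_erofs_from_fd(const char *image, const char *target)
{
	int src_fd = open(image, O_RDONLY | O_CLOEXEC);
	if (src_fd < 0)
		return -1;

	int fd_fs = syscall(SYS_fsopen, "erofs", FSOPEN_CLOEXEC);
	if (fd_fs < 0)
		return -1;

	/* Proposed: give erofs the open file; any loop setup stays internal */
	if (syscall(SYS_fsconfig, fd_fs, FSCONFIG_SET_FD, "loop-file", NULL, src_fd) < 0)
		return -1;
	if (syscall(SYS_fsconfig, fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
		return -1;

	int fd_mnt = syscall(SYS_fsmount, fd_fs, FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY);
	if (fd_mnt < 0)
		return -1;

	return syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
		       MOVE_MOUNT_F_EMPTY_PATH);
}
```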

@cgwalters cgwalters added the enhancement New feature or request label Jun 8, 2023
@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

BTW, I think we could use daemonless fscache interfaces to avoid the loopback-mount approach as well, but loopback mounts are a more compatible approach for old kernels (~5.15).
I need to keep talking with David Howells from time to time about this daemonless fscache mounting, but he's quite busy with other kernel work.
Also, there is already a non-root patch for fscache queued for the next cycle:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?id=a64498ff493f468ea6d2e441059c24012128b28a

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

Also, directly reading files rather than block devices could in principle be done for erofs, as the previous kcomposefs did (and it would be useful for all disk fses, since they all have use cases for loopback mounts; I could even duplicate a whole fscache-like caching framework so that it's more flexible), but the problem is that few in-kernel filesystems really work this way.

It would also duplicate the iomap interface. If we'd like to avoid loopback devices, I'd suggest supporting a file-based backend in addition to block devices in iomap and making fscache work with iomap, so that I could clean up the current erofs codebase as well.

@alexlarsson
Collaborator

alexlarsson commented Jun 8, 2023

Yeah, having generic direct-file access via the VFS for all filesystems that support iomap would be great.

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

Yeah, having generic direct-file access via the VFS for all filesystems that support iomap would be great.

(I will try to discuss this with Darrick later.) I will do my best to get a better user experience out of loopback devices; in any case, an EROFS-built-in file+caching framework is controversial, which I'd like to avoid...

@cgwalters
Contributor Author

cgwalters commented Jun 8, 2023

So to me, the new "composefs" is about putting together things that already exist (overlayfs, fs-verity, erofs for metadata etc.).

There's actually precedent for efficient in-kernel-only access to a file: swap files. The more I think about it, the stronger this alignment looks:

  • While the file is mounted as an erofs, we don't need to support unlink() or moving it etc. Its physical extents can stay pinned, which is exactly how swap files work.
  • Like swap files, we don't want any buffered IO at all, just efficient direct access to the bits
  • We already have a ton of if (IS_SWAPFILE(inode)) checks sprinkled around the kernel code that add the constraints we want (e.g. vfs_fallocate() will fail on it, may_delete() already rejects unlinking it, etc.)

Am I missing something here? Basically ISTM we could either create a generic kernel shim layer that makes swap files look like a block device in-kernel and point erofs at it, or just directly hardcode erofs to do the same stuff that swapfile.c is doing.

This alignment seems so strong that I feel like I must be missing something...
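For anyone who wants to see those guards from userspace, here is a small probe of the constraints listed above; it assumes a swap file was already prepared at an illustrative path (e.g. with fallocate + mkswap) and requires root. It only demonstrates the existing IS_SWAPFILE() behaviour, nothing composefs-specific.

```c
/* Probe the swap-file constraints described above: once a file is active
 * as swap, fallocate() and unlink() on it are rejected by the kernel's
 * IS_SWAPFILE() checks. Assumes /var/tmp/probe.swap already exists and
 * was set up with mkswap; requires root. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/swap.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/var/tmp/probe.swap";   /* illustrative path */

	if (swapon(path, 0) < 0) {
		perror("swapon");
		return 1;
	}

	int fd = open(path, O_RDWR | O_CLOEXEC);
	if (fd >= 0 && fallocate(fd, 0, 0, 128 << 20) < 0)
		printf("fallocate on active swap file: %s\n", strerror(errno));
	if (unlink(path) < 0)
		printf("unlink on active swap file: %s\n", strerror(errno));

	swapoff(path);
	return 0;
}
```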

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

There's actually precedent for efficient in-kernel-only access to a file: swap files. The more I think about it, the stronger this alignment looks:

Swap file handling might be another messy part of the kernel codebase that needs to be resolved, if I remember correctly, since it records pinned physical extents and all I/O bypasses the filesystem... I don't remember where I heard this, but I suspect that's not the way we'd like to proceed (assuming erofs data is pinned and bypasses the filesystem).

Actually, the simple way is just to use direct I/O to access the underlying files: replace the BIO interfaces in iomap with direct I/O so data can be read into the page cache, much like what fscache currently does. I think it would work, but I need to discuss it with the relevant maintainers first...

@cgwalters
Contributor Author

cgwalters commented Jun 8, 2023

Swap file handling might be another messy part of the kernel codebase that needs to be resolved, if I remember correctly, since it records pinned physical extents and all I/O bypasses the filesystem... I don't remember where I heard this, but I suspect that's not the way we'd like to proceed (assuming erofs data is pinned and bypasses the filesystem).

OK, there is one thing we need in this stack beyond what swapfiles do today: we still want to verify the fs-verity signature on the erofs metadata in the signed case, which does need to involve the backing-filesystem code.

What's the problem with assuming the erofs is pinned? I can't see a problem with that - it's only a userspace flexibility problem, right? And from userspace that constraint seems perfectly fine; while we're running a host system or app (a mounted composefs), we can't move or delete its metadata, which seems perfectly reasonable.

The "bypassing filesystem" part of swapfiles though is definitely relevant for the metadata fs-verity path as a I mentioned. But otherwise - what parts of the backing filesystem code could we care about here?

Like, let's think about the composefs-on-erofs case. It seems perfectly fine to me to say we will not support fancy features from the lower filesystem (lower erofs) like compressing the "upper erofs metadata file" used by composefs. We just want raw access to the bits again, except we do want fs-verity.

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

What's the problem with assuming the erofs is pinned? I can't see a problem with that

There are some log-structured filesystems like f2fs that do GC in the background to reduce fragmentation, which can cause data movement. And apart from swap files, very few files are actually pinned (erofs images might become another special case then).

Currently there are some kernel_read() users, but I tend to avoid that for generic filesystem I/O. Anyway, let me ask the iomap side first. I also think it could be done with daemonless fscache and multiple cache directories, if such an interface existed. I will ask Darrick about this first.

@hsiangkao
Contributor

I'll also Cc @brauner here; not sure if he could give some opinions on this as well.

@cgwalters
Contributor Author

There are some log-structured filesystems like f2fs that do GC in the background to reduce fragmentation, which can cause data movement.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/f2fs/f2fs.h#n4457 - f2fs already supports swapfiles, so it must already handle this case. (And actually it looks like f2fs has other special things like "atomic files" that are in this space too)

And apart from swap files, very few files are actually pinned (erofs images might become another special case then).

In the general case, I'm not saying this should be an erofs feature - I'm saying effectively that I think many use cases for loopback mounts could be replaced with something swapfile-like, of which composefs would be one example.

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

f2fs already supports swapfiles, so it must already handle this case. (And actually it looks like f2fs has other special things like "atomic files" that are in this space too)

I once worked on f2fs, so I assume I know a bit about this. Yes, you could pin the file, but log-structured filesystems usually have more fragmentation and need to do GC.

In the general case, I'm not saying this should be an erofs feature - I'm saying effectively that I think many use cases for loopback mounts could be replaced with something swapfile-like, of which composefs would be one example.

I understand it could work, but this approach really needs to be discussed with the whole fs community. Even if I agree, the worst case is that this approach eventually reaches Linus and gets a not-so-good response :(.

@cgwalters
Contributor Author

cgwalters commented Jun 8, 2023

I understand it could work, but this approach really needs to be discussed with the whole fs community.

I'm on linux-fsdevel@ if you prefer to discuss there.

But I think we could at least gather some baseline consensus on approaches here from the useful-for-composefs perspective and then take a more fully-formed set of proposals to a thread there to decide on.

One thing we haven't touched on yet, and I'm interested in both your and @brauner's thoughts on, is the userspace API for setting this up (from the original comment): that we basically pass a file descriptor to fsconfig for the source instead of a block device.

Swapfiles (or "internal loopback devices", or generic direct-file iomap, or something using the fscache code (I didn't quite understand this one)) would all be an in-kernel implementation detail that could actually be changed later.

@alexlarsson
Collaborator

I think it is a great idea to be able to give either a path, an fd, or a dirfd+path to the mount and have the filesystem read from a file directly using the VFS, rather than having to fake a block device for it.

However, that is just the API. I don't really care how it would work on the kernel side. That completely depends on what the best approach for the implementation is, which I honestly don't know, and is best hashed out on linux-fsdevel.

@alexlarsson
Collaborator

One cool part of using iomap is that it could efficiently expose sparse files to the filesystem.

@hsiangkao
Contributor

Let me try to talk to the iomap maintainer first; then let's discuss this on the -fsdevel mailing list.

@brauner

brauner commented Jun 8, 2023

I understand it could work, but this approach really needs to be discussed with the whole fs community.

I'm on linux-fsdevel@ if you prefer to discuss there.

But I think we could at least gather some baseline consensus on approaches here from the useful-for-composefs perspective and then take a more fully-formed set of proposals to a thread there to decide on.

One thing we haven't touched on yet, and I'm interested in both your and @brauner's thoughts on, is the userspace API for setting this up (from the original comment): that we basically pass a file descriptor to fsconfig for the source instead of a block device.

Fwiw, I proposed that years ago, and I'm working on it in the context of my diskseq changes, which extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free. I also talked about this plan at LSFMM.

But it requires porting all filesystems over to the new mount API, plus some other possible block-level changes.

@cgwalters
Contributor Author

Fwiw, I proposed that years ago, and I'm working on it in the context of my diskseq changes, which extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free.

I had to do a bit of digging for this; it looks like it is: https://lore.kernel.org/linux-block/[email protected]/T/#rff86f0d3635d7fcb080495920c6fb4fd805cc81a

extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free.

Hmm, yes, making loopback devices less racy sounds nice, but I don't see the value in exposing loopback devices to userspace at all for the composefs case. I'm arguing that erofs, at least, should support directly taking a file as a mount source and doing whatever it wants internally to make that work.

@brauner

brauner commented Jun 8, 2023

There are multiple aspects here. The first is being able to provide an fd as a source property generally. The second is loopback device allocation through the fsconfig interface. Both are fundamentally related because the latter operates on the source property. I would need to think about how I would like an API for this to look.

@hsiangkao
Contributor

hsiangkao commented Jun 8, 2023

There are multiple aspects here. The first is being able to provide an fd as a source property generally. The second is loopback device allocation through the fsconfig interface. Both are fundamentally related because the latter operates on the source property. I would need to think about how I would like an API for this to look.

Yeah, much appreciated. It's not something I can help with directly; it's a generic FS topic instead.
If loopback devices are a problem, we could have a discussion covering all disk-fs loopback-mount use cases; honestly, I have no better idea on this either.

(Update: I've talked with Darrick; no conclusion on this [since loopback devices are the generic way for all disk fses to access backing files, and duplicating another path would cause churn]. If brauner would like to pursue the original idea in this issue, I'd be very glad!)

@alexlarsson
Collaborator

alexlarsson commented Jun 12, 2023

From a userspace perspective, the problem with loopback devices is that they are a globally visible resource exposing part of the internals of a particular mount. Basically, you have an operation you want to perform (mount file $path on $mntpoint) that results in one object you care about: the mountpoint. But as part of setup you need an intermediate object, the loopback device, which is left around as a system-wide resource that is visible to admins and other programs, and is even mutable after the mount is done.

I guess if you are a sysadmin trying to ad-hoc debug some filesystem image, this setup is very useful. However, if you're a program using loopback as an internal implementation detail, it gets in the way. Over time some things have been added to make this saner, like loopback auto-cleanup and LOOP_CTL_GET_FREE. However, its history as a sysadmin tool still shines through.

What we want is the ability to just specify a file as the source of the mount, and for the kernel to do whatever is needed internally to achieve this without exposing any details to the user. For example, it should not be visible in e.g. losetup, or require access to global loopback devices that may allow you to write to other apps' loopback files.

As a userspace person, I don't actually care what happens internally. It may be that we still create a loopback device internally for the mount and use that. However, it has to be anonymous, immutable, inaccessible to others, and tied to the lifetime of the mount.
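To make the contrast concrete, this is roughly the dance a program has to do today with the real loop ioctls (error handling trimmed, filesystem type and paths illustrative); LO_FLAGS_AUTOCLEAR is the auto-cleanup mentioned above, and the /dev/loopN node this produces is exactly the globally visible intermediate object being criticized.

```c
/* Status quo: allocate a global /dev/loopN, attach the image fd, then
 * mount the device. Every step creates or touches a system-wide visible
 * object. Error handling is trimmed for brevity. */
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>

static int loop_mount(const char *image, const char *target)
{
	int img_fd = open(image, O_RDONLY | O_CLOEXEC);
	int ctl_fd = open("/dev/loop-control", O_RDWR | O_CLOEXEC);
	if (img_fd < 0 || ctl_fd < 0)
		return -1;

	int devnr = ioctl(ctl_fd, LOOP_CTL_GET_FREE);   /* grabs a global device */
	if (devnr < 0)
		return -1;

	char devname[64];
	snprintf(devname, sizeof(devname), "/dev/loop%d", devnr);
	int loop_fd = open(devname, O_RDWR | O_CLOEXEC);

	struct loop_config cfg = { .fd = img_fd };
	cfg.info.lo_flags = LO_FLAGS_READ_ONLY | LO_FLAGS_AUTOCLEAR;
	if (loop_fd < 0 || ioctl(loop_fd, LOOP_CONFIGURE, &cfg) < 0)
		return -1;

	/* The /dev/loopN node now shows up in lsblk, sysfs, udev events, ... */
	return mount(devname, target, "erofs", MS_RDONLY, NULL);
}
```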

@brauner

brauner commented Jun 12, 2023 via email

@hsiangkao
Contributor

A file as the source of the mount is good, but almost all disk fses use the BIO interface for all I/O. I tend toward making EROFS work with both bdev-backed and file-backed approaches, but I think a duplicated file-based path with the same on-disk format is unnecessary (I could do it, but that is the worst option for me).
So I'd like to know whether loop device scalability could be resolved in a generic way for all disk fses; that would be much better!

@alexlarsson
Collaborator

So I'd like to know whether loop device scalability could be resolved in a generic way for all disk fses; that would be much better!

On the kernel side I agree with this. But as a userspace consumer I don't care, and should not have to. If we get the API right, the kernel should be able to migrate from "just use loop internally" to a better implementation at a later time, without affecting userspace.

@hsiangkao
Contributor

So I'd like to know whether loop device scalability could be resolved in a generic way for all disk fses; that would be much better!

On the kernel side I agree with this. But as a userspace consumer I don't care, and should not have to. If we get the API right, the kernel should be able to migrate from "just use loop internally" to a better implementation at a later time, without affecting userspace.

In the long term (apart from hiding these entries), I wonder if there could be a more flexible file-backed approach than the current loopback devices, with a unified local I/O interface (I think sticking to the BIO approach is possible).
Another thing I'd like to mention is that, if it works out well, the cachefiles backend could be adapted to this form for other disk fses (in addition to EROFS) as well (and I could also clean up my codebase thanks to this work).
Actually, our cloud environment has also found loop somewhat inflexible for some use cases (not only EROFS). If Christoph and @brauner agree on this, it would honestly improve our internal use cases as well.

@alexlarsson
Collaborator

Yes. I think if we can make this almost completely invisible to userspace (no devtmpfs entry, no sysfs entries, etc.), that would be ideal and would allow us to sidestep the whole namespacing question.

It would probably also scale and perform better, with fewer spurious device-change events and udev handler invocations.

@hsiangkao
Contributor

If "fanotify: add pre-content hooks" lands upstream, I think we could do on-demand fetching and file access in a new, unified, and flexible way.
I'm not sure whether file interfaces are a good idea, but it seems something like vfs_iocb_iter_read() would be the only dependency, which is much like the current fscache approach (but even simpler).

@cgwalters
Contributor Author

On-demand fetching could be useful for composefs objects for sure, but that seems like a totally distinct topic from hiding the erofs mount? Am I missing the connection?

@hsiangkao
Contributor

hsiangkao commented Jul 26, 2024

On-demand fetching could be useful for composefs objects for sure, but that seems like a totally distinct topic from hiding the erofs mount? Am I missing the connection?

Hi Colin, thanks for the reply. I'd like to keep only one file-backed interface for simplicity. When the fanotify work lands (or is close to landing) upstream, I'd like to replace the fscache support with pure file-backed interfaces (so that both on-demand fetching and loop-device avoidance are supported).

@cgwalters
Contributor Author

OK, I think I see the connection. If fscache can write to a file, that would require a way to treat a file like a block device in a kernel-internal way, and that same mechanism could be used for our (simpler) readonly case.

However the fscache use case here requiring writes would seem to raise a lot of the same points I was originally making here around swap files.

Actually though, why would one want fscache to write to a file instead of a directory? Eh, maybe it's not important for me to know 😄

Are we still roughly in agreement that the relevant kernel interfaces would support being passed a file descriptor and it would just do whatever it wants internally?

@cgwalters
Contributor Author

However the fscache use case here requiring writes would seem to raise a lot of the same points I was originally making here around swap files.

Also, I just connected this with https://lwn.net/Articles/982887/, which seems to make some similar points.

@hsiangkao
Contributor

hsiangkao commented Jul 26, 2024

OK, I think I see the connection. If fscache can write to a file, that would require a way to treat a file like a block device in a kernel-internal way, and that same mechanism could be used for our (simpler) readonly case.

However the fscache use case here requiring writes would seem to raise a lot of the same points I was originally making here around swap files.

Actually though, why would one want fscache to write to a file instead of a directory? Eh, maybe it's not important for me to know 😄

But anyway, let's see how the fanotify work goes. (Personally I also like fanotify, and ostree+composefs could use it too.)

(BTW, fscache for on-demand fetching originated from the Incremental FS discussion in 2019. We once put development resources into building and using it. If fanotify support lands, I think it would be easy to switch to that approach, and more scenarios could use this new way.)

Are we still roughly in agreement that the relevant kernel interfaces would support being passed a file descriptor and it would just do whatever it wants internally?

I think you could just use mount -t erofs <meta-only file> <mntpoint> to mount composefs at that point, since I will replace the fscache backend with much simpler interfaces.
Also, it seems Android APEX loopback mounts and similar use cases can all benefit from this...

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Aug 29, 2024
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.

Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices.  However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed.  There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses.  As
for current EROFS users, ComposeFS, containerd and Android APEXs will
directly benefit from it.

On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
 - Fscache infrastructure has recently been moved into new Netfslib
   which is an unexpected dependency to EROFS really, although it
   originally claims "it could be used for caching other things such as
   ISO9660 filesystems too." [3]

 - It takes an unexpectedly long time to upstream Fscache/Cachefiles
   enhancements.  For example, the failover feature took more than
   one year, and the daemonless feature is still far behind now;

 - Ongoing HSM "fanotify pre-content hooks" [4] together with this will
   perfectly supersede "erofs over fscache" in a simpler way since
   developers (mainly containerd folks) could leverage their existing
   caching mechanism entirely in userspace instead of strictly following
   the predefined in-kernel caching tree hierarchy.

After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)

[1] containers/storage#2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/[email protected]
Closes: containers/composefs#144
Signed-off-by: Gao Xiang <[email protected]>
@travier
Member

travier commented Oct 10, 2024

@allisonkarlitskaya
Collaborator

@alexlarsson
Collaborator

So, this will be in 6.12 then.

@cgwalters
Contributor Author

Got this quote from someone doing perf testing of backing containers w/composefs at scale:

systemd-udevd.service memory shows that there are 2 orders of magnitude larger spikes in memory during load, and a baseline that is 6x higher.

I need to dig in here a bit... one thing I notice is that we're not following the blockdev locking rules; it may help to ensure our loopback device is locked so that udev at least doesn't scan it?
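For reference, a minimal sketch of the locking convention being referred to, assuming it is the systemd "Locking Block Device Access" flock protocol (udev workers try to take a shared non-blocking lock and defer probing while an exclusive lock is held); the device path is illustrative and this is not wired into libcomposefs today.

```c
/* Hold an exclusive BSD flock on the whole loop device while it is being
 * set up, so udev (which tries LOCK_SH | LOCK_NB) defers probing it.
 * The device path is illustrative. */
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

static int lock_loop_device(const char *devpath /* e.g. "/dev/loop7" */)
{
	int fd = open(devpath, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return -1;
	}
	/* Keep this fd (and the lock) until mount setup is done, then close
	 * it so udev can process the device normally. */
	return fd;
}
```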

@cgwalters
Contributor Author

The file-backed mount support is great, though given that we want composefs to be deployable even on older OSes/distros, it's probably worth a brief investigation into whether we can mitigate some of the loopback overhead.

One specific reporter dug in a bit and said udev launching "nfsrahead" is a cause of overhead:

$ grep -r nfsrahead /usr/lib/udev
/usr/lib/udev/rules.d/99-nfs.rules:SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/usr/libexec/nfsrahead %k", ATTR{read_ahead_kb}="%c"

which... (source code here) looks like it scans every block device to see whether it's mounted somewhere via NFS? (wait, what?????)
