man 5 btrfs - compress-force warnings missing #960

Open
auroraanon38 opened this issue Feb 17, 2025 · 4 comments
Labels
docs Changes in documentation or help text

Comments

@auroraanon38

According to users on #btrfs, compress-force has a side effect which limits uncompressed extent size to 512KB instead of 128MB, leading to gratuitous fragmentation for incompressible data.

This should be documented in the manpage for mount options (usually man 5 btrfs) under compress, compress=<type[:level]>, compress-force, compress-force=<type[:level]>.

Fragmentation is a far more serious issue in many hardware configurations than a small amount of additional compute load.

@adam900710
Collaborator

The limit is 128K, not 512K, and the limit comes from compression itself.

Fragmentation in btrfs is not really related to the extent size limit, which is a little counter-intuitive.
The biggest source of fragmentation is in fact data COW itself: a lot of small random writes will create a lot of small extents, and no sane extent size limit can avoid that.

On the other hand, the smaller the extent size, the smaller the bookend problem (extra space that is no longer referenced but can only be released when the whole original extent is released).

It's not an easy task to educate all end users about this, and it will be more convincing if you can provide a real-world scenario where smaller extent sizes are really causing problems.

@Zygo

Zygo commented Feb 17, 2025

The 512K size limit applies to uncompressed extents when the compress-force mount option is used and compression fails. This arises from this code in fs/btrfs/inode.c:

$ git grep -p -C9 SZ_512K fs/btrfs
fs/btrfs/inode.c=static bool run_delalloc_compressed(struct btrfs_inode *inode,
fs/btrfs/inode.c-                                   struct folio *locked_folio, u64 start,
fs/btrfs/inode.c-                                   u64 end, struct writeback_control *wbc)
fs/btrfs/inode.c-{
[...]
fs/btrfs/inode.c:       u64 num_chunks = DIV_ROUND_UP(end - start, SZ_512K);
[...]
fs/btrfs/inode.c-       for (i = 0; i < num_chunks; i++) {
fs/btrfs/inode.c:               u64 cur_end = min(end, start + SZ_512K - 1);
[...]

The effect is trivial to reproduce:

# mount -oremount,compress-force=zstd /media/testfs
# for x in $(seq 0 10); do cat /boot/vmlinuz; done > /media/testfs/forced
# btrfs-search-metadata file /media/testfs/forced | head
inode objectid 4030 generation 1603 transid 1603 size 0 nbytes 0 block_group 0 mode 0100644 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref list objectid 4030 parent_objectid 256 size 1
  inode ref index 6502 name utf-8 'forced'
extent data at 0 generation 1603 ram_bytes 131072 compression zlib type regular disk_bytenr 178401525760 disk_num_bytes 126976 offset 0 num_bytes 131072
extent data at 131072 generation 1603 ram_bytes 393216 compression none type regular disk_bytenr 179480731648 disk_num_bytes 393216 offset 0 num_bytes 393216
extent data at 524288 generation 1603 ram_bytes 524288 compression none type regular disk_bytenr 179481124864 disk_num_bytes 524288 offset 0 num_bytes 524288
extent data at 1048576 generation 1603 ram_bytes 524288 compression none type regular disk_bytenr 179481649152 disk_num_bytes 524288 offset 0 num_bytes 524288
extent data at 1572864 generation 1603 ram_bytes 524288 compression none type regular disk_bytenr 179482173440 disk_num_bytes 524288 offset 0 num_bytes 524288
extent data at 2097152 generation 1603 ram_bytes 524288 compression none type regular disk_bytenr 179482697728 disk_num_bytes 524288 offset 0 num_bytes 524288
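
Another quick way to see the fragmentation, without reading the metadata directly, is filefrag (which prints the total number of extents) and compsize (which reports the regular extent count plus the compressed/uncompressed split), assuming both tools are installed:

# filefrag /media/testfs/forced
# compsize /media/testfs/forced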

Changing SZ_512K to SZ_2M in the kernel increases the extent size:

# mount -oremount,compress-force=zstd /media/testfs
# for x in $(seq 0 10); do cat /boot/vmlinuz; done > /media/testfs/forced
# btrfs-search-metadata file /media/testfs/forced | head
inode objectid 528 generation 33758 transid 33758 size 0 nbytes 0 block_group 0 mode 0100644 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref list objectid 528 parent_objectid 256 size 1
  inode ref index 66 name utf-8 'forced'
extent data at 0 generation 33758 ram_bytes 131072 compression zlib type regular disk_bytenr 891896221696 disk_num_bytes 126976 offset 0 num_bytes 131072
extent data at 131072 generation 33758 ram_bytes 1966080 compression none type regular disk_bytenr 902082068480 disk_num_bytes 1966080 offset 0 num_bytes 1966080
extent data at 2097152 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902020726784 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 4194304 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902084034560 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 6291456 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902086131712 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 8388608 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902100811776 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 10485760 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902167789568 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 12582912 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902193479680 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 14680064 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902244073472 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 16777216 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902356271104 disk_num_bytes 2097152 offset 0 num_bytes 2097152
extent data at 18874368 generation 33758 ram_bytes 2097152 compression none type regular disk_bytenr 902377897984 disk_num_bytes 2097152 offset 0 num_bytes 2097152

As you can see, whatever size is used in that function becomes the upper bound on uncompressed extent size.
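
For a quick sanity check of that bound on any given file, the uncompressed extent sizes can be pulled straight out of the btrfs-search-metadata output above (a sketch that assumes the field layout shown there, where num_bytes is the last field of each extent data line):

# btrfs-search-metadata file /media/testfs/forced | awk '/compression none/ { print $NF }' | sort -n | uniq -c

Every size printed should be at or below the 512K (or patched 2M) batch size.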

This problem is specific to compress-force because of another piece of kernel code:

/*
 * Check if the inode needs to be submitted to compression, based on mount
 * options, defragmentation, properties or heuristics.
 */
static inline int inode_need_compress(struct btrfs_inode *inode, u64 start,
                                      u64 end)
{
[...]
        /* force compress */
        if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
                return 1;
        /* defrag ioctl */
        if (inode->defrag_compress)
                return 1;
        /* bad compression ratios */
        if (inode->flags & BTRFS_INODE_NOCOMPRESS)
                return 0;
        if (btrfs_test_opt(fs_info, COMPRESS) ||
            inode->flags & BTRFS_INODE_COMPRESS ||
            inode->prop_compress)
                return btrfs_compress_heuristic(inode, start, end);
        return 0;
}

The kernel checks for the compress-force mount option first (FORCE_COMPRESS) and runs compression, regardless of whether the compression succeeds or not. This leads to the 512K extent sizes, because compression is broken up into 512K batches for some reason. I don't know what the reason is. Presumably it has something to do with distributing compression across multiple CPU cores. The 512K size was set in 2008, and seems like a good number based on the performance of that era's CPU core counts, but it should be revisited for modern CPUs where even low-end parts have dozens of cores.

For my experiment above I simply changed the size to SZ_2M. That works, in the sense that the code still functions and the filesystem doesn't explode, but I have no idea whether it is good or bad for performance. In theory, if we set this to SZ_128M, it could fix the extent size limit of compress-force.

If compress-force is not used, the code falls through to check BTRFS_INODE_NOCOMPRESS (aka chattr +m) which is set when the compression fails (that part isn't shown above). So if there's a write to a file that doesn't compress, and compress-force isn't used, then btrfs will disable any further compression in the file, and the limit on extent size goes back to 128M because we're no longer running it through run_delalloc_compressed.
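
As an aside, that flag can be inspected and toggled by hand, which is handy for checking whether btrfs has already marked a file as incompressible (this assumes a chattr/lsattr new enough to know the 'm' no-compression attribute; the path is just a placeholder):

# lsattr /media/testfs/somefile
# chattr +m /media/testfs/somefile
# chattr -m /media/testfs/somefile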

If neither compress-force nor BTRFS_INODE_NOCOMPRESS is set, then btrfs runs a heuristic to see if compression is likely to work; if the heuristic says no, the extent isn't compressed, otherwise compression is attempted. If you write some data that seems compressible, but actually isn't, then BTRFS_INODE_NOCOMPRESS gets set and the extent size goes back to 128M.

If we get all the way to the end of the function, and it returns 1, and compression succeeds, then the (logical) extent size is limited to 128K because that's the size limit for compressed extents.

There's another effect of compress-force here: it can lead to some smaller-than-512K uncompressed extents in the file if they are in the same 512K region as a compressed extent. You can see this effect in the first btrfs-search-metadata output above, where there's a compressed 128K extent followed by 384K of uncompressed extent, because run_delalloc_compressed broke the write into 512K batches and then failed to compress 3 out of 4 128K regions in that first batch. In the second run with a 2M batch size, there's a 128K compressed extent followed by 2M-128K of uncompressed extent, then the 2M batch size takes over for the following extents.

> it will be more convincing if you can provide a real-world scenario where smaller extent sizes are really causing problems.

Easy:

  1. Create a 20-100 TiB btrfs filesystem
  2. mount -ocompress-force=zstd that filesystem
  3. Copy data onto it
  4. Run a balance, or try to read the data, and remark how amazingly slow it is, and how large the metadata is, compared to a filesystem where -ocompress=zstd (without -force) was used in step 2.

I say it's an easy test case because I have already run into this several times accidentally, before getting the hint and removing -force from the mount options.
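
A rough sketch of that comparison, assuming a scratch filesystem mounted at /media/testfs and a mostly-incompressible data set under /srv/data (both paths are placeholders), run once with compress-force=zstd and once with plain compress=zstd:

# mount -o remount,compress-force=zstd /media/testfs
# cp -a /srv/data /media/testfs/copy && sync
# echo 3 > /proc/sys/vm/drop_caches
# time find /media/testfs/copy -type f -exec cat {} + > /dev/null
# btrfs filesystem df /media/testfs
# time btrfs balance start -dusage=100 /media/testfs

The read time, the Metadata line of btrfs filesystem df, and the balance runtime are the numbers to compare between the two runs.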

> On the other hand, the smaller the extent size, the smaller the bookend problem (extra space that is no longer referenced but can only be released when the whole original extent is released).

Not really relevant for this kind of use case. If a user is writing files with 128M extents, and they want to read them quickly, it's very common that they're not overwriting the files, so they won't encounter the bookending problem.

Bookending usually only comes up when people use prealloc or indiscriminate defrag. This issue is about getting sensible extent sizes from sequential writes.

@Forza-tng
Contributor

> In theory, if we set this to SZ_128M, it could fix the extent size limit of compress-force.

But only for uncompressed extents during compress-force?

Would it be possible to increase compressed extent sizes too, or is 128k assumed in too many places? Bigger extents would probably make more sense now that media is so much faster. At 1GiB/s, the latency per KiB is only 1us, or 1ms per MiB.

@Zygo

Zygo commented Feb 28, 2025

> But only for uncompressed extents during compress-force?

For compress without -force, the 512K issue doesn't seem to occur--extents are 128M long even in files with some compression. That's an anomaly--based on what I know so far, I would expect a high rate of 512K-alignment in extent refs in files with a mix of compressible and incompressible data, but so far I haven't observed that.

> Would it be possible to increase compressed extent sizes too, or is 128k assumed in too many places?

Check fs/btrfs/compression.h where there is a rationale for 128K in the comments. Note that rationale was written in 2008 (in inode.c), when CPU and IO subsystems had different capabilities, but also note that the world today is full of embedded ARM devices using SD/MMC storage, which match the performance of circa-2008 hardware.

I think it's mostly a matter of:

  1. change #define BTRFS_MAX_COMPRESSED SZ_128K (and MAX_UNCOMPRESSED) in the kernel
  2. update a few external tools that need to know the size (btrfs check, maybe btrfs fi defrag's default extent sizes?)
  3. make sure the compression heuristics still work with the new size (e.g. no integer overflows)

...but I haven't tested it.
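
One way to confirm whether a given build actually changes the cap is to read back the largest compressed extent in a test file, using the same btrfs-search-metadata output format as earlier in this thread (ram_bytes is the 8th field of the extent data lines; on a stock kernel the result should never exceed 131072):

# btrfs-search-metadata file /media/testfs/forced | awk '/compression (zlib|lzo|zstd)/ { print $8 }' | sort -n | tail -n 1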

There's a lot of drama around that change: it would need an incompat flag, and maybe some bytes in the superblock or a new tree item to record the size if it becomes configurable. Performance for some workloads would improve, while other workloads would suffer.

The filesystem would be unmountable on small-memory machines, although a machine small enough to count as "small" probably also can't use the higher supported zstd levels or run btrfs balance, so that might be a moot point. grub is the ultimate in small-memory btrfs machines, but there's an easy workaround for grub: don't compress the kernel or ramdisk files.

To pay for the drama, there would have to be a large, provable benefit for making the change. On low-latency NVMe devices and modern CPUs, the metadata processing gains from reducing the number of compressed extents for large files may be negligible. On the other hand, the cost of unnecessary decompression for seeky workloads might also be negligible.

This is all easy for someone with some spare time to test, and hopefully post the results.

@kdave kdave added the docs Changes in documentation or help text label Mar 3, 2025