forked from torvalds/linux
Redfs support for rhel kernel-5.14.0-503.35.1.el9_5 #3
Open
openunix wants to merge 134 commits into DDNStorage:redfs-rhel9_5-513.35.1 from openunix:redfs-rhel9_5-513.35.1
Conversation
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended purpose is exactly the same. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: Miklos Szeredi <[email protected]> Reviewed-by: Miklos Szeredi <[email protected]> Signed-off-by: Jan Kara <[email protected]> (cherry picked from commit 8bcbbe9)
There is a potential race between fuse_read_interrupt() and fuse_request_end():
TASK1 in fuse_read_interrupt(): delete req->intr_entry (while holding fiq->lock)
TASK2 in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock; wake up TASK3
TASK3: request is freed
TASK1 in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***
Fix by always grabbing fiq->lock if the request was ever interrupted (FR_INTERRUPTED set), thereby serializing with concurrent fuse_read_interrupt() calls. FR_INTERRUPTED is set before the request is queued on fiq->interrupts. Dequeuing the request is done with list_del_init() but FR_INTERRUPTED is not cleared in this case. Reported-by: lijiazi <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit e1e71c1)
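A minimal sketch of the serialization described above, assuming the usual fuse_req/fuse_iqueue fields (not the exact upstream hunk):

    /* In fuse_request_end(): if the request was ever interrupted, take
     * fiq->lock before touching intr_entry so a concurrent
     * fuse_read_interrupt() cannot still be dereferencing the request. */
    if (test_bit(FR_INTERRUPTED, &req->flags)) {
        spin_lock(&fiq->lock);
        list_del_init(&req->intr_entry);
        spin_unlock(&fiq->lock);
    }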
Callers of fuse_writeback_range() assume that the file is ready for modification by the server in the supplied byte range after the call returns. If there's a write that extends the file beyond the end of the supplied range, then the file needs to be extended to at least the end of the range, but currently that's not done. There are at least two cases where this can cause problems: - copy_file_range() will return short count if the file is not extended up to end of the source range. - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file, hence the region may not be fully allocated. Fix by flushing writes from the start of the range up to the end of the file. This could be optimized if the writes are non-extending, etc, but it's probably not worth the trouble. Fixes: a2bc923 ("fuse: fix copy_file_range() in the writeback case") Fixes: 6b1bdb5 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)") Cc: <[email protected]> # v5.2 Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 59bda8e)
The struct fuse_conn argument is not used and can be removed. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit a9667ac)
In writeback cache mode mtime/ctime updates are cached, and flushed to the server using the ->write_inode() callback. Closing the file will result in a dirty inode being immediately written, but in other cases the inode can remain dirty after all references are dropped. This results in the inode being written back from reclaim, which can deadlock on a regular allocation while the request is being served. The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because serving a request involves unrelated userspace process(es). Instead do the same as for dirty pages: make sure the inode is written before the last reference is gone. - fallocate(2)/copy_file_range(2): these call file_update_time() or file_modified(), so flush the inode before returning from the call - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so flush the ctime directly from this helper Reported-by: chenguanyou <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 5c791fe)
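A hedged sketch of the "flush before the last reference is gone" idea; the helper name here is assumed for illustration:

    /* Hypothetical helper: push cached mtime/ctime to the server via
     * ->write_inode() before the caller drops its reference. */
    static void fuse_flush_time_update(struct inode *inode)
    {
        int err = sync_inode_metadata(inode, 1);   /* 1 = wait for completion */

        mapping_set_error(inode->i_mapping, err);
    }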
Fuse ->release() is otherwise asynchronous for the reason that it can happen in contexts unrelated to close/munmap. Inode is already written back from fuse_flush(). Add it to fuse_vma_close() as well to make sure inode dirtying from mmaps also gets written out before the file is released. Also add error handling. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 36ea233)
Add missing inode lock annotation; found by syzbot. Reported-and-tested-by: [email protected] Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit bda9a71)
Due to the introduction of kmap_local_*, the storage of slots used for short-term mapping has changed from per-CPU to per-thread. kmap_atomic() disables preemption, while kmap_local_*() only disables migration. There is no need to disable preemption in the several kmap_atomic places used in fuse. Link: https://lwn.net/Articles/836144/ Signed-off-by: Peng Hao <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 5fe0fc9)
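The conversion pattern being referred to looks roughly like this (illustrative, not a specific hunk from this series):

    /* before: kmap_atomic() disables preemption for the mapping window */
    void *addr = kmap_atomic(page);
    memcpy(addr + offset, src, len);
    kunmap_atomic(addr);

    /* after: kmap_local_page() only disables migration */
    addr = kmap_local_page(page);
    memcpy(addr + offset, src, len);
    kunmap_local(addr);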
'ia->io=io' has been set in fuse_io_alloc. Signed-off-by: Peng Hao <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit b5d9758)
Logically it belongs there since attributes are invalidated due to the updated ctime. This is a cleanup and should not change behavior. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 371e8fd)
Use list_first_entry_or_null() instead of list_empty() + list_entry(). Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 84840ef)
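The cleanup pattern, sketched (list head and entry type chosen for illustration):

    struct fuse_req *req;

    /* before */
    req = NULL;
    if (!list_empty(&fiq->pending))
        req = list_entry(fiq->pending.next, struct fuse_req, list);

    /* after */
    req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);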
Rename didn't decrement/clear nlink on overwritten target inode. Create a common helper fuse_entry_unlinked() that handles this for unlink, rmdir and rename. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit cefd1b8)
The fuse_iget() call in create_new_entry() already updated the inode with all the new attributes and incremented the attribute version. Incrementing the nlink will result in the wrong count. This wasn't noticed because the attributes were invalidated right after this. Updating ctime is still needed for the writeback case when the ctime is not refreshed. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 97f044f)
Only invalidate attributes that the operation might have changed. Introduce two constants for common combinations of changed attributes: FUSE_STATX_MODIFY: file contents are modified but not size FUSE_STATX_MODSIZE: size and/or file contents modified Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit fa5eee5)
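The two combinations, expressed as the description above suggests (treat the exact bit sets as illustrative):

    /* file contents modified, but not the size */
    #define FUSE_STATX_MODIFY   (STATX_MTIME | STATX_CTIME | STATX_BLOCKS)
    /* size and/or file contents modified */
    #define FUSE_STATX_MODSIZE  (FUSE_STATX_MODIFY | STATX_SIZE)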
The attribute version in fuse_inode should be updated whenever the attributes might have changed on the server. In case of cached writes this is not the case, so updating the attr_version is unnecessary and could possibly affect performance. Open code the remaining part of fuse_write_update_size(). Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 8c56e03)
This function already updates the attr_version in fuse_inode, regardless of whether the size was changed or not. Rename the helper to fuse_write_update_attr() to reflect the more generic nature. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 27ae449)
Extend the fuse_write_update_attr() helper to invalidate cached attributes after a write. This has already been done in all cases except in fuse_notify_store(), so this is mostly a cleanup. fuse_direct_write_iter() calls fuse_direct_IO() which already calls fuse_write_update_attr(), so don't repeat that again in the former. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit d347739)
A READ request returning a short count is taken as indication of EOF, and the cached file size is modified accordingly. Fix the attribute version checking to allow for changes to fc->attr_version on other inodes. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 484ce65)
It's safe to call file_update_time() if writeback cache is not enabled, since S_NOCMTIME is set in this case. This part is purely a cleanup. __fuse_copy_file_range() also calls fuse_write_update_attr() only in the writeback cache case. This is inconsistent with other callers, where it's called unconditionally. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 20235b4)
There are two instances of "bool is_wb = fc->writeback_cache" where the actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)". Clean up these cases by storing the second condition in the local variable. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit c15016b)
In case of writeback_cache fuse_fillattr() would revert the queried attributes to the cached version. Move this to fuse_change_attributes() in order to manage the writeback logic in a central helper. This will be necessary for patches that follow. Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling fuse_change_attributes(), so this should not change behavior. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 04d82db)
If writeback_cache is enabled, then the size, mtime and ctime attributes of regular files are always valid in the kernel's cache. They are retrieved from userspace only when the inode is freshly looked up. Add a more generic "cache_mask", that indicates which attributes are currently valid in cache. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 4b52f05)
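A sketch of what such a cache-mask helper could look like, given the description above (helper name assumed):

    /* Attributes that are always valid in the kernel's cache: size, mtime
     * and ctime of regular files when writeback cache is enabled. */
    static u32 fuse_get_cache_mask(struct inode *inode)
    {
        struct fuse_conn *fc = get_fuse_conn(inode);

        if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
            return 0;

        return STATX_MTIME | STATX_CTIME | STATX_SIZE;
    }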
When deciding to send a GETATTR request take into account the cache mask (which attributes are always valid). The cache mask takes precedence over the invalid mask. This results in the GETATTR request not being sent unnecessarily. Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit ec85537)
fuse_update_attributes() refreshes metadata for internal use. Each use needs a particular set of attributes to be refreshed, but currently that cannot be expressed and all but atime are refreshed. Add a mask argument, which lets fuse_update_get_attr() decide based on the cache_mask and the inval_mask whether a GETATTR call is needed or not. Reported-by: Yongji Xie <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit c6c745b)
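For example, a caller that only cares about size and mtime could now request just those; the call shape below is assumed from the description:

    /* Only send GETATTR if size/mtime are actually invalid in cache. */
    err = fuse_update_attributes(inode, file, STATX_SIZE | STATX_MTIME);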
It is possible to trigger a crash by splicing anon pipe bufs to the fuse device. The reason for this is that anon_pipe_buf_release() will reuse buf->page if the refcount is 1, but that page might have already been stolen and its flags modified (e.g. PG_lru added). This happens in the unlikely case of fuse_dev_splice_write() getting around to calling pipe_buf_release() after a page has been stolen, added to the page cache and removed from the page cache. Fix by calling pipe_buf_release() right after the page was inserted into the page cache. In this case the page has an elevated refcount so any release function will know that the page isn't reusable. Reported-by: Frank Dinoff <[email protected]> Link: https://lore.kernel.org/r/CAAmZXrsGg2xsP1CK+cbuEMumtrqdvD-NKnWzhNcvn71RV3c1yw@mail.gmail.com/ Fixes: dd3bb14 ("fuse: support splice() writing to fuse device") Cc: <[email protected]> # v2.6.35 Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 712a951)
Checking buf->flags should be done before the pipe_buf_release() is called on the pipe buffer, since releasing the buffer might modify the flags. This is exactly what page_cache_pipe_buf_release() does, and which results in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was trying to fix. Reported-by: Justin Forbes <[email protected]> Fixes: 712a951 ("fuse: fix page stealing") Cc: <[email protected]> # v2.6.35 Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 4734417)
The acceptable maximum value of lend parameter in filemap_write_and_wait_range() is LLONG_MAX rather than -1. And there is also some logic depending on LLONG_MAX check in write_cache_pages(). So let's pass LLONG_MAX to filemap_write_and_wait_range() in fuse_writeback_range() instead. Fixes: 59bda8e ("fuse: flush extending writes") Signed-off-by: Xie Yongji <[email protected]> Cc: <[email protected]> # v5.15 Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit e388164)
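With this fix, fuse_writeback_range() would look roughly like the sketch below (based on the two commits above; 'end' is kept for the callers but no longer limits the flush):

    static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
    {
        /* Flush from 'start' to EOF; LLONG_MAX (not -1) is the value
         * filemap_write_and_wait_range() and write_cache_pages() expect. */
        int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);

        if (!err)
            fuse_sync_writes(inode);

        return err;
    }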
This is in preparation for the per inode DAX checking that follows. Signed-off-by: Jeffle Xu <[email protected]> Reviewed-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit cecd491)
We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. The following behavior is consistent with that on ext4/xfs: - The default behavior (when neither '-o dax' nor '-o dax=always|never|inode' option is specified) is equal to 'inode' mode, while 'dax=inode' won't be printed among the mount option list. - The 'inode' mode is only advisory. It will silently fallback to 'never' mode if fuse server doesn't support that. Also noted that by the time of this commit, 'inode' mode is actually equal to 'always' mode, before the per inode DAX flag is introduced in the following patch. Signed-off-by: Jeffle Xu <[email protected]> Reviewed-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 780b1b9)
Expand the fuse protocol to support per inode DAX. The FUSE_HAS_INODE_DAX flag is added, indicating whether the fuse server/client supports per inode DAX. It can be conveyed in both the FUSE_INIT request and reply. The FUSE_ATTR_DAX flag is added, indicating whether DAX shall be enabled for the corresponding file. It is conveyed in the FUSE_LOOKUP reply. Signed-off-by: Jeffle Xu <[email protected]> Reviewed-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 98046f7)
When mounting a user-space filesystem using io_uring, the initialization of the rings is done separately in the server side. If for some reason (e.g. a server bug) this step is not performed it will be impossible to unmount the filesystem if there are already requests waiting. This issue is easily reproduced with the libfuse passthrough_ll example, if the queue depth is set to '0' and a request is queued before trying to unmount the filesystem. When trying to force the unmount, fuse_abort_conn() will try to wake up all tasks waiting in fc->blocked_waitq, but because the rings were never initialized, fuse_uring_ready() will never return 'true'. Fixes: 3393ff9 ("fuse: block request allocation until io-uring init is complete") Signed-off-by: Luis Henriques <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Miklos Szeredi <[email protected]> Reviewed-by: Bernd Schubert <[email protected]> Signed-off-by: Christian Brauner <[email protected]> (cherry picked from commit d550114) (cherry picked from commit 3a7a4e7)
task-A (application) might be in request_wait_answer and try to remove the request while it has FR_PENDING set. task-B (a fuse-server io-uring task) might handle this request with FUSE_IO_URING_CMD_COMMIT_AND_FETCH when fetching the next request, and access the req from the pending list in fuse_uring_ent_assign_req(). That code path was not protected by fiq->lock and so might race with task-A. For scaling reasons we'd better not use fiq->lock; instead add a handler to remove canceled requests from the queue. This also removes the usage of fiq->lock from fuse_uring_add_req_to_ring_ent() altogether, as it was there only to protect against this race and was incomplete. Also added is a comment on why FR_PENDING is not cleared. Fixes: c090c8a ("fuse: Add io-uring sqe commit and fetch support") Cc: <[email protected]> # v6.14 Reported-by: Joanne Koong <[email protected]> Closes: https://lore.kernel.org/all/CAJnrk1ZgHNb78dz-yfNTpxmW7wtT88A=m-zF0ZoLXKLUHRjNTw@mail.gmail.com/ Signed-off-by: Bernd Schubert <[email protected]> Reviewed-by: Joanne Koong <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 09098e6) (cherry picked from commit bbe1041)
Function fuse_uring_create() is used only from dev_uring.c and does not need to be exposed in the header file. Furthermore, it has the wrong signature. While there, also remove the 'struct fuse_ring' forward declaration. Signed-off-by: Luis Henriques <[email protected]> Reviewed-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 841c7b8) (cherry picked from commit 54d9a86)
fuse_notify_inval_entry and fuse_notify_delete were using fixed allocations of FUSE_NAME_MAX to hold the file name. Such large buffers are often not needed as file names might be smaller, so this uses the actual file name size to do the allocation. Signed-off-by: Bernd Schubert <[email protected]> Reviewed-by: Jingbo Xu <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 2412085) (cherry picked from commit 65e7afa)
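Sketch of the allocation change (the notify-handler variable names are assumptions for illustration):

    /* Allocate only what the notified name needs (plus NUL) instead of
     * a fixed FUSE_NAME_MAX-sized buffer. */
    char *name = kzalloc(outarg.namelen + 1, GFP_KERNEL);
    if (!name)
        return -ENOMEM;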
...when calling fuse_iget(). Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 253e524) (cherry picked from commit 4f1ba11)
Function fuse_direntplus_link() might call fuse_iget() to initialize a new fuse_inode and change its attributes. If fi->attr_version is always initialized with 0, then even if the attributes returned by the FUSE_READDIR request are stale, fuse_change_attributes() will still apply the stale attributes to the inode because the new fi->attr_version is 0. This wrong behaviour may cause file size inconsistency even when there are no changes from the server side. To reproduce the issue, consider the following two programs (A and B) running concurrently, where /fusemnt/dir/f is a file in a fuse mount and the size of f is 0:
A: readdir(/fusemnt/dir) starts; the daemon sets size 0 in the f direntry
B: fallocate(f, 1024)
B: stat(f) // B sees size 1024
B: echo 2 > /proc/sys/vm/drop_caches
A: readdir(/fusemnt/dir) reply reaches the kernel; the kernel sets size 0 on the I_NEW inode
B: stat(f) // B sees size 0
In the above case, only program B is modifying the file size, yet B observes the file size changing between the two 'readonly' stat() calls. To fix this issue, we should make sure readdirplus still follows the rule of attr_version staleness checking even if fi->attr_version is lost due to inode eviction. To identify this situation, the new fc->evict_ctr is used to record whether the eviction of inodes occurred during readdirplus request processing. If it did, the result of readdirplus may be inaccurate; otherwise, the result of readdirplus can be trusted. Although this may still lead to incorrect invalidation, considering the relatively low frequency of evictions, it should be acceptable. Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Jiachen Zhang <[email protected]> Suggested-by: Miklos Szeredi <[email protected]> Signed-off-by: Zhang Tianci <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit fabde09)
A readdir buffer of 4K might be just enough to read a single file name at a time - increase the buffer size to the max_pages. Reviewed-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit d77db6e)
…MAX_PAGES) Introduce the capability to dynamically configure the max pages limit (FUSE_MAX_MAX_PAGES) through a sysctl. This allows system administrators to dynamically set the maximum number of pages that can be used for servicing requests in fuse. Previously, this is gated by FUSE_MAX_MAX_PAGES which is statically set to 256 pages. One result of this is that the buffer size for a write request is limited to 1 MiB on a 4k-page system. The default value for this sysctl is the original limit (256 pages). $ sysctl -a | grep max_pages_limit fs.fuse.max_pages_limit = 256 $ sysctl -n fs.fuse.max_pages_limit 256 $ echo 1024 | sudo tee /proc/sys/fs/fuse/max_pages_limit 1024 $ sysctl -n fs.fuse.max_pages_limit 1024 $ echo 65536 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 0 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 65535 | sudo tee /proc/sys/fs/fuse/max_pages_limit 65535 $ sysctl -n fs.fuse.max_pages_limit 65535 Signed-off-by: Joanne Koong <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 2b3933b) (cherry picked from commit b023958)
No need to take the lock; this can be done atomically. fuse-io-uring and virtiofs especially benefit from it as they don't need the fiq lock at all. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 47b2694)
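A sketch of an atomic unique counter, assuming fiq->reqctr is converted to atomic64_t as described:

    static u64 fuse_get_unique(struct fuse_iqueue *fiq)
    {
        /* No fiq->lock needed: IDs advance in FUSE_REQ_ID_STEP increments. */
        return atomic64_add_return(FUSE_REQ_ID_STEP, &fiq->reqctr);
    }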
This is especially needed for better ftrace analysis, for example to build histograms. So far the request unique was missing, because it was added after the first trace message. IDs/req-unique now might not come up perfectly sequentially anymore, but especially with cloned device or io-uring this did not work perfectly anyway. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 4415892)
I've been timing various fuse operations and it's quite annoying to do with kprobes. Add two tracepoints for sending and ending fuse requests to make it easier to debug and time various operations. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 396b209) (cherry picked from commit 6e77e0e)
fuse_uring_send_next_to_ring() can just call into fuse_uring_send() and avoid code duplication. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 9efaa8d)
Rename trace_fuse_request_send to trace_fuse_request_enqueue. Add trace_fuse_request_send. Add trace_fuse_request_bg_enqueue. Add trace_fuse_request_enqueue. This helps to track entire request time and time in different queues. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 4a7f142)
Our file system has a translation capability for S3-to-posix. The current value of 1kiB is enough to cover S3 keys, but does not allow encoding of %xx escape characters. The limit is increased to (PATH_MAX - 1), as we need 3 x 1024 and that is close to PATH_MAX (4kB) already. -1 is used as the terminating null is not included in the length calculation. Testing large file names was hard with libfuse/example file systems, so I created a new memfs that does not have a 255 file name length limitation. libfuse/libfuse#1077 The connection is initialized with FUSE_NAME_LOW_MAX, which is set to the previous value of FUSE_NAME_MAX of 1024. With FUSE_MIN_READ_BUFFER of 8192 that is enough for two file names + fuse headers. When FUSE_INIT reply sets max_pages to a value > 1 we know that fuse daemon supports request buffers of at least 2 pages (+ header) and can therefore hold 2 x PATH_MAX file names - operations like rename or link that need two file names are no issue then. Signed-off-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 27992ef) (cherry picked from commit 573e7ab)
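The limits described above, expressed as constants (values taken from the description; treat the exact definitions as illustrative):

    #define FUSE_NAME_LOW_MAX  1024            /* previous FUSE_NAME_MAX, used at init time */
    #define FUSE_NAME_MAX      (PATH_MAX - 1)  /* terminating NUL not counted */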
open_by_handle_at(2) can fail with -ESTALE with a valid handle returned by a previous name_to_handle_at(2) for evicted fuse inodes, which is especially common when entry_valid_timeout is 0, e.g. when the fuse daemon is in "cache=none" mode. The time sequence is like: name_to_handle_at(2) # succeed evict fuse inode open_by_handle_at(2) # fail The root cause is that, with 0 entry_valid_timeout, the dput() called in name_to_handle_at(2) will trigger iput -> evict(), which will send FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send a new FUSE_LOOKUP request upon inode cache miss since the previous inode eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with -ENOENT as the cached metadata of the requested inode has already been cleaned up during the previous FUSE_FORGET. The returned -ENOENT is treated as -ESTALE when open_by_handle_at(2) returns. This confuses the application somehow, as open_by_handle_at(2) fails when the previous name_to_handle_at(2) succeeds. The returned errno is also confusing as the requested file is not deleted and already there. It is reasonable to fail name_to_handle_at(2) early in this case, after which the application can fallback to open(2) to access files. Since this issue typically appears when entry_valid_timeout is 0 which is configured by the fuse daemon, the fuse daemon is the right person to explicitly disable the export when required. Also considering FUSE_EXPORT_SUPPORT actually indicates the support for lookups of "." and "..", and there are existing fuse daemons supporting export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new INIT flag for such purpose. Reviewed-by: Amir Goldstein <[email protected]> Signed-off-by: Jingbo Xu <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit e022f6a) (cherry picked from commit 52da885)
This makes the code a bit easier to read and makes it easier to add further conditions under which an exclusive lock is needed. Signed-off-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 699cf82) (cherry picked from commit f400249)
fuse_finish_open() is called from fuse_open_common() and from fuse_create_open(). In the latter case, the O_TRUNC flag is always cleared in finish_open() before calling into fuse_finish_open(). Move the bits that update the attribute cache after an O_TRUNC open into a helper and call this helper from fuse_open_common() directly. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 0c9d708) (cherry picked from commit 110fb13)
This removes the need to pass the isdir argument to fuse_put_file(). Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit e26ee4e) (cherry picked from commit cbabbdd)
fuse_open_common() has a lot of code relevant only for regular files and O_TRUNC in particular. Copy the little bit of remaining code into fuse_dir_open() and stop using this common helper for directory open. Also split out fuse_dir_finish_open() from fuse_finish_open() before we add inode io modes to fuse_finish_open(). Suggested-by: Miklos Szeredi <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 7de64d5) (cherry picked from commit 8dcafbe)
In preparation for inode io modes, a server open response could fail due to conflicting inode io modes. Allow returning an error from fuse_finish_open() and handle the error in the callers. fuse_finish_open() is used as the callback of finish_open(), so that FMODE_OPENED will not be set if fuse_finish_open() fails. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit d2c487f) (cherry picked from commit c1531c7)
The fuse inode io mode is determined by the mode of its open files/mmaps and parallel dio opens, and is expressed in the value of fi->iocachectr:
> 0 - caching io: files open in caching mode or mmap on direct_io file
< 0 - parallel dio: direct io mode with parallel dio writes enabled
== 0 - direct io: no files open in caching mode and no files mmaped
Note that an iocachectr value of 0 might become positive or negative while non-parallel dio is being processed. direct_io mmap uses the page cache, so the first mmap will mark the file as ff->io_opened and increment fi->iocachectr to enter the caching io mode. If the server opens the file in caching mode while it is already open for parallel dio, or vice versa, the open fails. This allows executing parallel dio when the inode is not in caching mode and no mmaps have been performed on the inode in question. Signed-off-by: Bernd Schubert <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit cb098dd) (cherry picked from commit 0a64c3c)
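A hedged sketch of the open-time gate implied by the iocachectr rules above (the error code and surrounding flow are assumptions, not the exact patch):

    /* Deny a caching open while parallel dio is active, and vice versa. */
    spin_lock(&fi->lock);
    if (fi->iocachectr < 0) {
        spin_unlock(&fi->lock);
        return -ETXTBSY;       /* parallel dio in progress */
    }
    fi->iocachectr++;          /* enter (or stay in) caching io mode */
    spin_unlock(&fi->lock);
    return 0;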
So far this is just a helper to move complex locking logic out of fuse_direct_write_iter. It is especially needed by the next patch in the series, which adds the fuse inode cache IO mode and brings in even more locking complexity. Signed-off-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 9bbb671) (cherry picked from commit a709f7b)
Instead of denying caching mode on parallel dio open, deny caching open only while parallel dio are in-progress and wait for in-progress parallel dio writes before entering inode caching io mode. This allows executing parallel dio when inode is not in caching mode even if shared mmap is allowed, but no mmaps have been performed on the inode in question. An mmap on direct_io file now waits for all in-progress parallel dio writes to complete, so parallel dio writes together with FUSE_DIRECT_IO_ALLOW_MMAP is enabled by this commit. Signed-off-by: Bernd Schubert <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 205c1d8) (cherry picked from commit afe3358)
In some cases fi->writepages may be empty, and then there is no need to check fi->writepages under spin_lock, which may have an impact on performance due to lock contention. For example, in scenarios where multiple readers read the same file without any writers, or where the page cache is not enabled. Also remove the outdated comment, since commit 6b2fb79 ("fuse: optimize writepages search") has already optimized this case by replacing the list with an rb-tree. Signed-off-by: yangyun <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit ac5cffe) (cherry picked from commit ba1236c)
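The lock-free fast path being described can be as simple as the fragment below (sketch):

    /* fi->writepages has been an rb-tree since commit 6b2fb79; an empty
     * tree can be observed without taking fi->lock. */
    if (RB_EMPTY_ROOT(&fi->writepages))
        return false;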
This may be a typo. The comment says shared locks are not allowed when this bit is set. If a shared lock is used, the wait in `fuse_file_cached_io_open` may never end. Fixes: 205c1d8 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP") CC: [email protected] # v6.9 Signed-off-by: yangyun <[email protected]> Reviewed-by: Bernd Schubert <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> (cherry picked from commit 2f3d8ff) (cherry picked from commit bbddfd7)
Due to user buffer misalignment we actually need one more page, i.e. 1025 instead of 1024; this will be handled differently later. For now we just bump up the max. (cherry picked from commit 3f71501)
This is to allow copying into the buffer from the application without the need to copy in ring context (and with that, the need that the ring task is active in kernel space). Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 43d1a63) (cherry picked from commit ea01f94)
If pinned pages are used the application can write into these pages and io_uring_cmd_complete_in_task() is not needed. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 5f0264c)
Add an smp_rmb() before checking list states in fuse_uring_destruct() to ensure proper ordering between list modifications and emptiness checks. During connection teardown lists are checked without holding a lock, and without this barrier the CPU executing fuse_uring_destruct() might see inconsistent list states, leading to false WARN_ON triggers even though the lists have been properly emptied. The smp_rmb() ensures we see the final consistent state of all lists after teardown operations complete on other CPUs. This fixes occasional false WARN_ON triggers during connection teardown. Signed-off-by: Bernd Schubert <[email protected]> (cherry picked from commit 2a889c7)
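A sketch of the barrier placement described above (the queue/list names are assumptions, as this is redfs-specific io-uring teardown code):

    /* Pair with the writes done while tearing down the queues on other CPUs,
     * so the emptiness checks below observe their final state. */
    smp_rmb();
    WARN_ON(!list_empty(&queue->ent_avail_queue));
    WARN_ON(!list_empty(&queue->ent_commit_queue));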
These are the cherry-picks based on branch redfs-ubuntu-noble-6.8.0-58.60-updates@2a889c7f6036.