zfsvfs_hold() leaks VFS s_active reference when z_unmounted is B_TRUE, causing permanent EBUSY on pool export #18309
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu |
| Distribution Version | 24.04.4 LTS (Noble Numbat) |
| Kernel Version | 6.17.0-14-generic (HWE) |
| Architecture | x86_64 |
| OpenZFS Version | 2.3.4-1ubuntu2 (kernel module), 2.2.2-0ubuntu9.4 (userspace) |
Describe the problem you're observing
zpool export fails permanently with EBUSY after ZFS ioctls race with dataset unmount operations. No userspace process, mount, file descriptor, or kernel keyring holds the pool open. The only recovery is a full system reboot.
Root cause: `zfsvfs_hold()` in `module/zfs/zfs_ioctl.c` (line ~1435) leaks a VFS `s_active` reference. When `getzfsvfs()` succeeds (incrementing `s_active` via `zfs_vfs_ref()` -> `atomic_inc_not_zero`) but `z_unmounted` is subsequently found to be `B_TRUE`, the function returns EBUSY without calling `zfs_vfs_rele()`. The companion function `zfsvfs_rele()` (immediately below in the same file) correctly releases both the teardown lock AND the VFS reference; the error path releases only the teardown lock.
The leaked s_active permanently prevents generic_shutdown_super() -> zpl_kill_sb() -> dmu_objset_disown() -> spa_close(), keeping spa_refcount above spa_minref.
Reachability: z_unmounted is set to B_TRUE at three sites in module/os/linux/zfs/zfs_vfsops.c (lines ~1410, ~1920, ~1967), in functions zfsvfs_teardown(), zfs_resume_fs(), and zfs_end_fs(). All three execute from ZFS ioctl handlers that hold an active VFS reference via getzfsvfs(), guaranteeing s_active > 0 when the flag is set. Critically, zfs_resume_fs() releases the teardown write lock at its bail: label BEFORE dmu_objset_disown() runs, creating a window where a concurrent ioctl can acquire the read lock, observe z_unmounted == B_TRUE, and leak s_active.
Evidence from live system:
- Custom kernel module read `s_active = 2` on the ZFS superblock with zero mounts across all 14 mount namespaces
- `deactivate_super()` fires during export attempt; `generic_shutdown_super()` NEVER fires
- `/proc/spl/kstat/zfs/rpool/objset-0x105` persists (dataset alive in kernel despite zero mounts)
- Bug state persisted 20+ hours without self-resolving
- 16 diagnostic commands confirmed zero userspace holders (lsof, fuser, mountinfo, keyrings, systemd units, ARC flush, drop_caches -- all negative)
Proposed fix (PR submitted): Add the missing zfs_vfs_rele() call, guarded by zfs_vfs_held() to handle the zfsvfs_create() fallback path. This matches the existing pattern in zfsvfs_rele().
Describe how to reproduce the problem
The bug requires two concurrent ZFS ioctls racing with dataset unmount:
- IOCTL-A (e.g., `zfs_ioc_recv`) acquires `s_active` via `getzfsvfs()`, then internally calls `zfs_resume_fs()`, which fails, setting `z_unmounted = B_TRUE` and releasing the teardown write lock
- IOCTL-B (e.g., `zfs_ioc_objset_stats` via `zfs get`) calls `zfsvfs_hold()` -> `getzfsvfs()` during the window after the write-lock release but before `dmu_objset_disown()` clears `os_user_ptr`
- IOCTL-B observes `z_unmounted == B_TRUE` and returns EBUSY without `zfs_vfs_rele()`
- `zpool export` fails permanently with EBUSY
Trigger scenario: ZFS-on-root deployment via debootstrap + chroot with mount-over operations, bind mounts, and teardown running zpool set cachefile=none and zpool export -f while lazy unmount is in progress.
Reproducer script: A multi-threaded C program spawning concurrent mount/unmount/ioctl threads is attached. The race window is narrow (nanoseconds); reproduction is intermittent. Live system evidence (s_active=2 persisting 20+ hours) serves as primary proof.
Include any warning/errors/backtraces from the system logs
# zpool export attempt
$ sudo zpool export -f rpool
cannot export 'rpool': pool is busy
# Kernel module proof of leaked s_active
$ sudo insmod /tmp/sb_probe/sb_probe.ko && sudo dmesg | grep sb_probe && sudo rmmod sb_probe
[75229.126131] sb_probe: ZFS sb=ffff8e904d47c000 s_active=2 s_id="zfs" s_flags=0x60010000
# Zero mounts across all namespaces
$ cat /proc/*/mountinfo 2>/dev/null | grep rpool | wc -l
0
# Mount namespace count
$ ls /proc/*/ns/mnt 2>/dev/null | wc -l
14
# Dataset alive in kernel despite zero mounts
$ cat /proc/spl/kstat/zfs/rpool/objset-0x105 | head -3
197 1 0x01 27 7600 123374565221 13205739627447
name type data
dataset_name 7 rpool/ROOT/ubuntu
# ZFS module refcount stuck at 1
$ lsmod | grep zfs
zfs 6823936 1
# spl_delay_taskq counter climbing (12 dispatches/min, all cancelled)
$ awk '/tasks_dispatched_delayed/ {print $3}' /proc/spl/kstat/taskq/spl_delay_taskq.0
2950
# 60 seconds later:
$ awk '/tasks_dispatched_delayed/ {print $3}' /proc/spl/kstat/taskq/spl_delay_taskq.0
2962
# bpftrace: deactivate_super fires, generic_shutdown_super never fires during export
# (kprobe_events missing on this kernel config, attach-only trace)
# Result: deactivate_super called 15+ times during export, generic_shutdown_super: 0 times
# Exhaustive elimination of userspace holders (all empty/negative):
$ zfs mount # (empty)
$ lsof /dev/zfs # (empty)
$ fuser -v /dev/nvme0n1p4 # (empty)
$ ps aux | grep "[z]fs\|[z]ed" # (empty)
$ cat /proc/keys | grep zfs # (empty)
$ systemctl list-units --type=mount | grep zfs # (empty)
$ find /proc/*/fd -lname "*rpool*" 2>/dev/null # (empty)
$ ls -la /proc/*/root 2>/dev/null | grep optane # (empty)
$ echo 3 | sudo tee /proc/sys/vm/drop_caches # no effect on s_active
$ zpool get freeing,leaked rpool # 0, 0
$ zfs list -t snapshot -r rpool # (empty)