Replies: 68 comments 215 replies
-
Found something interesting in a proposed patch in a discussion whose topic was "[PATCH] nvme-pci: fix host memory buffer allocation size", dating from May 10th, 2022. The starting point of the discussion is here => https://www.spinics.net/lists/kernel/msg4339024.html At some point (https://www.spinics.net/lists/kernel/msg4352567.html), it is mentioned that:
Also, in a subsequent message ( https://www.spinics.net/lists/kernel/msg4372632.html ) it is mentioned that the situation has improved drastically with the patch. Another point of the discussion is about the Host Memory Buffer being just 32MB. According to my logs, I have the same allocation. For the record, here are excerpts of some messages:
Current parameters for the nvme kernel modules on my system are at their defaults: Going through the code of The patch in question is mentioned at the very beginning of the discussion and is this one: Another related thread is here => https://lore.kernel.org/linux-nvme/f94565db-f217-4a56-83c3-c6429807185c@t-8ch.de/
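The parameter and log listings from that message did not survive the copy here; as a hedged sketch of how one can check the same things on their own system (assuming nvme-cli is installed and /dev/nvme0 is the drive of interest), the granted HMB size and the nvme module cap can be read like this:

```sh
# Sketch only: check what HMB the kernel actually allocated and what the drive asks for.
dmesg | grep -i "host memory buffer"           # e.g. "nvme nvme0: allocated 32 MiB host memory buffer"
nvme id-ctrl /dev/nvme0 | grep -i hmpre        # HMB size preferred/advertised by the drive
cat /sys/module/nvme/parameters/max_host_mem_size_mb   # module cap (the nvme.max_host_mem_size_mb knob)
```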
-
Above patch tried, but in my case, it worsens the issue :( The crash happens much earlier than before.
-
Basically, at this point I am out of options with those sticks. Those are a replacement for a trio of ADATA Gammix S70 Blade which were also problematic because their namespaces had a bad value for EUI64: basically all were set to eui64=0000000000000000, which made the system totally confused about who was who. So my only option at this point is to get another model :/ Perhaps I will keep them for a much less intensive use. Reality is: not all NVMe hardware can play nicely with ZFS. It seems that investing in higher-end hardware is not optional, especially with ZFS. I won't ever consider switching them back to 512b sectors; I don't think this will solve the issue, and even if it does, there is a significant performance penalty. Hoping my hours of investigation will help someone avoid wasting money on junk hardware. It is a bit disappointing that this junk is coming from a well-known brand. PS: Feel free to further elaborate. I will post if I get something new on this.
-
I would try to replace the PSU with another one, probably a 1000W one.
Often, mysterious problems end up being solved by replacing a faulty PSU.
…On Wed, Apr 26, 2023 at 9:23 AM admnd ***@***.***> wrote:
Above patch tried, but in my case, worsens the issue :( The crash happens
much more early than before.
Fiddling around with parameters of nvme.ko, I managed to have a higher
allocation of 200 MB with nvme.max_host_mem_size_mb=512 + the above patch
applied.
-
This might be a longshot, but where have you connected your NVMe? Did you use the onboard slots or a riser card with bifurcation? And if you used the onboard slots, which ones did you use? From the manual you can see one of the slots shares bandwidth with the SATA ports; if there's anything in there it could cause a problem. Further, X670 daisy-chains 2x the X670 chipset to give more connectivity. A guess of mine is that this issue could be caused by limited bandwidth between the chipsets and the CPU, which might make the controller look like it's dropping. My suggestion to troubleshoot this is to get a bifurcating riser card, put it in the x16 slot, and have all the NVMes directly connected to the CPU. This would eliminate going over the chipsets. Unfortunately ASUS has no block diagram of the board showing where which PCIe lanes go and at which speed. I would also see if limiting the speed of the drives changes anything. PCIe link-speed switching caused me a lot of headaches with my RX 5700 XT GPU; it caused some weird issues of it disconnecting, crashing the drivers, etc. So pretty similar to what you experience. Those two would be my guesses for this issue.
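Along that line of reasoning, the negotiated link of each drive can be compared against its capability; a small sketch (the PCI address is a placeholder, find yours with `lspci | grep -i nvme`):

```sh
# Sketch: LnkCap shows what the drive supports, LnkSta shows what was actually negotiated.
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
```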
-
It's interesting you're having issues with the SN770. I was having issues with mine (2TB as well) in my laptop. ZFS, Btrfs on LVM/LUKS, even ext4: my drive would reset just like yours, whether during boot or when sitting there doing nothing, or something. Seemingly random. I took it to my computer store to get it replaced. Through their testing the drive passed all tests, so they did not replace it. I believe they were testing with Windows. I am going to RMA it with WD; hopefully my replacement performs better. I have the exact same drive in my desktop (X570, 5950X), using a single ZFS vdev as root. I have not experienced these issues there. I would try putting the desktop drive in my laptop (XPS 9560) to see if it has issues, but that would be quite an inconvenience to me. So I am just going to RMA it. The previous drive in my laptop did not have these issues. This stuff occurred with both 512b and 4K sectors, I believe.
-
Other pointers (FreeBSD):
At this point, I have opened a case with WD; perhaps something can be done at their level. As I should have some free time tomorrow, I will try to exchange modules between my two machines.
-
SN770s swapped out for 3x WD SN850 configured in 4K. Day & night! My 7950X is literally breathing again! Over 100K IOPS while emerging GCC 13, and zpool scrubs easily reach 5-6 GB/s. Earlier this afternoon, I tried to swap one module at a time. Guess what? One SN770 quit the pool seconds after the resilvering started, and the second reset in the middle. I had thousands of checksum errors reported. Fortunately I have daily snapshots stored on a TrueNAS box, so not an issue. This junk is not even able to sustain a pool resilvering. So, gentlemen, moral of the story: don't use DRAM-less NVMe stuff with ZFS. Will give news on what happens with my now-famous SN770s when I get the chance :) Perhaps they will do better in my secondary machine or in the junk-box. Thank you, again, for jumping in and taking some of your time to put suggestions here. This is greatly appreciated.
-
Stumbled over this while searching for the consequences of my pool crash.
-
Hello @admnd, I'm experiencing the same problems on my server infrastructure. I recently added this WD NVMe (SN850X) just for some low-spec VMs that I preferred not to run on my main NVMe pool, which is composed of different PM9A3 drives.
-
I don't know if it's related somehow, but here's my 2 cents. I had an SN570 500GB (DRAM-less) NVMe, which was actually quite new (less than 1 year old). I never had any issues initially with ZFS and Gentoo on it, having been using ZFS for the last 5 months. Until recently, when I started noticing random kernel crashes and ZFS status reporting permanent errors while scrubbing. My RAM was perfectly fine, concluding from the fact that memtest86+ reported a pass twice consecutively. To my surprise, upon rebooting to Windows, the WD Dashboard reported that "NVM subsystem reliability has degraded", with 99% lifetime remaining. Even SMART tests started failing. And unfortunately, the drive had to be replaced.
-
Would be cool for a "ZFS NVMe Recommendations List" to come out of this discussion. I imagine SLC and MLC NVMes would be above the rest. What are the other criteria of which ZFS users should be aware when identifying the best SSD hardware?
-
I think I'm suffering from this on an 8TB Corsair MP600 PRO NH used as additional storage for a Proxmox 8 box. rsync seems to trigger it especially. The sledgehammer solution brings back the device for me, but the ZFS pool doesn't come back. I think it is because Proxmox creates the pool with a /dev/nvme0nX device name and the X changes with every "resurrection". I'm going to try ext4 next on that device and see how it goes. I wanted to post here in case there are more people with the same device and similar problems.
-
Just FYI, I had the exact same issue with a brand new WD BLACK SN770, and swapping my PSU solved the issue (while my previous one seemed perfectly fine)...
-
Last time I saw this, it was either a firmware or a hardware issue; an RMA sometimes solves it, if they return you a piece with a newer firmware version or with an internal known defect fixed. I would suggest not buying the same brand & model from the same batch for all vdevs in a pool; that might put you at risk of all disks faulting if there is ever a hardware/firmware/manufacturing issue.
-
FYI: SN580 1TB under Windows 10/11 with the latest firmware (281040wd), the same behaviour after switching to 4096-byte LBA; reproducible with fio with numjobs=4.
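The exact fio job used above is not shown; a hedged, Linux-flavoured sketch of a comparable 4-job random-write run against a test file (parameters are illustrative only):

```sh
# Illustrative reproduction attempt: 4 jobs of 4K random writes for 2 minutes.
fio --name=sn580-4jobs --filename=fio-testfile --size=8G --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=120 --group_reporting
```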
-
Posting here to A) thank the OP, and B) spread the word.

TL/DR SUMMARY: The OP is 100% correct, this IS some kind of a problem between ZFS, the WD drives (SN770, SN850S, SN850XE), and maybe even the underlying hardware. Better said, it's a particular chemistry of calamity that ultimately results in the problems everyone is describing. A drive will randomly drop out of the zpool, write errors will be seen, and generally nothing other than a reboot will reset the drive controller, thus allowing ZFS to resilver and heal the pool. I've spent WAY too much time on this and, ultimately, switching filesystems was the fix. So here is how I got there and maybe some help for you.

DETAILS: I started with the ol' trusty mdadm to build an array from 12 x 4TB SN850XE drives shucked from USB3 cases. Before you say shucked drives are the problem, just know I verified the controller, controller firmware, clock mechanism, and memory chips are identical to the SN850X, available as a standalone drive. For some time I thought the shucking trick was my enemy. Nope. I used some PCIe 4.0 x16 to 4x(x4) adapters found here to place the drives in three of the five PCIe x16 slots available. Supermicro H12SSL-i, AMD Epyc 7352, 256GB of memory. I wasn't happy with the contact mechanism between thermal pad and drive, but more on that later. An mdadm RAID 5 array would fail building itself around the 80-90% mark, every time, for about ten different attempts. RAID 0, 1, 10 were all fine, but not when distributed parity was a player. I changed build flags and settings, sector sizes, an array-of-partitions instead of disks. I went through lvmraid and snapraid (both of which rely on the md subsystem). Failed every time. Another factor here is heat; these little things get HOT. So I switched to these drive carriages, which, because of the screws in the middle of the heatsink, had better contact with the thermal pads used. More mdadm attempts, more failures.

Enter ZFS. I've always been a little shaky with ZFS because of its proximity to the kernel, but building a ZFS pool doesn't carry the bitmap overhead and drive geometry mapping that mdadm has. Building a RAID 5 zpool was a snap and I was mounted with encrypted and unencrypted datasets immediately. But just like all the others above me in this thread, large file transfers and even sustained small file transfers would kill the system.

So next I started digging through dmesg. Since this is a Proxmox box and I use SR-IOV and PCI passthrough religiously, PCI Advanced Error Reporting (AER) and PCI Access Control (ACS) had to be enabled. That instantly produces the below. These errors will show up every few seconds, so make sure you've got log rotation turned on, or you've turned off AER logging with a boot flag, else you're going to be exhausting drive capacity in a few hours. Please let me save you countless hours of digging through kernel dev forums and just tell you this is a complete red herring. AMD Epyc series processors are very "chatty" about the PCIe bus. The slightest re-ordering of bus transaction data results in messages similar to the above, which most CPUs DO NOT flag the kernel over. You may even wind up on a forum where an AMD engineer calls this a firmware errata since corrected in later generations of Epyc and Ryzen CPUs. It's also quite dependent on the underlying northbridge controller in the CPU. Alas, it's a red herring.
And since the hardware error (the WD drive is complaining about transactional re-ordering) is corrected by the drive controller, it's quite "normal" and NOT contributory to the problem.

BACK TO THE STORY: I went through another PCIe card from Dell that can handle the 22110 drives, but still the same failures with ZFS. Sometimes I could get 8 or 10 TB transferred (I used straight CIFS, rsync, NFS, and others), and sometimes just a few GB. Sometimes the pool would pause and the transfer would continue for a while, and then after a hard failure and reboot, resilver itself and heal for the amount of data I was able to transfer. Eventually I switched to SFF-8654 carrier cards and Silverstone active cooler carriages. Heat would not beat me! But still the same problem, nearly repeatable for every zpool flag, feature, anything I could switch on or off.
I placed thermal sensors on the drives and the Silverstone carriages are won-dee-ful. They were keeping the drives around 50 degrees or lower. Heat was thus not a factor. So, in a final effort to maintain allegiance to ZFS, I swapped the drives to a completely Samsung platform (980 Pro). Two things happened: one, the number of AER messages got cut in half; two, no more pool crashes and removed drives! All other things unchanged, that told me that some bit of chemistry between the CPU, the board, the drives, and ZFS was the problem. So I then tested on a Supermicro X10SDV board, albeit with a single PCIe card, bifurcated x4x4x4x4 and running at PCIe gen 3.0 speeds. Nope, ZFS and the WD drives still broke. Samsung drives were A-OK. That's an Intel board with a completely different IOMMU, AER, and ACS structure. So the final conclusion here, after all that testing, is that ZFS pools, definitely when the pool uses a parity structure (RAIDz.*), are not compatible with current-generation WD M.2 NVMe drives. The OP's hypothesis of burst writes might be the culprit.

FINALITY: With ZFS and mdadm cooked, I switched to RAID 5 BTRFS. Not a single problem. rsync transfer rates are 600MB/s from a pure SATA ZFS array of 24 x 2TB M.2 drives. That's less than what rsync reported on the ZFS pool, but it's also realistic. SMART load tests show 6000MB/s, on par for these drives. Nothing special, no unique flags for the BTRFS RAID 5 array, I don't even use commit=120. But I can copy hundreds of TBs back and forth without a single problem. So here's what I know to be true:
5. Use BTRFS with these drives. The write-hole problem was fixed.
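For readers wanting to reproduce that layout, a minimal sketch of what a "nothing special" BTRFS RAID 5 setup looks like (device names and mount point are placeholders; this is an illustration rather than the poster's exact commands):

```sh
# Data as raid5, metadata left at the multi-device default (raid1), then a scrub.
mkfs.btrfs -L nvmepool -d raid5 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
mkdir -p /mnt/nvmepool && mount /dev/nvme0n1 /mnt/nvmepool
btrfs scrub start -B /mnt/nvmepool    # -B: stay in the foreground and print the result
```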
-
Did you all try the latest ZFS versions, 2.2.8 or 2.3.3? Maybe you are hit by:
fixed by #16687
-
Dear all, we have several issues like those in your description, but not with WD, rather with Samsung SSD 990 Pro. Our setup: QNAP Storage TS-h1090FU (Firmware: JS06716L The setup is one ZFS RAID-6 pool with activated dedup. For the first 4 months we didn't have any problems, but then the horror began. Every week, a few minutes after the scrubbing began, a disk was disconnected due to a timeout. Not always the same disk: sometimes Disk 1, then Disk 10 and then Disk 2. If we stop the scheduled scrubbing, the storage runs over two weeks without any issues. SMART info of all disks is OK. We created a QNAP ticket and the storage was replaced by a new one in 04/2025 (QNAP thought it was the backplane). Same QNAP firmware, same NVMe with the same adapter (QDA-U2MP). The new storage ran ~5 months (like the old one) without any problems, and yesterday one disk was disconnected a few minutes after scrubbing was started, with a timeout. The temperature of the disks during scrubbing is ~30°C (all disks are cooled by the QDA-U2MP adapter and we are using air conditioning). ZFS parameters:
Here's the log: I think it's a combination of ZFS + the disks. I don't think it's the storage, because it was already replaced. It seems that the scrubbing creates so much IO that the disk gets disconnected. Does anybody have an idea how to solve it? Best regards, Rainer
-
Wanted to document that WD Red SN700 NVMe drives randomly drop out of the array (mdadm RAID-6 with BTRFS on top of it) on an Asustor Gen1 NVMe NAS. Western Digital/SanDisk provides a generic answer, and Asustor noted they do see other customers with this issue but do not have any type of fix or workaround for it currently. The workaround on my side is automating shutting down and waking up (WOL) the NAS, and it will rebuild itself automatically. This can be frustrating though, as it adds a lot of latency during the rebuilds.
[482320.067256] nvme nvme1: I/O 11 QID 0 timeout, reset controller
-
Hi guys, some tests with mdadm RAID 5 (BTRFS format) give me a faulty state under load too (only one drive)! I think we may have the following problem fields: ASPM is a big pain for me so far. Disabling it via GRUB doesn't work (besides causing long boot times), and disabling ASPM via UEFI doesn't fix the problem either. The SMART conditions look good and the firmware is at its newest on all the NVMes. I could hand out some specific logs too if needed.
-
My desktop has an NVMe drive that was randomly locking up my system with IO errors. I disabled Autonomous Power State Transitions (APST) and the problems went away. Here's how I (non-persistently) disabled it:
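The command itself was not captured above; one common way to disable APST at runtime (an assumption, not necessarily what this poster used) is to clear NVMe feature 0x0c with nvme-cli:

```sh
# Disable Autonomous Power State Transitions on this controller until the next reset.
# /dev/nvme0 is an example device; the change is not persistent across reboots.
sudo nvme set-feature /dev/nvme0 --feature-id=0x0c --value=0
sudo nvme get-feature /dev/nvme0 --feature-id=0x0c -H   # verify it now reads as disabled
```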
-
You can find some more details with links in the Arch wiki: https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Allow_drive_to_enter_low-power_states_(APST)
-
I checked how APST is set on the 4TiB WD SN700s within an Asustor NVMe NAS (12-bay/Gen1); this is off already :( (checked roughly as sketched below). Question for those that switched from ZFS/BTRFS (CoW) to EXT4 or XFS, for example: do the drive lockups persist?
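For anyone else wanting to verify the same thing, a sketch of how the APST state can be read (the device name is an example):

```sh
# Feature 0x0c is APST; -H makes nvme-cli decode whether it is enabled.
sudo nvme get-feature /dev/nvme0 --feature-id=0x0c -H
# The kernel-side knob that governs APST behaviour:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```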
-
Something that I found interesting since moving to mdadm with RAID-1:
Dec 21 19:37:01 box1 mdadm[1093]: mdadm: DeviceDisappeared event detected on md device /dev/md/md0
This appears to be a device changing power state; it was suggested to boot with the following options (see the sketch below). Now waiting to see if this issue recurs.
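The suggested boot options did not survive the copy above; assuming they are the power-management flags discussed elsewhere in this thread, the change would look roughly like this in /etc/default/grub:

```sh
# Assumed reconstruction, not the poster's verbatim suggestion: append the
# power-management flags discussed in this thread to the existing cmdline.
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
# then regenerate grub.cfg (e.g. update-grub on Debian/Ubuntu) and reboot
```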
-
Since migrating my NAS volume from BTRFS to EXT4 a couple of days ago, I just had a random drive go offline (Dec 24, 2025) on the Asustor Flashstor Gen1, so with regard to the filesystem, BTRFS vs. EXT4, for these WD SN700 drives it did not make any difference. Does anyone on this thread have WD SN700 drives working in a stable configuration? On EXT4:
-
Researching this further, I came across a very interesting suggestion for a workaround: if these WD SN700 drives have aggressive power management (5 seconds) and do not respect the kernel options, I am going to test writing 1 byte every second to a file on the NFS share (NAS array) via a Linux systemd unit and see if this issue recurs. I have seen this issue occur almost ALWAYS when the system (Asustor Flashstor) is idle. Right now my "workaround" is automated scripting/tooling to shut down the machine and wake it up via a WOL packet, but the rebuild time takes forever on these systems and there is very bad latency during the rebuild. If there is an acceptable workaround that makes these drives usable, that would be the best-case scenario. This may be wishful thinking, but I am going to test the following next: Create a script, e.g., /usr/local/bin/nvme_heartbeat.sh:
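The script body did not make it into the comment; a minimal sketch of the behaviour described (one byte per second to a file on the array; the target path is a placeholder):

```sh
#!/bin/sh
# /usr/local/bin/nvme_heartbeat.sh (sketch): append one byte per second to a
# file on the NFS-backed array so the drives never go fully idle.
FILE="${1:-/mnt/nas/.nvme_heartbeat}"    # placeholder path, adjust to the real mount
while :; do
    printf '.' >> "$FILE"
    sync "$FILE"    # flush it to the array instead of leaving it in the page cache
    sleep 1
done
```

Wrapped in a simple systemd service (ExecStart pointing at the script, Restart=always), it can run unattended and be stopped again if it turns out not to help.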
-
Update to my 8/15/25 post.

BTRFS - Meh. It's pseudo-functional. I've had the drive drops and failed-drive problems just as with ZFS, but not nearly at the same volume/frequency. Weekly scrubbing and re-balancing helps greatly. Thrice in the last six months I've had a drive disconnect from the array, and one of those times BTRFS still reported a healthy array. Nothing in dmesg other than a missing drive during a weekly scrub. In two cases the array "healed" automatically. In the third, I lost data, though the array still continued to operate in RO mode. This was a slight improvement over ZFS, with which even after losing a single drive in a RAIDZ1 array, "recoverability" was troublesome. The good news is that BTRFS's historical performance issues seem fixed; the bad news is that it's more of the same for NVMe drives in a parity topology. For all block-level storage systems, I've had the best experiences with BTRFS (also using it as a file system), but this isn't ready for production.

Kernel flags - pcie_aspm=off. This doesn't "turn ASPM off". Instead it just tells the kernel not to enforce an ASPM policy on compatible devices, almost always on the PCI bus. To truly disable ASPM, use the BIOS if yours supports it. But I've tried that too and nothing changes. ZFS, BTRFS, CephFS, mdadm, any block-level storage (yes, some of these are file-level too) in a parity model fails for me. Interestingly, non-parity topologies such as a two-drive ZFS mirror using Samsung 980 and 990 drives are rock solid. Or a stripe. But parity is the problem for me.

pcie_port_pm=off - this tells the kernel to disable power management at the port level for each PCIe port. Essentially this forces PCI bus devices to stay in an active power state because the kernel will not honor state changes. I tested on kernels 6.8-6.17, but this never had an effect on the NVMe drives.

nvme_core.default_ps_max_latency_us=0 - this reduces the maximum power-state transition time acceptable to the kernel to 0, thus preventing any NVMe device from transitioning to a low-power state. It's another insular kernel parameter like pcie_port_pm=off. Sadly, no changes on WD SN850X drives. I also tried disabling PCIe AER, but that's needed for advanced IOMMU operations and ASPM functionality. My dmesg is clogged with corrected PCIe errors, usually TLP and DLLP errors, most of which come from these WD NVMe drives.

Da Bus - I've even tried downclocking the WD drives to PCIe 3.0. I figured maybe my system is "too fast" for these drives in such a topology. Well, after switching to a RHEL distro just to be able to use a compiled version of setpci that would reduce the PCIe version to 3.0, nothing changed. I briefly dabbled in downclocking the bus itself and rate-limiting the data transfer rate, but these changes would have affected all the bus devices, not just the NVMes. And if this is really a solution, I'd rather just buy different drives or beg WD for a firmware update.

What works - Non-parity topologies. Right now I've got a simple 12-drive RAID0 stripe on WD SN850X drives. I know... yuck. But I back up in triplicate and, though expensive, it's the most stable I've found. ZFS complains about a multi-drive stripe so I use mdadm for assembly at the block level. Though this IS technically possible with ZFS, and I HAVE made it work. Other options would be a JBOD LVM physical volume, then building an LV on top of that and placing a file system on the LV. That's worked fine for me as well (so far).

The best/fastest combo seems to be a simple multi-drive mdadm stripe, then creating a partition and formatting it as EXT4. A LUKS-encrypted partition worked fine, as well as file-overlay systems such as eCryptfs. The good news is that such a stripe on WD 850X drives is FAST... SUPER FAST. The bad news is that there is no parity, no drive redundancy, and I'm forced to have a second array for backups.
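For reference, the non-parity layout described above boils down to something like this (a sketch: device names, drive count, and mount point are placeholders, and the md device is formatted directly here rather than partitioned first):

```sh
# 12-drive RAID0 stripe via mdadm, then EXT4 on top. No redundancy at all:
# one failed drive loses the whole array, hence the triplicate backups.
mdadm --create /dev/md0 --level=0 --raid-devices=12 /dev/nvme{0..11}n1
mkfs.ext4 -L fastpool /dev/md0
mkdir -p /mnt/fastpool && mount /dev/md0 /mnt/fastpool
```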
-
tl;dr: faulty drives are faulty. Replace them with ones that don't have controller firmware with showstopping bugs.
-
Anyone try:
-
Originally started as a bug report, but after investigations and comments it is definitely more a hardware issue related to ZFS than a ZFS bug, so I open a general discussion here; feel free to put constructive observations/ideas/workarounds/suggestions.
TL;DR: Some NVMe sticks just crash with ZFS, probably because they are unable to sustain I/O bursts. It is not clear why this happens; the controller might just crash, or a combination of firmware/BIOS/hardware makes it unstable/crash when used in a ZFS pool.
Hardware
Issue observed
My system zpool is composed of a single RAID-Z1 VDEV made of 3x WD Black SN770 2TB, themselves configured with 4K logical sectors (I did not test with 512b sectors to see if the issue still happens... yet). The VDEV uses LZ4 compression and is not encrypted, nor are the underlying modules (they do not support that); standard 128K stripes are used. No L2ARC cache is used. The system has plenty of free RAM, so no RAM pressure.
Under "normal" daily usage I did not experience anything: the zpool is regularly scrubbed and there is nothing to report. No checksum errors, no frozen tasks, no crashes, nothing; the pool completes all scrubbings wonderfully well. The machine also experiences no freezes or kernel crashes/"oopses", and no stuck tasks (I had reported an issue with auditd here a couple of weeks ago, but that one is now inactive, see bug #14697). Even "emerging" big stuff like dev-qt/qtwebengine with 32 CMake jobs in parallel, or re-emerging the whole system from scratch with 32 parallel tasks with heavy packages rebuilt at the same time, succeeds. No crashes.
However, if I use `zfs send` to make a backup of the system datasets to a local TrueNAS box over a 10GbE link, this is another story: most of the time one of the NVMe modules randomly crashes. The issue also happens at different times in the data transfer: sometimes it appears after 12 GB, sometimes after 78 GB, sometimes after 93 GB and so on. If I am lucky, the operation sometimes completes successfully (less than a quarter of the time). Itchy and annoying. I have also managed to reproduce it by rsync-ing a dataset onto an empty new one in the same pool, although this happens more rarely. The TrueNAS box and network are out of concern as they run smoothly, and I can reproduce the issue locally by sending the ZFS stream to /dev/null (`zfs send .... | cat > /dev/null`).

When the crash happens, the following trace appears in the kernel logs:
At this point, if I am lucky enough, I can manage to bring it back to life using a sledgehammer:
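The sledgehammer itself is not reproduced here; one sequence matching that description (an assumption, not necessarily the exact commands used) is to force a PCI remove/rescan of the dead controller:

```sh
# Assumed reconstruction: drop the faulted controller from the PCI bus and
# re-enumerate it. 0000:01:00.0 is a placeholder for the dead NVMe's address.
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
sleep 2
echo 1 > /sys/bus/pci/rescan
```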
If the faulted device reappears, the zpool becomes ONLINE again and completes its resilvering (a couple of KB or MB). In the worst case, another NVMe also drops off the pool, which becomes suspended, so I have to power-cycle the machine or push its reset button. Of course, doing a `nvme list` at this point either completely freezes or lists the two remaining NVMe modules, depending on what is alive.

My best guess so far is that the Western Digital SN 770 module's controller is not beefy enough to handle a burst of I/O requests (knowing they have no DRAM cache), so it is put on its knees and becomes so unresponsive that it is unable to complete a reset request on its own (no AER reported in the logs, BTW). As it is not always the same module that crashes, they do not all seem to be defective, or I am extremely unlucky. Pool scrubbing might be a bit lighter for the controller, so the scrubs/resilvers work without any issue (maximum observed speed is around 4.5~5 GB/s when scrubbing the pool according to `zpool status`).

What has been tried so far
Several things! Without any improvement, unfortunately:
- `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off` on the kernel command line;
- `zfs` kernel module parameters: lowering the values of `zfs_vdev_sync_read_min_active`, `zfs_vdev_sync_read_max_active` and their `async` counterparts (I used the same values set as defaults for `zfs_vdev_scrub_min_active` and `zfs_vdev_scrub_max_active`);
- throttling: `zfs send ... | throttle -M 300 | ...`;
- a `blkio` cgroup;
- `zfs send` from a FreeBSD live media: FreeBSD allocates a 200MB host buffer for each module, but unfortunately no more success, and a `zfs send` also hangs :/
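For reference, those ZFS tunables can be changed at runtime through /sys/module/zfs/parameters; a sketch of the kind of lowering described above (the values are illustrative, not the ones actually used):

```sh
# Illustrative only: shrink the sync/async read queue depths at runtime.
echo 2 > /sys/module/zfs/parameters/zfs_vdev_sync_read_min_active
echo 4 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
echo 1 > /sys/module/zfs/parameters/zfs_vdev_async_read_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
```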
Some thoughts / ideas of tests to try

Is there a "ZFS native" way to throttle I/O operations in the case of doing a `zfs send`?

Has anybody here experienced something like this? If so, what other brands/models are subject to a similar issue?