What's going on with the multitude of issues resulting in kernel panics? #17811

BRUrban · 2025-10-02T16:44:53Z

BRUrban
Oct 2, 2025

I started encountering one, and when I went to begin investigating, I see that there appear to be a multitude of different issues which all result in a very similar kernel panic, many of which seem to stem from "corruption" of pools that don't actually have any corruption in them, with solutions ranging from "I wrote a script to make it send a warning instead and now it works fine" to "I had to buy an additional set of drives to transfer to and then back and then everything worked fine".

Why are so many distinct issues resulting in kernel panics, especially ones which seem to be caused by "false positives" with completely healthy pools and hardware? Is there a larger look being taken into what's going on there beyond handling the individual issues, and if so is there any roadmap or ETA for it?

Answered by IvanVolosyuk

amotin · 2025-10-02T18:45:09Z

amotin
Oct 2, 2025
Collaborator

To a general question there can only be a general answer. Who said that pools and hardware were healthy? Who said all the issues have the same cause?

5 replies

BRUrban Oct 2, 2025
Author

It's clear that there is a wide range of open issues which are resulting in kernel panics, which is very concerning. That is what my question is about. I did not say all hardware or pools are healthy, however reading through the numerous issues, many of the ones with more than one reply have people reporting that they tested their drives or pools in other machines, or migrated and then migrated back, without any further issue. My concern is that if there's a focus on fixing individual issues without a wider look as to what's causing so many kernel panics, that whatever is resulting in a multitude of issues coming to a kernel panic will not be addressed.
Also, as a note I specifically was saying the issues had different causes and were resulting in the kernel panic, as opposed to being the same cause. I'm posting this in the spirit of collaboration, I'd hope to get responses in good faith.

amotin Oct 2, 2025
Collaborator

And what response do you expect to get for a multitude of random issues coming to a kernel panic? Yes, we care. Yes, we try to look into individual ones to find something actionable or some patterns. No, I don't understand how anybody can "focus" on a multitude of unrelated issues. Since this is a community project, you are welcome to focus on it -- please help by analyzing the issues, fixing them or sponsoring somebody to fix them.

BRUrban Oct 2, 2025
Author

I suppose I was hoping to get a response that indicated an awareness of the overarching issue and an understanding or planned approach being taken to address it. And no thanks, I find your responses off-putting and overly defensive in ways that remind me of jobs I've hated. I'll use a different solution and you can not have to read more of my questions 👍

amotin Oct 2, 2025
Collaborator

Awareness of what? That kernel panics happen for different reasons, including plenty of desktop-grade systems with non-ECC RAM? Sure.

BRUrban Oct 2, 2025
Author

Yes! That would have been fine! If you'd replied with something like "A lot of them are due to desktop-grade systems and memory and there isn't much we can do about it" that would have been an answer to what I was asking and allowed me to follow up if I had further questions, or have a place to begin taking action from instead of driving me away from the project.
There are close to 200 open issues on this piece of software about different problems resulting in kernel panics, with no noticeable documentation or notes calling out anything about them beyond the scope of any given individual issue/bug report.

robn · 2025-10-03T01:10:34Z

robn
Oct 3, 2025
Collaborator

> There are close to 200 open issues on this piece of software about different problems resulting in kernel panics, with no noticeable documentation or notes calling out anything about them beyond the scope of any given individual issue/bug report.

Because no one has done the triage work. They need someone to go through them, figure out if they're the same or different things, categorise them, etc. And if they're for an older version, re-check if they're still an issue.

> My concern is that if there's a focus on fixing individual issues without a wider look as to what's causing so many kernel panics, that whatever is resulting in a multitude of issues coming to a kernel panic will not be addressed.

It's worth noting that a lot of what people report as "panic" is actually the Linux kernel "hung task" timer noting that a task has not made progress in a while, and showing a stack trace. That is almost always the txg_sync task, which is a central coordinating task inside OpenZFS that waits for the other subsystems to complete their work before writing a transaction to disk. Because of this, different and unrelated issues can often look similar, because they all cause the same "uhoh, generic error" light to flash.
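As a rough illustration of that distinction, a log excerpt can be sorted into "hung task warning" versus "actual panic" by the first recognisable marker it contains. This is only a minimal Python sketch, not an OpenZFS tool: the function name `classify_report` and the category labels are hypothetical, and the patterns are assumptions based on typical Linux hung-task and panic messages and on common ZFS assertion prefixes. Real logs vary, so treat it as a starting point for triage rather than a reliable detector.

```python
import re

# Assumed message shapes; adjust to what your kernel actually prints.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
KERNEL_PANIC = re.compile(r"Kernel panic - not syncing")
# Assumption: assertion failures show up as VERIFY*/ASSERT* calls or "PANIC at".
ASSERTION = re.compile(r"(?:VERIFY|ASSERT)\w*\(|PANIC at ")


def classify_report(log_text: str) -> str:
    """Return a coarse, hypothetical category for a pasted kernel log."""
    if KERNEL_PANIC.search(log_text):
        return "kernel-panic"
    if ASSERTION.search(log_text):
        return "assertion"
    match = HUNG_TASK.search(log_text)
    if match:
        # Name the stalled task, e.g. "hung-task:txg_sync".
        return "hung-task:" + match.group("name")
    return "unknown"


if __name__ == "__main__":
    sample = "INFO: task txg_sync:1234 blocked for more than 120 seconds."
    print(classify_report(sample))
```

A report classified as `hung-task:txg_sync` says only that something stalled; the stack traces of the other blocked tasks are usually where the real cause has to be looked for.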

Also, I know you didn't say that nothing happens, but it is easy to look at the open issues and think that they just accumulate forever, without also looking at the closed issues. As an example, consider recent PR #17658, and the long list of old issues that it closed. Stuff does happen, as time and interest permit.

I'll also note that some of the biggest successes in finding and fixing often-reported issues have come after many people got together and worked through all the information available to arrive at a fix or a reliable reproducer. Some of my recent favourites are #15526, #12014 and #15646.

@amotin is right to say that this is a community project; it's great having reports, but if there's no one doing the work to collect them and understand, then they'll sit in the tracker until someone gets around to it.

2 replies

BRUrban Oct 3, 2025
Author

> @amotin is right to say that this is a community project; it's great having reports, but if there's no one doing the work to collect them and understand, then they'll sit in the tracker until someone gets around to it.

This would hold a lot more weight if it hadn't been delivered as one of multiple snarky replies dismissing my question as worthless.
I'm absolutely not trying to take away from any of the successful fixes which have occurred, or imply that no progress is being made (which is a reason why I didn't say anything about either of those). I'm aware this is a community project; that's why I asked a question to try to better understand the context of what was going on when I encountered something which puzzled and concerned me, and why I found it off-putting to be met with immediate push-back and defensiveness in response.

I was not making up issues here, I'm not trying to troll. An overwhelming amount of the open issues on the project are presenting as very similar and severe errors which do not seem to match the severity of the various causes being reported. I wanted to understand before getting involved. I protest the attempt to spin this back as a failure or an unwillingness to contribute on my part because I see giant flashing warning signs from those responses and am making the decision to turn away. If you want to push for community contributions, don't drive people away.

robn Oct 3, 2025
Collaborator

> I found it off-putting to be met with immediate push-back and defensiveness in response.

For whatever it's worth, I think that's not unreasonable.

Honestly, I think in this case this is just "communication on the internet is hard". Based on my own experience in this project (~4yrs), I'd guess that there are at least two things going on in this thread.

One is that it is very common in OpenZFS (if not in open source generally) for people to turn up out of nowhere and criticise us for not taking user reports seriously (and much worse), which can lead to a knee-jerk assumption that a question like yours is not being asked in good faith. This was my gut feel too, if I'm being very honest. I recognise that is not fair to you, and for myself, I try to take a breath before engaging (which is why I hadn't initially responded at all).

The other thing is that there is often a difference in native language and expression that can make it difficult to convey or understand intent (to be almost insultingly reductive: many people from eastern Europe often come across as brusque to many native English speakers).

I stress I'm not trying to defend my colleague out of hand, nor put words in his mouth (maybe today he is being a jerk!), and certainly not claim that a defensive or hostile posture is a good default. Mostly, I'm (probably hamfistedly) trying to acknowledge that, regardless of intent, you did feel put off, and that sucks, and here is at least a plausible explanation. My hope is that you'll (perhaps grudgingly) nod and we can try again.

So if I may return to the question, I think it's been answered, at least on the surface: the various "kernel panic"-shaped issues aren't necessarily related, but some might be, and need to be studied further to determine that. There's no one (that I'm aware of) doing that in a systematic way though; different people drop in as they have time and interest. There aren't any other "official" notes or plans on anything other than what is written in the tickets, and I guess what individuals may have in their heads.

For everyone reading, we could really use help with triage. The hardest and most time-consuming part of bug hunting is pulling all the information together, asking the questions that fill out the gaps in understanding, and getting something that can reproduce the problem reliably. Once we can reproduce an issue, the fix is usually a short hop away. It's something you can help with without deep knowledge of the OpenZFS internals. I'd personally be very willing to assist on the code side; I just need someone to take point.

IvanVolosyuk · 2025-10-03T16:34:26Z

IvanVolosyuk
Oct 3, 2025

I think the confusion here is about pools that can be imported read-only but cause a crash on a writable import. OpenZFS code is complex, with many layers and lots of internal consistency checks. For a simpler filesystem like ext4, fsck can detect inconsistencies and handle them: creating lost files, untangling overlapping ones, etc. In ZFS it is hard to do that because of the many layers, the complexity of the data structures, and the number of ways things can go wrong. If something is not consistent, say overlapping files, any further writes can do a lot of damage, so consistency checks trigger to protect what is left of the pool. Some of the checks can be disabled, at the risk of spreading the damage; others can be avoided by importing the pool in read-only mode.
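For readers unfamiliar with the read-only import mentioned above, here is a minimal Python sketch of how it might be scripted. It assumes the standard `zpool import -o readonly=on <pool>` form of the command; the helper names `readonly_import_command` and `import_readonly` and the pool name `tank` are hypothetical, and the default is a dry run so that nothing touches a pool unless explicitly requested.

```python
import subprocess


def readonly_import_command(pool: str) -> list[str]:
    """Build the argument list for a read-only pool import."""
    return ["zpool", "import", "-o", "readonly=on", pool]


def import_readonly(pool: str, dry_run: bool = True) -> list[str]:
    """Return the command, and run it only when dry_run is False.

    Running it needs root privileges and an exported, importable pool.
    """
    cmd = readonly_import_command(pool)
    if not dry_run:
        # check=True raises CalledProcessError if zpool exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "tank" is a placeholder pool name, not one from this thread.
    print(" ".join(import_readonly("tank")))
```

A read-only import does not repair anything; it only avoids the write paths, which gives a chance to copy data off before deciding what to do with the pool.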

Ideally, inconsistencies should be handled gracefully, but proper error handling usually requires developer time and X times more code, and may still be ambiguous: "should we ignore the failed checksum, or find a block with that checksum somewhere else?"

I would imagine that the two main sources of the inconsistencies are bugs and memory corruption. The latter can cause fairly random checks to blow up unexpectedly, or cause hidden damage to a pool before it gets detected. Bugs have a somewhat similar effect, but are probably more repeatable. I would imagine it to be quite taxing to try to reason about an impossible failed precondition caused by a bitflip in RAM. Thus, repeatable bugs with a reproducer can be far easier to fix; the rest are waiting for their chance to be understood, or may already have been fixed by fixing more reproducible ones.

1 reply

BRUrban Oct 8, 2025
Author

I really appreciate this answer, and I think it's likely correct. The one thing I would call out is that while I agree error handling requires developer time and is not guaranteed accurate, that doesn't make it less important or required for usable software. Error handling isn't fun logic or new exciting features but it is incredibly important.
