What's going on with the multitude of issues resulting in kernel panics? #17811
-
I recently started hitting a kernel panic, and when I began investigating I found what appear to be many different issues that all result in a very similar panic. Many of them seem to stem from "corruption" in pools that don't actually have any corruption, with solutions ranging from "I wrote a script to make it send a warning instead and now it works fine" to "I had to buy an additional set of drives to transfer to and then back and then everything worked fine". Why are so many distinct issues resulting in kernel panics, especially ones that seem to be "false positives" on completely healthy pools and hardware? Is anyone taking a broader look at what's going on, beyond handling the individual issues, and if so, is there a roadmap or ETA for it?
Replies: 3 comments 8 replies
-
A general question can only get a general answer. Who said that the pools and hardware were healthy? Who said all the issues have the same cause?
-
Because no one has done the triage work. They need someone to go through them, figure out if they're the same or different things, categorise them, etc. And if they're for an older version, re-check if they're still an issue.
It's worth noting that a lot of what people report as a "panic" is actually the Linux kernel "hung task" timer noting that a task has not made progress in a while, and showing a stack trace. That is almost always the …
Also, I know you didn't say that nothing happens, but it is easy to look at the open issues and think that they just accumulate forever, without looking at the closed issues as well. As an example, consider recent PR #17658, and the long list of old issues that it closed. Stuff does happen, as time and interest permit.
I'll also note that some of the biggest successes in finding and fixing often-reported issues have come after many people got together and worked through all the information available to arrive at a fix or a reliable reproducer. Some of my recent favourites are #15526, #12014 and #15646.
@amotin is right to say that this is a community project; it's great having reports, but if no one does the work to collect and understand them, they'll sit in the tracker until someone gets around to it.
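For anyone triaging reports, the hung-task versus panic distinction mentioned above can often be checked mechanically. The following is only a rough sketch: it assumes the report includes kernel log text containing the usual Linux wording ("INFO: task ... blocked for more than N seconds" for the hung task detector, "Kernel panic - not syncing" for a real panic). The function names and the sample log lines are made up for illustration and are not part of any OpenZFS tooling.

```python
import re

# Hung task detector warning: the kernel keeps running after printing this.
HUNG_TASK = re.compile(r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds")
# A real panic: the kernel has stopped.
KERNEL_PANIC = re.compile(r"Kernel panic - not syncing")


def classify(line):
    """Return 'panic', 'hung_task', or 'other' for one log line."""
    if KERNEL_PANIC.search(line):
        return "panic"
    if HUNG_TASK.search(line):
        return "hung_task"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"panic": 0, "hung_task": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


# Hypothetical sample lines, written for this example only.
sample = [
    "[ 1234.5] INFO: task txg_sync:1234 blocked for more than 120 seconds.",
    "[ 1234.6] Call Trace:",
    "[ 9999.0] Kernel panic - not syncing: Fatal exception",
]
print(summarize(sample))
```

A report whose log contains only hung-task lines is a stall rather than a panic, which is useful to note when categorising an issue.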
-
I think the confusion here is about pools that can be imported read-only but cause a crash on a writable import. OpenZFS code is complex, with many layers and lots of internal consistency checks. For a simpler filesystem like ext4, fsck can detect inconsistencies and handle them: creating lost files, untangling overlapping ones, etc. In ZFS that is hard to do because of the many layers, the complexity of the data structures, and the number of ways things can go wrong. If something is not consistent, say overlapping files, any further writes can do a lot of damage, so consistency checks trigger to protect what is left of the pool. Some of the checks can be disabled, at the risk of spreading the damage; others can be avoided by importing the pool in read-only mode.
Ideally inconsistencies would be handled gracefully, but proper error handling usually requires developer time and X times more code, and may still be ambiguous: "should we ignore the failed checksum, or find a block with that checksum somewhere else?"
I would imagine that the two main sources of inconsistencies are bugs and memory corruption. The latter can make quite random checks blow up unexpectedly, or cause hidden damage to a pool before it is detected. Bugs have a similar effect, but are probably more repeatable. I would imagine it is quite taxing to reason about an impossible failed precondition caused by a bit flip in RAM. Thus, repeatable bugs with a reproducer can be far easier to fix; the rest are waiting for their chance to be understood, or may already have been fixed along with the more reproducible ones.
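As a small illustration of the read-only import mentioned above, here is a sketch that only builds the command line and does not run it. It relies on the `readonly=on` import property of `zpool import`; the helper name and the pool name `tank` are placeholders invented for this example.

```python
import shlex


def readonly_import_command(pool):
    """Build the argv for a read-only import of `pool` (nothing is executed)."""
    # Reject empty names and anything that could be parsed as an option.
    if not pool or pool.startswith("-"):
        raise ValueError("pool name must be non-empty and must not start with '-'")
    return ["zpool", "import", "-o", "readonly=on", pool]


# 'tank' is a placeholder pool name.
print(shlex.join(readonly_import_command("tank")))
```

Importing this way avoids further writes to the pool, so it is a reasonable first step for copying data off before trying anything that disables checks.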