Update dnode_next_offset_level to accept blkid instead of offset #17792

base: master

Conversation
Converting this back to draft as I've been staring at the offset calculations for the new version and found an oddity in 8a8970e. That commit works for the case where a match occurs, but it returns a higher than expected offset in the non-matching case when 1) the starting offset points at an indirect hole and 2) the effect of

The code prior to the commit above was leaving the offset unchanged when searching up the tree when

Meanwhile all callers of

TL;DR I'm going to study this a bit more before proposing the final form of this PR. I think the blkid + index means the next/previous behavior of
Currently this function uses L0 offsets which:

1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction.

I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases?

While here, tidy up dnode_next_offset_level comments.

Signed-off-by: Robert Evans <[email protected]>
After much staring, this is ready for review. See the top comment for the full analysis. TL;DR: iterating by (blkid, index) is clearer and simpler, and it also helps uncover and address rough edges around offset handling. PTAL @behlendorf when you get a chance; thanks in advance.
```c
	 */
	index = BF64_GET(blkid, 0, epbs) +
	    ((flags & DNODE_FIND_BACKWARDS) ? -1 : 1);
	blkid = blkid >> epbs;
```
As I understand it, when searching backwards, once it reaches blkid == 0 this will start climbing levels until lvl hits maxlvl. The previous code exited earlier once dnode_next_block() saw DNODE_FIND_BACKWARDS and blkid == 0.
Indeed. After this PR the search always ends at maxlvl (error == ESRCH) or minlvl (error == 0).

The previous code had to break to prevent a loop in all the cases where *offset ends up the same at the higher level. Now that's avoided directly, and the loop conditions are simpler.
I was just thinking about unneeded indirect accesses. I don't think it is a problem, just wondering about performance.
Performance should not be worse than before #16025, when this also always stopped at maxlvl for ESRCH.

That said, it's straightforward to add this in dnode_next_offset just after calling dnode_next_offset_level, should we want to keep the early-out behavior for the `offset < 0` or `offset >= 2^64` conditions:
```c
if (lvl > 0) {
	int span = (lvl - 1) * epbs + dn->dn_datablkshift;
	int maxnblkbits = span < 8 * sizeof (*offset) ?
	    8 * sizeof (*offset) - span : 0;
	if ((blkid == 0 && index < 0) ||
	    (((blkid << epbs) + index) >> maxnblkbits) != 0) {
		/* Search went beyond max (or min) offset. */
		ASSERT3S(error, !=, 0);
		break;
	}
}
```

It does seem like a nice addition, as dmu_free_long_range ends up hitting the `< 0` case on every file deletion.
Otherwise it only benefits forward searches at the end of giant sparse files. Alternatively, this check could also be limited to blkids covering dn_maxblkid if we want to break early for normal-sized files too?
Edit: Fix math in code above (Edit: again)
I am not sure MAX(0, 8 * sizeof (*offset) - span) will work, since sizeof() is unsigned.
Nicely spotted. Updated the above to

```c
int maxnblkbits = span < 8 * sizeof (*offset) ?
    8 * sizeof (*offset) - span : 0;
```

LMK if you think we should include that logic in this PR to keep the break-beyond-limit behavior vs. visiting all the indirects?
While it's not required for backwards searches, it is also pretty trivial for whatever performance gain we may get. I'd prefer it in, probably.

For forward searches it is more complicated, while unlikely to ever happen. I'd feel better if we had some reasonable behavior in the case of a 2^64 file size, if that is even specified somehow, but not as a "performance" optimization. I don't insist on it here.
Looking at this more, I think it's feasible to stop at dn->dn_maxblkid since dn_struct_rwlock is held here? That seems like it would allow forward searches to stop without visiting maxlvl in many cases.

Edit: Unfortunately this doesn't work. dmu_free_long_range sets dn->dn_maxblkid = 0, so sync context calling dnode_next_offset subsequently fails if open context frees an entire object.
If it did work, the test would be simply:

```c
if (lvl > 0 && ((blkid == 0 && index < 0) ||
    (blkid << epbs) + index >
    (dn->dn_maxblkid >> (lvl - 1) * epbs))) {
	ASSERT3S(error, !=, 0);
	break;
}
```

Meanwhile that doesn't deal with the 2^64 offset limit, but maybe an ASSERT or maybe even a VERIFY is good enough for now? ZFS does not deal in objects above that size, and many of the callers of this function would fail badly if processing an oversize object.
Ideally this function would be refactored to return blkids. Then at least some of the overflow problems go away, since it becomes well-defined to, for example, iterate over the L1 blocks that would cover offsets beyond 2^64.
> Meanwhile that doesn't deal with the 2^64 offset limit, but maybe an ASSERT or maybe even a VERIFY is good enough for now?

I don't insist on it happening now, if it was broken forever and you don't have a good solution.
Currently `dnode_next_offset_level` uses L0 offsets as input and output, which:

- is hard to read since it maps offsets to blkid and back each call
- necessitates `dnode_next_block` to handle edge cases at limits
- makes it hard to tell if the traversal can loop infinitely

This PR updates `dnode_next_offset` to use `lvl`, `blkid`, and `index` as the iteration position. Together these three variables point uniquely to an iteration position in some block of an object: `lvl` and `blkid` point to a block in the object, and `index` points to some dnode/BP within it (if `0 <= index < N`), one past the end (`index == N`), or one before the beginning (`index == -1`). Unlike offsets, these make the blkid manipulations explicit and make it clear that the traversal always terminates, since `blkid` moves in one direction.

After this, `dnode_next_offset_level` only uses `offset` as an output, to return the resulting offset to the caller of `dnode_next_offset`.

To search upwards, instead of `dnode_next_block`, the lvl+1 `index` is set to the low bits of the `blkid` plus one, to point to the position of the current block's pointer sibling -- or one past the end if it was the last child of that block (and similarly minus one for backwards search).

This PR has three minor effects beyond refactoring:
1. Upwards search no longer quits as soon as the L0 offset is < 0 or ≥ 2^64.

This is no longer needed since `blkid` and `index` can correctly represent positions outside the normal range of offsets, and removing this condition simplifies the iteration. When such a condition occurs, the search will proceed up to `maxlvl` and terminate with `ESRCH`. There is no effect on the search outcome, since objects cannot have offsets ≥ 2^64.
2. Upwards search no longer spills into the parent's sibling when searching the last (or first) child block.

This is because `index` can point at one past the end (or beginning).

Consider searching a block tree with `nlevels == 3`, `datablkshift == 12`, and `indblkshift == 17`, where `dnode_next_offset_level` returns with `*offset == 0x100000000`:

- Before this PR, the search proceeds at L2 block 1 from offset 0.
- After this PR, the search proceeds at L2 block 0 at index 1024 (one past its end).

This difference doesn't change what is found, but it does eliminate the work to load and search L2 block 1 if it was never going to match; instead the cached L3 block will point to the correct next block. This matters less for hole search (no I/O), but the extra steps are wasteful and unnecessary.
3. For `ESRCH`, this restores the logic to return the same `*offset` as before backtracking.

For `error == 0` and most `ESRCH` cases, the offset is the same as before #16025 (dnode_next_offset: backtrack if lower level does not match). But for the `error == ESRCH` case, the result is different for exactly the case above, when all subsequent indirect blocks are holes.

Before, the search would continue from offset `0x100000000`: `dnode_hold_impl` returns ENOENT, and the offset remains `0x100000000`.

After, the search again continues from offset `0x100000000`, but `dnode_next_block` updates the offset to `0x200000000`.

The result differs since `dnode_next_block` unconditionally adds 1 at each level searching up the tree, while before it was only changed if an indirect block was scanned.
ZFS_IOC_NEXT_OBJ.1<<45 == 35184372088832.35149978763231== 0b111111111011111111101111111110111111111011111dnode_next_blockanddmu_object_nextadd:0b1000000000100000000010000000001000012<<45 == 703687441776641<<45.The return value from
dnode_next_offsetonESRCHdoes not appear to be used except for:virtual holecase (which should be unaffected since it deals only in populated blocks)ZFS_IOC_NEXT_OBJwhich returns the value to userspaceThis PR restores the
ESRCHsemantics back to how they were. This happens naturally withindexplus one because the search will not spill into the next block during upwards traversal.Meanwhile, the value itself is underspecified and of questionable utility.
minlvlhave such offsets that would be greater than or equal 264Or for backwards search:
blkidis clamped to zero when searching backwardsNeither of these seem to be deliberately implemented; they are instead side-effects of setting *offset to the larger (or smaller) of the initial offset or the resulting offset along with the the clamp to zero behavior.
For forward search, when the blkid is too large, the shift overflows to zero which means that the initial offset is returned instead.
Luckily, the result is never used for backwards search. This PR maintains the same semantics to minimize change.
Future ideas:

- Rework the `ESRCH` result so that the initial `*offset` is returned instead.
- A `dnode_next_offset` variant that returns blkids natively. Many callers want to iterate over blocks but have to deal with L0 offsets.

Motivation and Context
Code cleanup, readability, and minor changes to edge cases.
Description
Refactored to iterate by blkid instead of offsets.
See above for details of minor changes to edge cases.
How Has This Been Tested?
ztest, ZTS, llseek stressor