Update dnode_next_offset_level to accept blkid instead of offset #17792

base: master

Conversation
Converting this back to draft as I've been staring at the offset calculations for the new version and found an oddity in 8a8970e. That commit works for the case where a match occurs, but it returns a higher than expected offset in the non-matching case when 1) the starting offset points at an indirect hole and 2) the effect of

The code prior to the commit above was leaving the offset unchanged when searching up the tree when

Meanwhile all callers of

TL;DR I'm going to study this a bit more before proposing the final form of this PR. I think the blkid + index means the next/previous behavior of
Currently this function uses L0 offsets which:

1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction.

I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases?

While here, tidy up dnode_next_offset_level comments.

Signed-off-by: Robert Evans <[email protected]>
After much staring, this is ready for review. See the top comment for the full analysis. TL;DR: iterating by (blkid, index) is clearer and simpler, and it also helps uncover and address rough edges around offset handling. PTAL @behlendorf when you get a chance; thanks in advance.
```c
	 */
	index = BF64_GET(blkid, 0, epbs) +
	    ((flags & DNODE_FIND_BACKWARDS) ? -1 : 1);
	blkid = blkid >> epbs;
```
As I understand it, when searching backwards, once it reaches blkid == 0 this will start climbing levels until lvl hits maxlvl. The previous code exited earlier once dnode_next_block() saw DNODE_FIND_BACKWARDS and blkid == 0.
Indeed. After this PR the search always ends at maxlvl (error == ESRCH) or minlvl (error == 0).

The previous code had to break to prevent a loop in all the cases where *offset ends up the same at the higher level. Now that's avoided directly, and the loop conditions are simpler.
I was just thinking about unneeded indirect accesses. I don't think it is a problem, just wondering about performance.
Performance should not be worse than before #16025, when this also always stopped at maxlvl for ESRCH.

That said, it's straightforward to add this in dnode_next_offset just after calling dnode_next_offset_level, should we want to keep the early-out behavior for the `offset < 0` or `offset >= 2^64` conditions:
```c
if (lvl > 0) {
	int span = (lvl - 1) * epbs + dn->dn_datablkshift;
	int maxnblkbits = span < 8 * sizeof (*offset) ?
	    8 * sizeof (*offset) - span : 0;
	if ((blkid == 0 && index < 0) ||
	    (((blkid << epbs) + index) >> maxnblkbits) != 0) {
		/* Search went beyond max (or min) offset. */
		ASSERT3S(error, !=, 0);
		break;
	}
}
```

It does seem like a nice addition, as dmu_free_long_range ends up hitting the `< 0` case on every file deletion.
Otherwise it only benefits forward searches at the end of giant sparse files. Alternatively, this check could also be limited to blkids covering dn_maxblkid if we want to break early for normal-sized files too?
Edit: Fix math in code above (Edit: again)
I am not sure MAX(0, 8 * sizeof (*offset) - span) will work, since sizeof() is unsigned.
Nicely spotted. Updated the above to

```c
int maxnblkbits = span < 8 * sizeof (*offset) ?
    8 * sizeof (*offset) - span : 0;
```

LMK if you think we should include that logic in this PR to keep the break-beyond-limit behavior vs. visiting all the indirects?
While it's not required for backwards searches, it is also pretty trivial for whatever performance gain we may get. I'd prefer it in, probably.

For forward searches it is more complicated, while unlikely to ever happen. I'd feel better if we had some reasonable behavior in the case of a 2^64 file size, if that is even specified somehow, but not as a "performance" optimization. I don't insist on it here.
Looking at this more, I think it's feasible to stop at dn->dn_maxblkid since dn_struct_rwlock is held here? That seems like it would allow forward searches to stop without visiting maxlvl in many cases.

Edit: Unfortunately this doesn't work. dmu_free_long_range sets dn->dn_maxblkid = 0, so sync context calling dnode_next_offset subsequently fails if open context frees an entire object.
If it did work, the test would be simply:

```c
if (lvl > 0 && ((blkid == 0 && index < 0) ||
    (blkid << epbs) + index >
    (dn->dn_maxblkid >> (lvl - 1) * epbs))) {
	ASSERT3S(error, !=, 0);
	break;
}
```

Meanwhile that doesn't deal with the 2^64 offset limit, but maybe an ASSERT or maybe even a VERIFY is good enough for now? ZFS does not deal in objects above that size, and many of the callers of this function would fail badly if processing an oversize object.
Ideally this function would be refactored to return blkids. Then at least some of the overflow problems go away, since it becomes well-defined to, for example, iterate over the L1 blocks that would cover offsets beyond 2^64.
> Meanwhile that doesn't deal with the 2^64 offset limit, but maybe an ASSERT or maybe even a VERIFY is good enough for now?

I don't insist on it happening now, if it was broken forever and you don't have a good solution.
Currently `dnode_next_offset_level` uses L0 offsets as input and output, which:

- is hard to read since it maps offsets to blkid and back each call
- necessitates `dnode_next_block` to handle edge cases at limits
- makes it hard to tell if the traversal can loop infinitely

This PR updates `dnode_next_offset` to use `lvl`, `blkid`, and `index` as the iteration position. Together these three variables point uniquely to an iteration position in some block of an object: `lvl` and `blkid` point to a block in the object, and `index` points to some dnode/BP within it (if `0 <= index < N`), one past the end (`index == N`), or one before the beginning (`index == -1`). Unlike offsets, these make the blkid manipulations explicit and make it clear that the traversal always terminates, since `blkid` moves in one direction.

After this, `dnode_next_offset_level` only uses `offset` as an output, to return the resulting offset to the caller of `dnode_next_offset`.

To search upwards, instead of `dnode_next_block`, the lvl+1 `index` is set to the low bits of the `blkid` plus one, to point to the position of the current block's pointer sibling -- or one past the end if it was the last child of that block (and similarly minus one for backwards search).

This PR has three minor effects beyond refactoring:
1. Upwards search no longer quits as soon as the L0 offset is < 0 or ≥ 2^64.

This is no longer needed since `blkid` and `index` can correctly represent positions outside the normal range of offsets, and removing this condition simplifies the iteration. When such a condition occurs, the search will proceed up to `maxlvl` and terminate with `ESRCH`. There is no effect on the search outcome, since objects cannot have offsets ≥ 2^64.
2. Upwards search no longer spills into the parent's sibling when searching the last (or first) child block.

This is because `index` can point at one past the end (or beginning).

Consider searching a block tree with `nlevels == 3`, `datablkshift == 12`, and `indblkshift == 17`, where `dnode_next_offset_level` returns with `*offset == 0x100000000`:

- Before this PR, the search proceeds at L2 block 1 from offset 0.
- After this PR, the search proceeds at L2 block 0 at index 1024 (one past its end).

This difference doesn't change what is found, but it does eliminate the work to load and search L2 block 1 if it was never going to match; instead the cached L3 block will point to the correct next block. This matters less for hole search (no I/O), but the extra steps are wasteful and unnecessary.
3. For `ESRCH`, this restores the logic to return the same `*offset` as before backtracking.

For `error == 0` and most `ESRCH` cases, the offset is the same as before #16025 (dnode_next_offset: backtrack if lower level does not match). But for the `error == ESRCH` case, the result is different for exactly the case above, when all subsequent indirect blocks are holes.

Before, the search would continue from offset `0x100000000`: `dnode_hold_impl` returns ENOENT, and the offset remains `0x100000000`.

After, the search again continues from offset `0x100000000`, but `dnode_next_block` updates the offset to `0x200000000`.

The result differs since `dnode_next_block` unconditionally adds 1 at each level searching up the tree, while before it was only changed if an indirect block was scanned.
ZFS_IOC_NEXT_OBJ.1<<45 == 35184372088832.35149978763231== 0b111111111011111111101111111110111111111011111dnode_next_blockanddmu_object_nextadd:0b1000000000100000000010000000001000012<<45 == 703687441776641<<45.The return value from
dnode_next_offsetonESRCHdoes not appear to be used except for:virtual holecase (which should be unaffected since it deals only in populated blocks)ZFS_IOC_NEXT_OBJwhich returns the value to userspaceThis PR restores the
ESRCHsemantics back to how they were. This happens naturally withindexplus one because the search will not spill into the next block during upwards traversal.Meanwhile, the value itself is underspecified and of questionable utility.
minlvlhave such offsets that would be greater than or equal 264Or for backwards search:
blkidis clamped to zero when searching backwardsNeither of these seem to be deliberately implemented; they are instead side-effects of setting *offset to the larger (or smaller) of the initial offset or the resulting offset along with the the clamp to zero behavior.
For forward search, when the blkid is too large, the shift overflows to zero which means that the initial offset is returned instead.
Luckily, the result is never used for backwards search. This PR maintains the same semantics to minimize change.
Future ideas:

- Rework the `ESRCH` result so that the initial `*offset` is returned instead.
- A `dnode_next_offset` variant that returns blkids natively. Many callers want to iterate over blocks but have to deal with L0 offsets.

Motivation and Context
Code cleanup, readability, and minor changes to edge cases.
Description
Refactored to iterate by blkid instead of offsets.
See above for details of minor changes to edge cases.
How Has This Been Tested?
ztest, ZTS, llseek stressor