@oshogbo
Contributor

@oshogbo oshogbo commented Sep 18, 2025

Motivation and Context

When a disk goes offline and is later brought back online with zpool clear, the user may also want to trigger a scrub of the recently written data to ensure there is no corruption.

For user convenience, we provide a "recent" scrub option in zpool scrub.

Description

We use scn_max_txg and scn_min_txg, which are already part of the scrub mechanism, to implement this feature.

Since we (developers) don’t want to expose a configuration that specifies "the number of TXGs" (as discussed in #16301), we instead use a time-based definition. Users can set the time using one of three units (seconds, hours, or days) for convenience. A TXG database is used to map these time ranges to valid TXG numbers.
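As a rough illustration of the time-to-TXG mapping (not the code in this PR), the sketch below assumes the TXG database can be viewed as a list of `(unix_time, txg)` records sorted by time. The names `UNIT_SECONDS`, `to_seconds`, and `first_txg_at_or_after`, the unit suffixes, and the sample data are all hypothetical.

```python
import bisect

# Hypothetical unit table for the three units mentioned above.
UNIT_SECONDS = {"s": 1, "h": 3600, "d": 86400}


def to_seconds(value, unit):
    """Convert a user-supplied interval to seconds."""
    return value * UNIT_SECONDS[unit]


def first_txg_at_or_after(records, timestamp):
    """Return the TXG of the first record at or after `timestamp`.

    `records` is a list of (unix_time, txg) tuples sorted by time.
    Returns None when every record is older than `timestamp`.
    """
    times = [when for when, _ in records]
    i = bisect.bisect_left(times, timestamp)
    return records[i][1] if i < len(records) else None


# Illustrative data only: one record per hour, TXG numbers are made up.
records = [(1000 + 3600 * n, 500 + 10 * n) for n in range(6)]

# A two-hour window measured back from the newest record.
cutoff = records[-1][0] - to_seconds(2, "h")
start_txg = first_txg_at_or_after(records, cutoff)
print(start_txg)
```

The resulting TXG would play the role of the lower bound of the scrub range, with the upper bound left at the newest TXG. How the real database stores and rounds its records is not shown here.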

Using time instead of TXG introduces an interesting problem: if the pool has been offline for days or weeks, and the user runs zpool clear -s, it may do nothing, since no recent data has been written. To handle this case, we decided to take the last known TXG from the database and measure the time interval back from that point, rather than from the current time.
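The effect of that choice can be seen in a small sketch. It is an assumption-laden model, not the PR's implementation: the database is modelled as a sorted list of `(unix_time, txg)` records, and `scrub_start_txg` is a hypothetical helper.

```python
def scrub_start_txg(records, interval_s, anchor):
    """Return the first TXG recorded at or after `anchor - interval_s`.

    `records` is a list of (unix_time, txg) tuples sorted by time.
    Returns None when no record falls inside the window.
    """
    cutoff = anchor - interval_s
    for when, txg in records:
        if when >= cutoff:
            return txg
    return None


# Illustrative data only: six hourly records, then the pool sits offline for a week.
records = [(1000 + 3600 * n, 500 + 10 * n) for n in range(6)]
now = records[-1][0] + 7 * 86400
four_hours = 4 * 3600

# Anchored at the current time, the window contains no records at all.
from_now = scrub_start_txg(records, four_hours, now)

# Anchored at the last known record, the window covers the final writes.
from_last = scrub_start_txg(records, four_hours, records[-1][0])

print(from_now, from_last)
```

With the current time as the anchor, the window holds nothing to scrub. Anchoring at the last known record still yields a usable starting TXG, no matter how long the pool was offline.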

How Has This Been Tested?

New tests have been added for this feature.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Sep 18, 2025
@oshogbo oshogbo changed the title from "Oshogbo/scrub recent" to "Scrub recent data" Sep 18, 2025
@oshogbo oshogbo force-pushed the oshogbo/scrub_recent branch 4 times, most recently from 4214a52 to d4de3ca September 18, 2025 11:20
@oshogbo oshogbo marked this pull request as ready for review September 18, 2025 14:56
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Sep 18, 2025
@robn robn self-requested a review September 30, 2025 00:33
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Signed-off-by: Mariusz Zaborski <[email protected]>
@robn robn force-pushed the oshogbo/scrub_recent branch from d4de3ca to 30edf88 October 20, 2025 21:48
@robn
Member

robn commented Oct 20, 2025

The last push was a rebase onto master to get fresh test results.

Recent data is defined as the last known point in the TXG database,
minus a user-defined time interval (default: 4h).

This feature can be triggered using either of the following commands:
`zpool clear -s` or `zpool scrub -R`.

Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Signed-off-by: Mariusz Zaborski <[email protected]>
@amotin
Member

amotin commented Oct 22, 2025

I am not sure about the motivation for this change, or, accordingly, whether this is the right solution. If a device was lost but was redundant and did not cause pool suspension, then ZFS should already have recorded the TXGs during which the device was offline and not properly updated, so that it can start a resilver for that period when zpool clear restores it. If the pool was suspended, then I don't think it should be measured in terms of time, but instead in terms of the last one or several(?) TXGs that could possibly be affected by the device cache loss. And I'd think it should not be optional, but done on any pool resume of that kind. Though if some writes were lost and the pool does not have redundancy, I am not sure how much good this scrub can do, other than notify the user that the pool may be (fatally) corrupted.
