Remove executed_in_epoch table. #21477

Merged 4 commits into main on Mar 13, 2025

Conversation

mystenmark (Contributor):

The table is replaced with:

  • An in-memory "dirty set" which holds executed but un-checkpointed
    transaction digests. Transactions are removed from the dirty set by
    CheckpointExecutor.
  • An additional bounded cache intended to reduce the number of db reads
    by CheckpointBuilder.
  • Last-resort reads go to the `executed_transactions_to_checkpoint` table.

The only purpose of the table was to allow CheckpointBuilder to prune
transaction dependencies from prior epochs, and the above approach
suffices while removing a surprisingly expensive source of db writes.
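
To make the new read path concrete, here is a minimal Rust sketch of the three-tier lookup described above. All names here (`ExecutedInEpochIndex`, the `db_contains` callback) are hypothetical stand-ins rather than the actual sui-core types, and the real structures must also be concurrency-safe, which this sketch ignores:

```rust
use std::collections::{HashSet, VecDeque};

// Stand-in for the transaction digest type.
type TxDigest = [u8; 32];

/// Hypothetical illustration of the dirty set + bounded cache + db fallback.
struct ExecutedInEpochIndex {
    /// Executed but not yet checkpointed; entries are removed once
    /// CheckpointExecutor processes the containing checkpoint.
    dirty: HashSet<TxDigest>,
    /// Bounded cache of recently checkpointed digests, to spare
    /// CheckpointBuilder a db read in the common case.
    cache: HashSet<TxDigest>,
    /// FIFO eviction order for the bounded cache.
    cache_order: VecDeque<TxDigest>,
    cache_capacity: usize,
}

impl ExecutedInEpochIndex {
    fn new(cache_capacity: usize) -> Self {
        Self {
            dirty: HashSet::new(),
            cache: HashSet::new(),
            cache_order: VecDeque::new(),
            cache_capacity,
        }
    }

    /// Record a transaction as executed in this epoch (no db write needed).
    fn insert_executed(&mut self, digest: TxDigest) {
        self.dirty.insert(digest);
    }

    /// Move a digest from the dirty set into the bounded cache once its
    /// checkpoint is executed, evicting the oldest cached entry when full.
    fn mark_checkpointed(&mut self, digest: TxDigest) {
        if self.dirty.remove(&digest) {
            if self.cache_order.len() == self.cache_capacity {
                if let Some(old) = self.cache_order.pop_front() {
                    self.cache.remove(&old);
                }
            }
            self.cache.insert(digest);
            self.cache_order.push_back(digest);
        }
    }

    /// Lookup order: dirty set, then cache, then (last resort) the
    /// executed_transactions_to_checkpoint table, modeled here as a closure.
    fn is_executed_in_epoch(
        &self,
        digest: &TxDigest,
        db_contains: impl Fn(&TxDigest) -> bool,
    ) -> bool {
        self.dirty.contains(digest) || self.cache.contains(digest) || db_contains(digest)
    }
}

fn main() {
    let mut idx = ExecutedInEpochIndex::new(2);
    let d = [1u8; 32];
    idx.insert_executed(d);
    assert!(idx.is_executed_in_epoch(&d, |_| false)); // hit in dirty set
    idx.mark_checkpointed(d);
    assert!(idx.is_executed_in_epoch(&d, |_| false)); // hit in cache
}
```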

mystenmark requested review from aschran and andll March 12, 2025 21:49

}
batch.write()?;

if !matches!(tx_key, TransactionKey::Digest(_)) {
Contributor:

can this now be combined with above?

}

// TODO: should this be debug_fatal? It's potentially very serious in that it could
Contributor:

On the question of "do we want this to halt the network?", my initial/gut feeling is yes, so that further damage does not continue to accrue and make it even harder to recover while we are trying to debug what happened and verify that it will not happen again.

mystenmark (Author):

At the point at which this happens we are already most of the way through reconfig. The only damage will be that a transaction which achieved settlement finality has been dropped. Which is very bad for the sender, but it doesn't actually "damage" the chain in any other respect.

I'm comfortable leaving this as fatal because I believe the tests will catch such a bug before it makes it to prod. But if it did happen, I'm pretty sure the only thing we could do would be to push a fix which removes this `fatal!`. So that makes me wonder if there's any point.
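
For readers unfamiliar with the two macros being debated, here is a simplified stand-in that captures the trade-off, assuming the usual semantics (always panic vs. panic only in debug/test builds); this is not Sui's actual implementation:

```rust
// Always halts this node, regardless of build profile.
#[allow(unused_macros)]
macro_rules! fatal {
    ($($arg:tt)*) => { panic!($($arg)*) };
}

// Panics in debug/test builds so the tests catch the bug before prod;
// in release builds it logs the error and lets the node keep running.
macro_rules! debug_fatal {
    ($($arg:tt)*) => {
        if cfg!(debug_assertions) {
            panic!($($arg)*);
        } else {
            eprintln!("debug_fatal: {}", format_args!($($arg)*));
        }
    };
}

fn main() {
    // In a release build this line logs and continues; in a debug build
    // (or under test) it panics, halting the node.
    debug_fatal!("executed transaction missing from checkpoint at epoch end");
}
```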

self.executed_in_epoch_cache.insert(tx_digest, ());
}

// Called by CheckpointExecutor
Contributor:

nit: sort of? Proximately it's called by AuthorityPerEpochStore. I think this comment makes it a bit confusing: you expect to see a call to this fn in CheckpointExecutor, but it's not there.

mystenmark (Author):

Well, I guess I meant to indicate which component has a need for this method. I can rewrite the comment to be clearer.

andll (Contributor) commented Mar 13, 2025:

Looking at this PR from the perspective of "how can we test it better": should we try to reduce the cache size to, say, 1-2 elements for the simtest runs, to make sure we exercise the path with cache eviction? It feels like with 50k cache capacity it might never even hit this code path.

mystenmark (Author):

> Looking at this PR from the perspective of "how can we test it better": should we try to reduce the cache size to, say, 1-2 elements for the simtest runs, to make sure we exercise the path with cache eviction? It feels like with 50k cache capacity it might never even hit this code path.

Very good point.
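
A minimal sketch of what andll's suggestion might look like, assuming Sui's `msim` cfg flag for simtest builds; the function name and the exact capacities are hypothetical:

```rust
// Choose the cache capacity at construction time. Shrinking it under
// simtest builds forces constant eviction, so every simtest run exercises
// the fallback db-read path instead of always hitting the cache.
#[allow(unexpected_cfgs)] // `msim` is set via build flags in simtest runs
fn executed_in_epoch_cache_capacity() -> usize {
    if cfg!(msim) {
        2 // tiny capacity: eviction on nearly every insert
    } else {
        50_000 // production-sized cache
    }
}

fn main() {
    println!("cache capacity: {}", executed_in_epoch_cache_capacity());
}
```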

mystenmark force-pushed the mlogan-remove-executed-in-epoch branch from 270ae79 to eb5eaee on March 13, 2025 20:40
mystenmark enabled auto-merge (squash) March 13, 2025 20:41
mystenmark force-pushed the mlogan-remove-executed-in-epoch branch from eb5eaee to 7947247 on March 13, 2025 21:13
mystenmark merged commit efdbe99 into main Mar 13, 2025
46 of 47 checks passed
mystenmark deleted the mlogan-remove-executed-in-epoch branch March 13, 2025 21:40