perf(tree): StateRootTask performance regression on high-throughput chains after TrieInput removal #22823

@constwz

Description

After the architectural migration from ConsistentDbView + TrieInput to OverlayStateProviderFactory + LazyOverlay (around v1.10.x), state root computation performance has significantly regressed on high-throughput chains (e.g., BSC with 0.45-second block time and 1000+ TPS).

Observed behavior:

  • reth_sync_block_validation_state_root reaches ~7 seconds at 1000 TPS (~450 tx/block)
  • The previous architecture (pre-v1.10, using ConsistentDbView + TrieInput) sustained 1800 TPS on the same hardware
  • State root must complete within 0.45s to keep up with chain tip, but takes ~15x longer than the budget

Root Cause Analysis

1. Removal of the prefix_sets guard for StateRootTask

In the previous architecture, there was a critical guard in validate_block_with_state:

// Old code (pre-OverlayFactory migration)
let trie_input = self.compute_trie_input(persisting_kind, ...);

// Use state root task only if prefix sets are empty, otherwise proof generation is
// too expensive because it requires walking over the paths in the prefix set in every proof.
if trie_input.prefix_sets.is_empty() {
    self.payload_processor.spawn(/* StateRootTask */)
} else {
    use_state_root_task = false;
    self.payload_processor.spawn_cache_exclusive(/* fallback to Parallel */)
}

This guard was added in #14729 to fix #14683, which found that large prefix_sets from ancestor blocks force each multiproof request to walk every path in the prefix set, making proof generation extremely expensive.

In the current architecture, plan_state_root_computation() no longer checks prefix sets:

// Current code (v1.10+)
const fn plan_state_root_computation(&self) -> StateRootStrategy {
    if self.config.skip_state_root_validation() || self.config.state_root_fallback() {
        StateRootStrategy::Synchronous
    } else if self.config.use_state_root_task() {
        StateRootStrategy::StateRootTask  // Always used, no prefix_sets check
    } else {
        StateRootStrategy::Parallel
    }
}

The StateRootTask is now always selected when use_state_root_task() is true, regardless of how many unpersisted ancestor blocks exist in memory.

2. Per-worker provider creation overhead

The old ProofTaskManager shared ConsistentDbView read transactions across proof workers and used pre-computed nodes_sorted/state_sorted from TrieInput via Arc.

The new ProofWorkerHandle calls OverlayStateProviderFactory::database_provider_ro() once per proof worker, and each call involves:

  • Creating a new DB read transaction
  • Resolving the LazyOverlay (computing trie data from in-memory ancestor blocks)
  • Constructing the overlay state provider

With ~450 transactions per block at 0.45s block time generating many proof requests, this overhead becomes significant.
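The sharing pattern that the old design relied on can be sketched with the standard library alone. This is a minimal illustration, not reth code: `SharedLazyOverlay`, `ResolvedOverlay`, and `run_workers` are hypothetical names, and the "resolution" here is a trivial stand-in for computing trie data from in-memory ancestor blocks. The point is only that a `OnceLock` behind an `Arc` lets every worker reuse one resolution instead of repeating it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Hypothetical stand-in for trie data resolved from in-memory ancestor blocks.
struct ResolvedOverlay {
    ancestor_blocks: usize,
}

/// Hypothetical lazily resolved overlay (not reth's actual `LazyOverlay`).
struct SharedLazyOverlay {
    ancestor_blocks: usize,
    resolved: OnceLock<Arc<ResolvedOverlay>>,
    /// Counts how many times the expensive resolution actually ran.
    resolutions: AtomicUsize,
}

impl SharedLazyOverlay {
    fn new(ancestor_blocks: usize) -> Self {
        Self {
            ancestor_blocks,
            resolved: OnceLock::new(),
            resolutions: AtomicUsize::new(0),
        }
    }

    /// Resolves on the first call; every later caller gets a cheap `Arc` clone.
    fn resolve(&self) -> Arc<ResolvedOverlay> {
        Arc::clone(self.resolved.get_or_init(|| {
            self.resolutions.fetch_add(1, Ordering::Relaxed);
            Arc::new(ResolvedOverlay {
                ancestor_blocks: self.ancestor_blocks,
            })
        }))
    }

    fn resolution_count(&self) -> usize {
        self.resolutions.load(Ordering::Relaxed)
    }
}

/// Spawns `workers` threads that each request the overlay and returns
/// how many resolutions were performed in total.
fn run_workers(workers: usize, ancestor_blocks: usize) -> usize {
    let overlay = Arc::new(SharedLazyOverlay::new(ancestor_blocks));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let overlay = Arc::clone(&overlay);
            thread::spawn(move || overlay.resolve().ancestor_blocks)
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap(), ancestor_blocks);
    }
    overlay.resolution_count()
}

fn main() {
    // Eight workers share a single resolution.
    println!("resolutions: {}", run_workers(8, 64));
}
```

Whether the real overlay can be shared this way depends on how the provider ties the overlay to its read transaction, which this sketch does not model.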

3. Why high-throughput chains are disproportionately affected

| Parameter | Ethereum Mainnet | BSC (current) |
| --- | --- | --- |
| Block time | 12 seconds | 0.45 seconds |
| TX per block | ~200 | ~450 (at 1000 TPS) |
| Blocks per second | ~0.08 | ~2.2 |
| Unpersisted ancestor blocks | Few | Very many (persistence can't keep up) |
| State root time budget | 12s | 0.45s |

With 0.45s block time, the chain produces ~2.2 blocks/second. The persistence service cannot write to disk fast enough, causing many blocks to accumulate in memory. Each block's overlay data compounds, making the OverlayStateProviderFactory's LazyOverlay resolution increasingly expensive.
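The figures above follow from simple arithmetic, which the sketch below reproduces. The function names are illustrative, and the persistence rate passed to `backlog_growth_per_sec` is a made-up placeholder, since the issue does not report a measured value.

```rust
/// Blocks produced per second for a given block time.
fn blocks_per_second(block_time_secs: f64) -> f64 {
    1.0 / block_time_secs
}

/// Transactions per block at a given throughput and block time.
fn tx_per_block(tps: f64, block_time_secs: f64) -> f64 {
    tps * block_time_secs
}

/// How many times the state root duration exceeds the per-block budget.
fn budget_overrun(state_root_secs: f64, block_time_secs: f64) -> f64 {
    state_root_secs / block_time_secs
}

/// Net growth of the unpersisted in-memory backlog, in blocks per second.
/// `persisted_blocks_per_sec` is a hypothetical input, not a measured number.
fn backlog_growth_per_sec(block_time_secs: f64, persisted_blocks_per_sec: f64) -> f64 {
    (blocks_per_second(block_time_secs) - persisted_blocks_per_sec).max(0.0)
}

fn main() {
    let block_time = 0.45;
    println!("blocks/s:       {:.2}", blocks_per_second(block_time));
    println!("tx/block:       {:.0}", tx_per_block(1000.0, block_time));
    println!("budget overrun: {:.1}x", budget_overrun(7.0, block_time));
    // Assumed persistence rate of 1 block/s, purely for illustration.
    println!("backlog growth: {:.2} blocks/s", backlog_growth_per_sec(block_time, 1.0));
}
```

Any persistence rate below roughly 2.2 blocks per second leaves a backlog that grows without bound, which is the regime the issue describes.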

The new architecture works well for Ethereum mainnet's workload (~200 tx/block, 12s block time). But on chains with sub-second block times, the accumulated overlay overhead and per-worker provider creation cost become a critical bottleneck.

Relationship to Previous Issues

  • #14683: "TrieInput with large prefix sets slows down State Root Task multiproofs" — this was the original bug, fixed by adding the prefix_sets guard in #14729
  • #14417: "State Root Task has >500ms spikes of newPayload latency on Base" — tracking issue for the same class of problems

The fix in #14729 was effective, but when TrieInput was replaced by OverlayStateProviderFactory, the guard was not ported to the new architecture.

Suggested Solutions

  1. Reintroduce a prefix_sets-like guard: When the overlay contains significant trie data from ancestor blocks (many unpersisted blocks), fall back to StateRootStrategy::Parallel instead of StateRootTask.

  2. Optimize OverlayStateProviderFactory for proof workers: Cache the resolved overlay and share it across proof workers instead of creating independent providers per worker. The old ConsistentDbView approach was more efficient because it shared data via Arc.

  3. Add a configurable threshold: Allow chains to configure when to use StateRootTask vs Parallel based on expected transaction throughput or block time.
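A rough sketch of how suggestions 1 and 3 could combine is shown below. It is self-contained and does not use reth's types: the `TreeConfig` struct, its `max_unpersisted_blocks_for_task` field, and the `unpersisted_ancestors` parameter are all hypothetical, and the number of unpersisted ancestor blocks is only one possible proxy for overlay size.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StateRootStrategy {
    Synchronous,
    StateRootTask,
    Parallel,
}

/// Hypothetical config mirroring the flags used in the issue's snippet.
struct TreeConfig {
    skip_state_root_validation: bool,
    state_root_fallback: bool,
    use_state_root_task: bool,
    /// Hypothetical knob: the largest in-memory ancestor count for which
    /// the state root task is still selected.
    max_unpersisted_blocks_for_task: usize,
}

/// Picks a strategy, falling back to `Parallel` when too many ancestor
/// blocks are still unpersisted (a stand-in for the old prefix_sets guard).
fn plan_state_root_computation(
    config: &TreeConfig,
    unpersisted_ancestors: usize,
) -> StateRootStrategy {
    if config.skip_state_root_validation || config.state_root_fallback {
        StateRootStrategy::Synchronous
    } else if config.use_state_root_task
        && unpersisted_ancestors <= config.max_unpersisted_blocks_for_task
    {
        StateRootStrategy::StateRootTask
    } else {
        StateRootStrategy::Parallel
    }
}

fn main() {
    let config = TreeConfig {
        skip_state_root_validation: false,
        state_root_fallback: false,
        use_state_root_task: true,
        max_unpersisted_blocks_for_task: 4, // illustrative value only
    };
    for ancestors in [0, 4, 5, 100] {
        println!("{ancestors:>3} ancestors -> {:?}", plan_state_root_computation(&config, ancestors));
    }
}
```

A threshold of zero would approximate the old behavior, where the task was used only when no ancestor data was pending; a sensible default would need benchmarking on both mainnet and sub-second chains.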

Platform

  • reth version: v1.10.2 (via bnb-chain/reth fork), confirmed same pattern in v1.11.1 main branch
  • Chain: BSC (BNB Smart Chain), 0.45-second block time
  • Workload: 1000 TPS benchmark (~450 tx/block)
  • Hardware: Standard validator-grade server

Metrics

# At 1000 TPS on BSC (0.45s block time, ~450 tx/block):
reth_sync_block_validation_state_root_duration: ~7s  (budget: 0.45s)

# Previous version (pre-OverlayFactory, with prefix_sets guard):
# Successfully handled 1800 TPS on same hardware
