perf(tree): StateRootTask performance regression on high-throughput chains after TrieInput removal #22823

## Description

After the architectural migration from `ConsistentDbView` + `TrieInput` to `OverlayStateProviderFactory` + `LazyOverlay` (around v1.10.x), state root computation performance has significantly regressed on high-throughput chains (e.g., BSC with **0.45-second block time** and 1000+ TPS).

**Observed behavior:**
- `reth_sync_block_validation_state_root_duration` reaches **~7 seconds** at 1000 TPS (~450 tx/block)
- The same workload ran at **1800 TPS** on the previous architecture (pre-v1.10, using `ConsistentDbView` + `TrieInput`)
- The state root must complete within 0.45s to keep up with the chain tip, but it takes **~15x longer** than the budget

## Root Cause Analysis

### 1. Removal of the `prefix_sets` guard for `StateRootTask`

In the previous architecture, there was a critical guard in `validate_block_with_state`:

```rust
// Old code (pre-OverlayFactory migration)
let trie_input = self.compute_trie_input(persisting_kind, ...);

// Use state root task only if prefix sets are empty, otherwise proof generation is
// too expensive because it requires walking over the paths in the prefix set in every proof.
if trie_input.prefix_sets.is_empty() {
    self.payload_processor.spawn(/* StateRootTask */)
} else {
    use_state_root_task = false;
    self.payload_processor.spawn_cache_exclusive(/* fallback to Parallel */)
}
```

This guard was added to fix [#14683](https://github.com/paradigmxyz/reth/issues/14683) (closed by [#14729](https://github.com/paradigmxyz/reth/pull/14729)), which identified that large `prefix_sets` from ancestor blocks force **each multiproof request to walk all paths in the prefix set**, making proof generation extremely expensive.

In the current architecture, `plan_state_root_computation()` no longer checks prefix sets:

```rust
// Current code (v1.10+)
const fn plan_state_root_computation(&self) -> StateRootStrategy {
    if self.config.skip_state_root_validation() || self.config.state_root_fallback() {
        StateRootStrategy::Synchronous
    } else if self.config.use_state_root_task() {
        StateRootStrategy::StateRootTask  // Always used, no prefix_sets check
    } else {
        StateRootStrategy::Parallel
    }
}
```

The `StateRootTask` is now always selected when `use_state_root_task()` is true, regardless of how many unpersisted ancestor blocks exist in memory.

### 2. Per-worker provider creation overhead

The old `ProofTaskManager` shared `ConsistentDbView` read transactions across proof workers and used pre-computed `nodes_sorted`/`state_sorted` from `TrieInput` via `Arc`.

The new `ProofWorkerHandle` creates a new `OverlayStateProviderFactory::database_provider_ro()` for each proof worker, which involves:
- Creating a new DB read transaction
- Resolving the `LazyOverlay` (computing trie data from in-memory ancestor blocks)
- Constructing the overlay state provider

With ~450 transactions per block at 0.45s block time generating many proof requests, this overhead becomes significant.
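One way to avoid paying the overlay resolution once per worker is to resolve it a single time and hand out `Arc` clones. The following is a minimal sketch under stated assumptions: `ResolvedOverlay`, `SharedOverlayFactory`, and `resolutions_for_workers` are hypothetical names for illustration and are not reth's `OverlayStateProviderFactory` or `LazyOverlay` APIs. It covers only the overlay; each worker would still need its own DB read transaction.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Stand-in for the trie data resolved from in-memory ancestor blocks.
/// Hypothetical type; not reth's actual overlay representation.
#[derive(Debug)]
struct ResolvedOverlay {
    ancestor_blocks: usize,
}

/// Hypothetical factory that resolves the overlay at most once and hands
/// out cheap `Arc` clones to every proof worker.
struct SharedOverlayFactory {
    ancestor_blocks: usize,
    resolved: OnceLock<Arc<ResolvedOverlay>>,
    /// Counts how often the expensive resolution actually ran.
    resolutions: AtomicUsize,
}

impl SharedOverlayFactory {
    fn new(ancestor_blocks: usize) -> Self {
        Self {
            ancestor_blocks,
            resolved: OnceLock::new(),
            resolutions: AtomicUsize::new(0),
        }
    }

    /// Returns the shared overlay, resolving it on first use only.
    /// Concurrent callers block until the single initialization finishes.
    fn overlay(&self) -> Arc<ResolvedOverlay> {
        Arc::clone(self.resolved.get_or_init(|| {
            self.resolutions.fetch_add(1, Ordering::Relaxed);
            // The expensive merge of ancestor trie data would happen here.
            Arc::new(ResolvedOverlay {
                ancestor_blocks: self.ancestor_blocks,
            })
        }))
    }
}

/// Spawns `workers` threads that each fetch the overlay, and returns how
/// many times the overlay was resolved in total.
fn resolutions_for_workers(workers: usize, ancestor_blocks: usize) -> usize {
    let factory = Arc::new(SharedOverlayFactory::new(ancestor_blocks));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let factory = Arc::clone(&factory);
            thread::spawn(move || factory.overlay().ancestor_blocks)
        })
        .collect();
    for handle in handles {
        // Every worker sees the same resolved data.
        assert_eq!(handle.join().unwrap(), ancestor_blocks);
    }
    factory.resolutions.load(Ordering::Relaxed)
}
```

Calling `resolutions_for_workers(8, 64)` runs the resolution closure once regardless of worker count, because `OnceLock::get_or_init` guarantees a single initialization. Whether the real overlay can be shared this way depends on its lifetime and invalidation rules, which this sketch does not model.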

### 3. Why high-throughput chains are disproportionately affected

| Parameter | Ethereum Mainnet | BSC (current) |
|-----------|-----------------|---------------|
| Block time | 12 seconds | **0.45 seconds** |
| TX per block | ~200 | ~450 (at 1000 TPS) |
| Blocks per second | ~0.08 | **~2.2** |
| Unpersisted ancestor blocks | Few | **Very many** (persistence can't keep up) |
| State root time budget | 12s | **0.45s** |

With 0.45s block time, the chain produces ~2.2 blocks/second. The persistence service cannot write to disk fast enough, causing many blocks to accumulate in memory. Each block's overlay data compounds, making the `OverlayStateProviderFactory`'s `LazyOverlay` resolution increasingly expensive.

The new architecture works well for Ethereum mainnet's workload (~200 tx/block, 12s block time). But on chains with sub-second block times, the accumulated overlay overhead and per-worker provider creation cost become a critical bottleneck.
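The figures in the table follow from simple arithmetic on block time, throughput, and the observed state root duration. The helper names below are illustrative only; the inputs are the numbers reported in this issue.

```rust
/// Blocks produced per second for a given block time in seconds.
fn blocks_per_second(block_time_secs: f64) -> f64 {
    1.0 / block_time_secs
}

/// Transactions per block at a sustained transactions-per-second rate.
fn tx_per_block(tps: f64, block_time_secs: f64) -> f64 {
    tps * block_time_secs
}

/// How many times the observed state root duration exceeds the
/// per-block time budget (taken here to be the block time).
fn overrun_factor(state_root_secs: f64, block_time_secs: f64) -> f64 {
    state_root_secs / block_time_secs
}
```

With a 0.45s block time, `blocks_per_second(0.45)` is about 2.2 and `tx_per_block(1000.0, 0.45)` is 450. `overrun_factor(7.0, 0.45)` is about 15.6, which also means roughly 15 new blocks arrive while a single 7-second state root computation is still running.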

## Relationship to Previous Issues

- [#14683](https://github.com/paradigmxyz/reth/issues/14683): "`TrieInput` with large prefix sets slows down State Root Task multiproofs". This was the original bug, fixed by adding the `prefix_sets` guard in [#14729](https://github.com/paradigmxyz/reth/pull/14729)
- [#14417](https://github.com/paradigmxyz/reth/issues/14417): "State Root Task has >500ms spikes of `newPayload` latency on Base" — tracking issue for the same class of problems

The fix in #14729 was effective, but when `TrieInput` was replaced by `OverlayStateProviderFactory`, the guard was not ported to the new architecture.

## Suggested Solutions

1. **Reintroduce a `prefix_sets`-like guard**: When the overlay contains significant trie data from ancestor blocks (many unpersisted blocks), fall back to `StateRootStrategy::Parallel` instead of `StateRootTask`.

2. **Optimize `OverlayStateProviderFactory` for proof workers**: Cache the resolved overlay and share it across proof workers instead of creating independent providers per worker. The old `ConsistentDbView` approach was more efficient because it shared data via `Arc`.

3. **Add a configurable threshold**: Allow chains to configure when to use `StateRootTask` vs `Parallel` based on expected transaction throughput or block time.
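
Suggestions 1 and 3 could be combined roughly as follows. This is a sketch, not a patch against reth: `StateRootConfig`, its fields, and `max_unpersisted_ancestors_for_task` are hypothetical, and the count of unpersisted ancestor blocks is assumed to be an acceptable proxy for overlay size. `StateRootStrategy` is redefined locally so the example compiles on its own.

```rust
/// Mirrors the strategy names quoted in this issue; redefined here so the
/// sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StateRootStrategy {
    StateRootTask,
    Parallel,
    Synchronous,
}

/// Hypothetical configuration; field names are assumptions, not reth's config type.
#[derive(Debug, Clone, Copy)]
struct StateRootConfig {
    skip_state_root_validation: bool,
    state_root_fallback: bool,
    use_state_root_task: bool,
    /// Hypothetical knob: above this many unpersisted ancestor blocks,
    /// `StateRootTask` is skipped in favor of `Parallel`.
    max_unpersisted_ancestors_for_task: usize,
}

/// Same decision order as the current `plan_state_root_computation`, plus a
/// guard on the number of in-memory ancestor blocks.
fn plan_state_root_computation(
    config: &StateRootConfig,
    unpersisted_ancestors: usize,
) -> StateRootStrategy {
    if config.skip_state_root_validation || config.state_root_fallback {
        StateRootStrategy::Synchronous
    } else if config.use_state_root_task
        && unpersisted_ancestors <= config.max_unpersisted_ancestors_for_task
    {
        StateRootStrategy::StateRootTask
    } else {
        // Overlay is assumed too large for cheap multiproofs; fall back.
        StateRootStrategy::Parallel
    }
}
```

A chain with sub-second blocks could set a low threshold so that `StateRootTask` is used only when persistence has caught up, while a chain like Ethereum mainnet could keep a high threshold and retain the current behavior. A suitable default would need to be measured rather than assumed.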

## Platform

- reth version: v1.10.2 (via bnb-chain/reth fork), confirmed same pattern in v1.11.1 main branch
- Chain: BSC (BNB Smart Chain), **0.45-second block time**
- Workload: 1000 TPS benchmark (~450 tx/block)
- Hardware: Standard validator-grade server

## Metrics

```
# At 1000 TPS on BSC (0.45s block time, ~450 tx/block):
reth_sync_block_validation_state_root_duration: ~7s  (budget: 0.45s)

# Previous version (pre-OverlayFactory, with prefix_sets guard):
# Successfully handled 1800 TPS on same hardware
```

