feat: sequencer auto-recovery after an unexpected shutdown #166

Open · wants to merge 9 commits into develop
Conversation

@krish-nr (Contributor) commented on Sep 9, 2024

Description

This PR aims to fix the issue where the sequencer node fails to recover after a crash (specifically when the sequencer is a PBSS node).
Related node PR

Rationale

When a sequencer node crashes, it may fail to persist the journal in time. As a result, when Geth is restarted, the journal data cannot be read, leading to the loss of recent state data. This causes the sequencer to fail during the buildPayload process, rendering it unable to continue operating. The diagram below illustrates the sequencer block production flow, with the red sections highlighting the logic that cannot proceed due to the crash.

In the diagram, sequencerAction alternates between stage (1), where the payload is constructed, and stage (2), where the data is persisted. The two stages are not synchronously coupled: in stage (1), filling the payload with transactions and other work runs asynchronously, while the payload object itself is returned immediately and coordinated through a condition variable (cond). Until the update in stage (1) completes, getPayload in stage (2) therefore blocks in cond.Wait.
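
To make the hand-off concrete, here is a minimal, self-contained Go sketch of the cond-based pattern described above. The names (Payload, update, Resolve) loosely mirror the miner/payload_building.go pattern, but this is an illustration under simplified assumptions, not the PR's code.

```go
// Minimal sketch of the cond-based hand-off between stage (1) and stage (2).
package main

import (
	"fmt"
	"sync"
	"time"
)

// Payload holds the block being built; full is filled asynchronously.
type Payload struct {
	lock sync.Mutex
	cond *sync.Cond
	full string // stands in for the fully built block
}

func newPayload() *Payload {
	p := &Payload{}
	p.cond = sync.NewCond(&p.lock)
	return p
}

// update is stage (1): it fills the payload asynchronously and wakes waiters.
func (p *Payload) update(block string) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.full = block
	p.cond.Broadcast() // wake anyone blocked in Resolve/getPayload
}

// Resolve is stage (2): getPayload blocks here until update has run.
// If update never runs (e.g. prepareWork failed after a crash), this waits forever.
func (p *Payload) Resolve() string {
	p.lock.Lock()
	defer p.lock.Unlock()
	for p.full == "" {
		p.cond.Wait()
	}
	return p.full
}

func main() {
	p := newPayload()
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate async tx filling
		p.update("block 34123457")
	}()
	fmt.Println(p.Resolve()) // blocks until update completes
}
```

The key point is that Resolve (getPayload) can only return after update has run; if update is never called, the wait never ends.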

After the sequencer crashes, the recent state data is lost. Suppose, for example, that the crash happens at block height 34123456 and the next block to be built is 34123457: prepareWork for block 34123457 depends on the state of block 34123456, but that state was lost in the crash, so prepareWork fails and the update logic is never reached. As a result, the process remains blocked indefinitely in getPayload.
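
As an illustration of the failure point, the sketch below shows a prepareWork-style check refusing to proceed when the parent state is missing. The stateDB map and the error text are assumptions made for the example, not geth's actual API.

```go
// Hypothetical sketch of why prepareWork fails after the crash: it needs the
// parent block's state, which was lost when the journal was not persisted.
package main

import (
	"errors"
	"fmt"
)

// After the crash, the state for block 34123456 is absent from this map.
var stateDB = map[uint64]bool{}

func prepareWork(parent uint64) error {
	if !stateDB[parent] {
		// Without the parent state, update() is never called and
		// getPayload stays blocked in cond.Wait.
		return fmt.Errorf("missing parent state for block %d: %w",
			parent, errors.New("state not found"))
	}
	return nil
}

func main() {
	if err := prepareWork(34123456); err != nil {
		fmt.Println("prepareWork failed:", err)
	}
}
```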

[Diagram: sequencer block production flow; the red sections mark the logic that cannot proceed after the crash.]

The fix adds two routines: one to perform the recovery and one to monitor its progress. When the generate phase fails with specific error conditions, a fix routine is started; to avoid blocking the main process, a separate routine monitors the specific block being repaired. Once the data recovery completes, the update is retried, allowing the sequencer to restore its pre-crash state and continue making progress.
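
A simplified sketch of that two-routine flow, using hypothetical names (fixManager, StartFix, waitFixed, retryUpdate) rather than the PR's actual identifiers:

```go
// Illustrative sketch: on a generate failure the worker kicks off a fix
// routine, a second routine watches the block being repaired, and once
// recovery finishes the payload update is retried.
package main

import (
	"fmt"
	"time"
)

type fixManager struct {
	fixed chan uint64 // signals which block has been repaired
}

// StartFix launches the repair of the given block without blocking the caller.
func (fm *fixManager) StartFix(block uint64) {
	go func() {
		time.Sleep(200 * time.Millisecond) // stands in for local/peer state recovery
		fm.fixed <- block
	}()
}

// waitFixed monitors recovery progress and retries the payload update once done.
func waitFixed(fm *fixManager, retryUpdate func(uint64)) {
	go func() {
		block := <-fm.fixed
		retryUpdate(block) // unblocks the pending getPayload
	}()
}

func main() {
	fm := &fixManager{fixed: make(chan uint64, 1)}
	done := make(chan struct{})

	fm.StartFix(34123456)
	waitFixed(fm, func(b uint64) {
		fmt.Printf("state for block %d recovered, retrying update\n", b)
		close(done)
	})
	<-done
}
```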

There are two recovery sources: local data and peers (for the sequencer, the peers are its backup nodes). In most cases the data can be recovered locally. There is, however, a corner case where local recovery fails: if the sequencer has already gossiped the latest block to its peers but crashes before persisting it locally, it ends up one block behind them. In that extreme situation the sequencer must recover from a peer.
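
The decision between the two sources could look roughly like the sketch below; recoverLocal and recoverFromPeer are hypothetical helpers standing in for the real recovery code.

```go
// Hedged sketch of the recovery-source decision: rebuild the state from local
// data when possible, and fall back to a backup peer when the local chain is
// behind, i.e. the crashed node already gossiped a block it never persisted.
package main

import "fmt"

func recoverState(localHead, networkHead uint64) string {
	if localHead >= networkHead {
		return recoverLocal(networkHead)
	}
	// Corner case: the block was broadcast to peers but never persisted locally,
	// so the local node is one block behind and must fetch it from a backup peer.
	return recoverFromPeer(networkHead)
}

func recoverLocal(block uint64) string {
	return fmt.Sprintf("recovered state for block %d from local data", block)
}

func recoverFromPeer(block uint64) string {
	return fmt.Sprintf("fetched block %d and its state from a backup peer", block)
}

func main() {
	fmt.Println(recoverState(34123457, 34123457)) // normal case: local recovery
	fmt.Println(recoverState(34123456, 34123457)) // corner case: peer recovery
}
```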

This is the complete sequencer recovery flow, as illustrated below.

[Diagram: complete sequencer recovery flow.]

Example

add an example CLI or API response...

Changes

Notable changes:

  • A fix_manager object has been added to the worker to track the status and progress of the fix process.
  • Added a method to recover the state from local data.
  • Introduced a new downloader method in the Backend interface (a rough sketch of how these pieces fit together follows this list).
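
For orientation only, here is a rough sketch of how these pieces might hang together. The field and method names (Downloader, SyncBlock, recoverFromLocal, recoverFromPeer) are assumptions based on the bullet list above, not the actual diff.

```go
// Assumed shape of the new pieces: a fixManager attached to the worker and a
// downloader accessor on the miner-facing Backend interface.
package main

import "fmt"

// Downloader is a stand-in for the downloader used to fetch the missing block
// (and its state) from a backup peer when local recovery is impossible.
type Downloader interface {
	SyncBlock(number uint64) error
}

// Backend is the miner-facing interface; the PR adds a downloader accessor so
// the fix manager can trigger a peer sync.
type Backend interface {
	Downloader() Downloader
}

type fixManager struct {
	backend Backend
}

// recoverFromLocal stands in for the new method that rebuilds state from local data.
func (fm *fixManager) recoverFromLocal(block uint64) error {
	fmt.Printf("replaying local data to rebuild state for block %d\n", block)
	return nil
}

// recoverFromPeer falls back to the downloader exposed by the Backend interface.
func (fm *fixManager) recoverFromPeer(block uint64) error {
	return fm.backend.Downloader().SyncBlock(block)
}

// Mock implementations so the sketch runs standalone.
type mockDownloader struct{}

func (mockDownloader) SyncBlock(number uint64) error {
	fmt.Printf("syncing block %d from a backup peer\n", number)
	return nil
}

type mockBackend struct{}

func (mockBackend) Downloader() Downloader { return mockDownloader{} }

func main() {
	fm := &fixManager{backend: mockBackend{}}
	_ = fm.recoverFromLocal(34123456)
	_ = fm.recoverFromPeer(34123457)
}
```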
