feat: sequencer auto-recovery after an unexpected shutdown #166

Open · wants to merge 9 commits into develop
Conversation

@krish-nr (Contributor) commented on Sep 9, 2024

Description

This PR aims to fix the issue where the sequencer node fails to recover after a crash (specifically when the sequencer is a PBSS node).
Related node PR

Rationale

When a sequencer node crashes, it may fail to persist the journal in time. As a result, when Geth is restarted, the journal data cannot be read, leading to the loss of recent state data. This causes the sequencer to fail during the buildPayload process, rendering it unable to continue operating. The diagram below illustrates the sequencer block production flow, with the red sections highlighting the logic that cannot proceed due to the crash.

In the diagram, sequencerAction alternates between stage (1), where the payload is constructed, and stage (2), where the data is persisted. The two stages are not synchronously coupled: in stage (1), filling the payload with transactions and other work runs asynchronously, while the payload object itself is returned immediately and coordinated through a condition variable (cond). Until the update in stage (1) completes, getPayload in stage (2) therefore blocks in cond.Wait.
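
To make the hand-off concrete, here is a minimal, self-contained Go sketch of the cond-based pattern described above. The names (Payload, update, Resolve) loosely mirror the miner/payload_building.go pattern, but this is an illustration under simplified assumptions, not the PR's code.

```go
// Minimal sketch of the cond-based hand-off between stage (1) and stage (2).
package main

import (
	"fmt"
	"sync"
	"time"
)

// Payload holds the block being built; full is filled asynchronously.
type Payload struct {
	lock sync.Mutex
	cond *sync.Cond
	full string // stands in for the fully built block
}

func newPayload() *Payload {
	p := &Payload{}
	p.cond = sync.NewCond(&p.lock)
	return p
}

// update is stage (1): it fills the payload asynchronously and wakes waiters.
func (p *Payload) update(block string) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.full = block
	p.cond.Broadcast() // wake anyone blocked in Resolve/getPayload
}

// Resolve is stage (2): getPayload blocks here until update has run.
// If update never runs (e.g. prepareWork failed after a crash), this waits forever.
func (p *Payload) Resolve() string {
	p.lock.Lock()
	defer p.lock.Unlock()
	for p.full == "" {
		p.cond.Wait()
	}
	return p.full
}

func main() {
	p := newPayload()
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate async tx filling
		p.update("block 34123457")
	}()
	fmt.Println(p.Resolve()) // blocks until update completes
}
```

The key point is that Resolve (getPayload) can only return after update has run; if update is never called, the wait never ends.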

After the sequencer crashes, the recent state data is lost. Suppose, for example, that the crash happens at block height 34123456 and the next block to be built is 34123457: prepareWork for block 34123457 depends on the state of block 34123456, but that state was lost in the crash, so prepareWork fails and the update logic is never reached. As a result, the process remains blocked indefinitely in getPayload.
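
As an illustration of the failure point, the sketch below shows a prepareWork-style check refusing to proceed when the parent state is missing. The stateDB map and the error text are assumptions made for the example, not geth's actual API.

```go
// Hypothetical sketch of why prepareWork fails after the crash: it needs the
// parent block's state, which was lost when the journal was not persisted.
package main

import (
	"errors"
	"fmt"
)

// After the crash, the state for block 34123456 is absent from this map.
var stateDB = map[uint64]bool{}

func prepareWork(parent uint64) error {
	if !stateDB[parent] {
		// Without the parent state, update() is never called and
		// getPayload stays blocked in cond.Wait.
		return fmt.Errorf("missing parent state for block %d: %w",
			parent, errors.New("state not found"))
	}
	return nil
}

func main() {
	if err := prepareWork(34123456); err != nil {
		fmt.Println("prepareWork failed:", err)
	}
}
```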

[Diagram: sequencer block production flow; the red sections mark the logic that cannot proceed after the crash.]

The fix adds two routines: one to perform the recovery and one to monitor its progress. When the generate phase fails with specific error conditions, a fix routine is started; to avoid blocking the main process, a separate routine monitors the specific block being repaired. Once the data recovery completes, the update is retried, allowing the sequencer to restore its pre-crash state and continue making progress.
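
A simplified sketch of that two-routine flow, using hypothetical names (fixManager, StartFix, waitFixed, retryUpdate) rather than the PR's actual identifiers:

```go
// Illustrative sketch: on a generate failure the worker kicks off a fix
// routine, a second routine watches the block being repaired, and once
// recovery finishes the payload update is retried.
package main

import (
	"fmt"
	"time"
)

type fixManager struct {
	fixed chan uint64 // signals which block has been repaired
}

// StartFix launches the repair of the given block without blocking the caller.
func (fm *fixManager) StartFix(block uint64) {
	go func() {
		time.Sleep(200 * time.Millisecond) // stands in for local/peer state recovery
		fm.fixed <- block
	}()
}

// waitFixed monitors recovery progress and retries the payload update once done.
func waitFixed(fm *fixManager, retryUpdate func(uint64)) {
	go func() {
		block := <-fm.fixed
		retryUpdate(block) // unblocks the pending getPayload
	}()
}

func main() {
	fm := &fixManager{fixed: make(chan uint64, 1)}
	done := make(chan struct{})

	fm.StartFix(34123456)
	waitFixed(fm, func(b uint64) {
		fmt.Printf("state for block %d recovered, retrying update\n", b)
		close(done)
	})
	<-done
}
```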

There are two recovery sources: local data and peers (for the sequencer, the peers are its backup nodes). In most cases the data can be recovered locally. There is, however, a corner case where local recovery fails: if the sequencer has already gossiped the latest block to its peers but crashes before persisting it locally, it ends up one block behind them. In that extreme situation the sequencer must recover from a peer.
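
The decision between the two sources could look roughly like the sketch below; recoverLocal and recoverFromPeer are hypothetical helpers standing in for the real recovery code.

```go
// Hedged sketch of the recovery-source decision: rebuild the state from local
// data when possible, and fall back to a backup peer when the local chain is
// behind, i.e. the crashed node already gossiped a block it never persisted.
package main

import "fmt"

func recoverState(localHead, networkHead uint64) string {
	if localHead >= networkHead {
		return recoverLocal(networkHead)
	}
	// Corner case: the block was broadcast to peers but never persisted locally,
	// so the local node is one block behind and must fetch it from a backup peer.
	return recoverFromPeer(networkHead)
}

func recoverLocal(block uint64) string {
	return fmt.Sprintf("recovered state for block %d from local data", block)
}

func recoverFromPeer(block uint64) string {
	return fmt.Sprintf("fetched block %d and its state from a backup peer", block)
}

func main() {
	fmt.Println(recoverState(34123457, 34123457)) // normal case: local recovery
	fmt.Println(recoverState(34123456, 34123457)) // corner case: peer recovery
}
```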

This is the complete sequencer recovery flow, as illustrated below.

[Diagram: complete sequencer recovery flow.]

Example

add an example CLI or API response...

Changes

Notable changes:

  • A fix_manager object has been added to the worker to track the status and progress of the fix process.
  • Added a method to recover the state from local data.
  • Introduced a new downloader method in the Backend interface (a rough sketch of how these pieces fit together follows this list).
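
For orientation only, here is a rough sketch of how these pieces might hang together. The field and method names (Downloader, SyncBlock, recoverFromLocal, recoverFromPeer) are assumptions based on the bullet list above, not the actual diff.

```go
// Assumed shape of the new pieces: a fixManager attached to the worker and a
// downloader accessor on the miner-facing Backend interface.
package main

import "fmt"

// Downloader is a stand-in for the downloader used to fetch the missing block
// (and its state) from a backup peer when local recovery is impossible.
type Downloader interface {
	SyncBlock(number uint64) error
}

// Backend is the miner-facing interface; the PR adds a downloader accessor so
// the fix manager can trigger a peer sync.
type Backend interface {
	Downloader() Downloader
}

type fixManager struct {
	backend Backend
}

// recoverFromLocal stands in for the new method that rebuilds state from local data.
func (fm *fixManager) recoverFromLocal(block uint64) error {
	fmt.Printf("replaying local data to rebuild state for block %d\n", block)
	return nil
}

// recoverFromPeer falls back to the downloader exposed by the Backend interface.
func (fm *fixManager) recoverFromPeer(block uint64) error {
	return fm.backend.Downloader().SyncBlock(block)
}

// Mock implementations so the sketch runs standalone.
type mockDownloader struct{}

func (mockDownloader) SyncBlock(number uint64) error {
	fmt.Printf("syncing block %d from a backup peer\n", number)
	return nil
}

type mockBackend struct{}

func (mockBackend) Downloader() Downloader { return mockDownloader{} }

func main() {
	fm := &fixManager{backend: mockBackend{}}
	_ = fm.recoverFromLocal(34123456)
	_ = fm.recoverFromPeer(34123457)
}
```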
