Holesky rescue + hot tree states #7087

Open
wants to merge 4 commits into
base: holesky-rescue

dapplion
Collaborator

@dapplion dapplion commented Mar 7, 2025

Merged these two PRs

into

it's an alternative to

Hoped to be stable but not yet proven. So far I've

@dapplion dapplion added do-not-merge v7.0.0-beta.2 Second hacky Holesky rescue release labels Mar 7, 2025
Collaborator Author

dapplion commented Mar 7, 2025

Some misc notes from these tests

  • store_beacon_hdiff_buffer_from_state_seconds is pretty high, at avg 0.5 sec
  • store_beacon_hdiff_buffer_from_state_seconds registers 788, while store_disk_db_write_count_total shows only 350 writes
  • store_beacon_hdiff_buffer_into_state_seconds is also high at avg 0.5 sec
  • store_hdiff_read_seconds_bucket is really low avg 1 ms
  • store_hdiff_decode_seconds is low avg 10 ms
  • 8 occurrences of MissingHotStateSummary error
  • While syncing I got "FC nodes 976 heads 1". Is that expected? How can we sync the random heads?
  • Is the "Skipping storage of cached state" log a problem? I noticed it occurring before MissingHotStateSummary
    • We got 397 occurrences of "Skipping storage of cached state" and 14818 of "Stored hot state summary and diffs". @michaelsproul
      • NOTE: I modified that code to never skip writes (0f1ef0c)
  • We should log whenever a state is added to or evicted from the cache, so that after a cache miss we can trace when that specific state was evicted
  • Sync got stuck requesting from the same 2 peers, 16Uiu2HAmSZcBZCvScHGrGLxnEWrQep6vgnQ2Pwa8PF7Bs1axkges and 16Uiu2HAm1ZMgSpQVGz3yhNMnXKVHEXeNtUeYwvHFH4CNyKQgc9JC. Both are Nimbus and a bit behind the Holesky head. They returned empty epochs forever, so sync stalled there. We shouldn't fetch from the same peer over and over, as this can stall sync.
  • In my current Holesky sync I have two head chains that are importing the same set of blocks
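
To illustrate the point about not fetching from the same peer over and over, here is a minimal sketch of a selection rule that deprioritizes peers that keep returning empty batches. It is not Lighthouse's sync code: `PeerId` as a `String`, `PeerBatchHistory`, and the `max_empty` threshold are all hypothetical names made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical peer identifier; a real client would use its libp2p peer ID type.
type PeerId = String;

/// Tracks how many consecutive empty batch responses each peer has returned.
#[derive(Default)]
struct PeerBatchHistory {
    consecutive_empty: HashMap<PeerId, u32>,
}

impl PeerBatchHistory {
    /// Record the outcome of a batch request sent to `peer`.
    /// A non-empty response resets the peer's counter.
    fn record(&mut self, peer: &PeerId, batch_was_empty: bool) {
        if batch_was_empty {
            *self.consecutive_empty.entry(peer.clone()).or_insert(0) += 1;
        } else {
            self.consecutive_empty.remove(peer);
        }
    }

    /// Number of consecutive empty responses seen from `peer` (0 if unknown).
    fn empty_count(&self, peer: &PeerId) -> u32 {
        self.consecutive_empty.get(peer).copied().unwrap_or(0)
    }

    /// Pick the candidate with the fewest consecutive empty responses.
    /// Peers at or above `max_empty` are skipped unless every candidate
    /// is at or above it, in which case the least-bad peer is returned.
    fn select<'a>(&self, candidates: &'a [PeerId], max_empty: u32) -> Option<&'a PeerId> {
        candidates
            .iter()
            .filter(|p| self.empty_count(p) < max_empty)
            .min_by_key(|p| self.empty_count(p))
            .or_else(|| candidates.iter().min_by_key(|p| self.empty_count(p)))
    }
}
```

With two candidates where one has just returned an empty batch, `select` returns the other one, so a pair of lagging peers cannot monopolize requests while better peers are available. Whether an empty epoch should count against a peer at all is a separate question, since empty epochs can be legitimate.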

Collaborator Author

dapplion commented Mar 8, 2025

Completed sync to head with this branch. Recapping things found in the logs

Version log

Mar 07 05:35:41.363 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-017d2b8, module: lighthouse:721
Mar 07 07:41:15.919 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-f3a43fc, module: lighthouse:721
Mar 07 08:12:59.395 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 08:13:32.115 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 14:55:12.443 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 16:31:29.535 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 16:32:54.779 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 17:40:48.953 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721
Mar 07 18:10:01.667 INFO Lighthouse started, version: Lighthouse/v7.0.0-beta.2-5b84094, module: lighthouse:721

CRIT

Missing hot state summary

The CRITs below should be fixed by 5b84094; they did not happen again after that commit

Mar 07 06:28:24.178 CRIT Beacon block processing error, error: DBError(LoadingHotHdiffBufferError("store state as diff 0x13ecacbe44114e10200653b6f690d7e65ad1cbe8e83a03369370a734ec5c7e89 3719200", 0x0d6fa05917c8c1bb2dffa7577e4276b2564859d44386380ae4e00bed40da9b92, MissingHotStateSummary { state_root: 0x0d6fa05917c8c1bb2dffa7577e4276b2564859d44386380ae4e00bed40da9b92, existing_summaries: [(0x994684a
Mar 07 06:28:38.995 CRIT Beacon block processing error, error: DBError(LoadingHotHdiffBufferError("store state as diff 0xf22972a85f15a72129f7df2b72a354ec47534f3bfe641dc0cb20ff9f7eac1b86 3719232", 0x0d6fa05917c8c1bb2dffa7577e4276b2564859d44386380ae4e00bed40da9b92, MissingHotStateSummary { state_root: 0x0d6fa05917c8c1bb2dffa7577e4276b2564859d44386380ae4e00bed40da9b92, existing_summaries: [(0x994684a
Mar 07 06:28:42.523 CRIT Beacon block processing error, error: DBError(LoadingHotStateError("get advanced 0x0824a8028b50c4c1bff6ce5fd82149d185d32a134aa3f3cbc3256736615560b8 3719228", 0xaa3f95893cdcea24bdea4149a9212cb45122b94cc93b3f18bd3c47bd8dc00b5b, HotColdDBError(MissingHotState { state_root: 0x13ecacbe44114e10200653b6f690d7e65ad1cbe8e83a03369370a734ec5c7e89, requested_by_state_summary: (0xaa3f9

Task panics

Mar 07 09:03:22.321 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}
Mar 07 09:03:22.321 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}
Mar 07 15:25:41.962 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}
Mar 07 15:25:41.962 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}

Looking at the logs around it, this seems to be a slog panic when shutting down? All other surrounding logs look normal. Both panics have the same backtrace and message

Mar 07 09:03:22.321 INFO Shutting down.., reason: Success("Received SIGHUP"), module: lighthouse:784
Mar 07 09:03:22.321 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic_handler::{{closure}}
   3: std::sys::backtrace::__rust_end_short_backtrace
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: std::sys::backtrace::__rust_begin_short_backtrace
   7: core::ops::function::FnOnce::call_once{{vtable.shim}}
   8: std::sys::pal::unix::thread::Thread::new::thread_start
   9: <unknown>
  10: <unknown>
, message: slog::Fuse Drain: Os { code: 5, kind: Uncategorized, message: "Input/output error" }, location: /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/slog-2.7.0/src/lib.rs:1944:33, module: lighthouse:598
Mar 07 09:03:22.321 CRIT Task panic. This is a bug!, advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}

ERR

Lots of errors computing light_client updates, but this will be fixed soon by @eserilev. Otherwise nothing relevant

WARN

Some WARN BlockProcessingFailure for the same reason as the CRITs above.

~1550 "State cache missed" warnings, with similar frequency in all timeframes and tested commits.
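
For tracing these cache misses, here is a rough sketch of logging state cache insertions and evictions so a missed state root can be followed through its lifetime. It is not the Lighthouse state cache: the FIFO eviction policy, the `u64` stand-in for a state root, and the names `TracedStateCache` and `CacheEvent` are assumptions made for illustration, and a real client would emit log lines rather than collect events in a `Vec`.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for a 32-byte state root.
type StateRoot = u64;

/// Everything that can happen to a state root in the cache.
#[derive(Debug, Clone, PartialEq)]
enum CacheEvent {
    Inserted(StateRoot),
    Evicted(StateRoot),
    Missed(StateRoot),
}

impl CacheEvent {
    /// The state root this event refers to.
    fn root(&self) -> StateRoot {
        match self {
            CacheEvent::Inserted(r) | CacheEvent::Evicted(r) | CacheEvent::Missed(r) => *r,
        }
    }
}

/// A bounded cache with FIFO eviction (an assumption, chosen for brevity)
/// that records every insert, eviction, and miss.
struct TracedStateCache<V> {
    capacity: usize,
    order: VecDeque<StateRoot>, // insertion order, oldest first
    entries: HashMap<StateRoot, V>,
    events: Vec<CacheEvent>, // stands in for log output
}

impl<V> TracedStateCache<V> {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            entries: HashMap::new(),
            events: Vec::new(),
        }
    }

    /// Insert a state, evicting the oldest entries if over capacity.
    fn insert(&mut self, root: StateRoot, state: V) {
        if self.entries.insert(root, state).is_none() {
            self.order.push_back(root);
            self.events.push(CacheEvent::Inserted(root));
        }
        while self.entries.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.entries.remove(&oldest);
                    self.events.push(CacheEvent::Evicted(oldest));
                }
                None => break,
            }
        }
    }

    /// Look up a state, recording a miss if it is absent.
    fn get(&mut self, root: StateRoot) -> Option<&V> {
        if !self.entries.contains_key(&root) {
            self.events.push(CacheEvent::Missed(root));
        }
        self.entries.get(&root)
    }

    /// All recorded events for `root`, in the order they happened.
    fn history(&self, root: StateRoot) -> Vec<CacheEvent> {
        self.events.iter().filter(|e| e.root() == root).cloned().collect()
    }
}
```

With a trace like this, a "State cache missed" for a given root can be matched against an earlier eviction of the same root, or against the absence of any insertion at all.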

Comment on lines 1537 to 1538
// Write the state anyway. Testing of this branch shows that states may exist in the
// cache but not be yet ever stored in the DB
Member

This seems really suss. If this is true, we broke something.
