Skip to content

Tracing cleanup #7168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 1, 2025
Merged

Tracing cleanup #7168

merged 11 commits into from
Apr 1, 2025

Conversation

macladson
Copy link
Member

@macladson macladson commented Mar 18, 2025

Issue Addressed

#7153
#7146
#7147
#7148 -> Thanks to @ackintosh

Proposed Changes

This PR does the following:

  1. Disable logging to file when using either --logfile-max-number 0 or --logfile-max-size 0. Note that disabling the log file in this way will also disable discv5 and libp2p logging.
  2. discv5 and libp2p logging will be disabled by default unless running beacon_node or boot_node. This also should fix the VC panic we were seeing.
  3. Removes log rotation and compression from libp2p and discv5 logs. It is now limited to 1 file and will rotate based on the value of the --logfile-max-size flag. We could potentially add flags specifically to control the size/number of these, however I felt a single log file was sufficient. Perhaps @AgeManning has opinions about this?
  4. Removes all dependency logging and references to dep_log.
  5. Introduces workspace filtering to file and stdout. This explicitly allows logs from members of the Lighthouse workspace, disallowing all others. It uses a proc macro which pulls the member list from cargo metadata at compile time. This might be over-engineered but my hope is that this list will not require maintenance.
  6. Unifies file and stdout JSON format. With slog, the formats were slightly different. @ThreeHrSleep worked to maintain that format difference, to ensure there was no breaking changes. If these format differences are actually problematic we can restore it, however I felt the added complexity wasn't worth it.
  7. General code improvements and cleanup.

Outstanding Issues

@macladson macladson added ready-for-review The code is ready for review UX-and-logs v7.0.0 New release c. Q1 2025 labels Mar 18, 2025
@macladson macladson requested a review from jimmygchen March 18, 2025 14:54
@macladson macladson changed the title Cleanup Tracing cleanup Mar 18, 2025
@ackintosh
Copy link
Member

@macladson I've filed a PR to fix missing libp2p(gossipsub) logs. macladson#4 Please have a look when you have time.

@macladson
Copy link
Member Author

@macladson I've filed a PR to fix missing libp2p(gossipsub) logs. macladson#4 Please have a look when you have time.

Amazing, thank you so much for noticing this! ❤️

@ackintosh
Copy link
Member

Additionally, I noticed that the --logfile-debug-level option does not affect to discv5.log or libp2p.log. Currently both logs contain TRACE logs so the file size increases very quickly.

I've created macladson#5 to fix this. 🙏

@michaelsproul michaelsproul added v7.1.0 Post-Electra release and removed v7.0.0 New release c. Q1 2025 labels Mar 23, 2025
@michaelsproul
Copy link
Member

Removed the v7.0.0 label as we aren't releasing v7.0.0 from unstable, so it won't include tracing at all

Copy link

mergify bot commented Mar 23, 2025

This pull request has merge conflicts. Could you please resolve them @macladson? 🙏

@AgeManning
Copy link
Member

Single log file is fine with me :)

@macladson
Copy link
Member Author

Additionally, I noticed that the --logfile-debug-level option does not affect to discv5.log or libp2p.log. Currently both logs contain TRACE logs so the file size increases very quickly.

I wonder if defaulting to DEBUG is actually fine here, then we can use RUST_LOG to control the verbosity of the discv5 and libp2p logs? Seems to me that --logfile-debug-level info (or higher) would yield mostly empty discv5/libp2p logfiles anyway, whereas --logfile-debug-level info is perfectly sensible for beacon.log. I agree that we should not default to TRACE though

Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @macladson

Thanks for the clean up and fixing up the issues! 🙏

I've added some comments, let me know your thoughts.

With the default debug-level and logfile-debug-level, I noticed a large amount of trace level logs in the discv5.log and libp2p.log files. @ackintosh has noticed this and created a fix for this (see his comment above). However, even debug level logs for these are quite huge, and libp2p and gossip log files were INFO level by default previously (different to beacon.log), because they're very noisy, and we had to use the RUST_LOG env var to enable deps debug logs. I think we might want to maintain this behaviour due to rate of file size growth and disk IO?

pub fn create_libp2p_discv5_tracing_layer(
base_tracing_log_path: Option<PathBuf>,
max_log_size: u64,
) -> Option<Libp2pDiscv5TracingLayer> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API for this function can be simplified - we know base_tracing_log_path when calling this function, so this can be simplified by removing Option here and only call this function if the log path is Some.

@macladson
Copy link
Member Author

Thanks for the review, I'll fix these up soon!

However, even debug level logs for these are quite huge, and libp2p and gossip log files were INFO level by default previously (different to beacon.log), because they're very noisy, and we had to use the RUST_LOG env var to enable deps debug logs. I think we might want to maintain this behaviour due to rate of file size growth and disk IO?

I think the main thing to consider is what the most useful information in these logs is. It could be the case that INFO logs generally don't provide anything worthwhile for debugging and so we have to set RUST_LOG=debug manually after discovering some issue. Seems to me this friction could be avoided if we logged at debug in the first place. I'm not convinced the old behaviour was better but I'm also not really in a position to make this determination.

As an aside about file size growth and IO, my understanding is that database read/writes dwarf disk IO to the point where logging is pretty insignificant. And the files themselves are each limited to 10MB so there's no concern about unbounded file growth.

fn on_event(&self, event: &tracing::Event<'_>, _ctx: Context<S>) {
let meta = event.metadata();
let log_level = meta.level();
let timestamp = Local::now().format("%Y-%m-%d %H:%M:%S").to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently use this approach Local::now().format() for formatting timestamp in 3 places in logging, and this format function seems relatively expensive - it adds up significantly given it gets called very frequently.

I did a brief profiling and it showed that ~11.44% of total CPU time is spent in chrono formatting functions, if I calculated correctly:

$ grep chrono lighthouse_2025-03-31-182111.collapsed | wc -l
1426
$ wc -l lighthouse_2025-03-31-182111.collapsed
12468 lighthouse_2025-03-31-182111.collapsed

Might be worth optimising a bit here, e.g. consider alternatives, e.g. StrftimeItems or the time crate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to create a separate issue to address this, given this PR already fixed a bunch of things and would be good to merge soon.

Copy link
Member

@jimmygchen jimmygchen Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an attempt using StrftimeItems and it's showing much better results (~3.58% overhead) - although still quite high just for logging timestamps.
macladson#6

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much of that impact is due to using Utc::now vs using StrftimeItems. It would be nice if we could keep local time timestamps. From memory I believe this has been a point of feedback from users in the past

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I reckon most of the impact is from switching to StrftimeItems, DateTime::format seems to be the more expensive operation (The former is recommended in thechrono repo over DateTime::format for performance-critical applications). I tried using Utc::now to avoid timezone conversion to see how much improvements we can get.

IMO local timezone is useful for users (mostly consuming stdout rather than the debug files), and debug files is mostly for us - in that case I think UTC would be less confusing - but yeah don't have a strong opinion on timezone. I think it would be useful to have a summary of logging changes from tracing somewhere, e.g. default logging level change, timestamp logging, CLI flags etc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #7232 to address this.

Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @macladson ,

Thanks for the updates. It's looking much cleaner now!

I'm keen to merge this in soon and happy to leave the remaining comments if you prefer to address them separately, since none of them were introduced from your PR?

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Apr 1, 2025
@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Apr 1, 2025
mergify bot added a commit that referenced this pull request Apr 1, 2025
Copy link

mergify bot commented Apr 1, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

mergify bot added a commit that referenced this pull request Apr 1, 2025
@jimmygchen
Copy link
Member

@mergify dequeue

Copy link

mergify bot commented Apr 1, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #7168 has been dequeued by a dequeue command

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.
If you do update this pull request, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

Copy link

mergify bot commented Apr 1, 2025

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen
Copy link
Member

Waiting for #7235 to be merged to unstable to unblock CI.

@jimmygchen
Copy link
Member

@mergify queue

Copy link

mergify bot commented Apr 1, 2025

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 4839ed6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge. UX-and-logs v7.1.0 Post-Electra release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants