
Adding monitoring on slashing metrics #166

Open
apentori opened this issue Feb 14, 2024 · 8 comments
Labels
metrics Scraping and graphing metrics

Comments

@apentori
Contributor

What

  • Add monitoring on the metric validator_monitor_slashed

Why

This issue follows the incident of 2024/01/31 on Holesky, where validators were slashed due to a BN and a VC using the same validator keys.

apentori added the metrics label on Feb 14, 2024
@apentori
Contributor Author

Monitoring already available: https://github.com/status-im/infra-hq/blob/3b0e8240fb83fd84d1a19f43f2d0de45b0ded316/ansible/roles/prometheus-slave/files/rules/nimbus.yml#L6.

I couldn't find in the VictorOps history whether the alert was correctly created.

@jakubgs
Member

jakubgs commented Feb 20, 2024

What about metrics? Has the count risen during the Holesky fleet slashing?

@apentori
Contributor Author

Yes, the metric changed over time. For example, on geth-07 between 30/01 and 02/02:

[screenshot]

We could simplify the alert condition to:

validator_monitor_slashed > 0 
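A minimal sketch of how that simplified condition could look as a rule in the linked nimbus.yml (the alert name, severity and summary below are placeholders, not the current rule):

    groups:
      - name: nimbus
        rules:
          # Placeholder rule: only the expr reflects the proposal above.
          - alert: NimbusValidatorSlashed
            expr: validator_monitor_slashed > 0
            labels:
              severity: critical
            annotations:
              summary: 'Slashed validator reported on {{ $labels.instance }}'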

@jakubgs
Member

jakubgs commented Feb 21, 2024

Okay, I see how this works, there's a `validator="total"` label which shows the whole count for the node:

validator_monitor_slashed{fleet="nimbus.prater", validator="total"}

[screenshot]

And the rest are per-validator series keyed by the beginning of the public key. But we don't need to look at those; it's easier to look at the total.

The issue with validator_monitor_slashed > 0 is that it will trigger for those hosts that already have slashed validators:

[screenshot]

Which means that alone is not useful. We need to detect an increase in that metric, not just a value above zero.

@jakubgs
Member

jakubgs commented Feb 21, 2024

For example, here we can see that the count just goes up:

validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}

[screenshot]

But here we can see it's just a spike when we use delta() or increase():

increase(validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}[1m])

[screenshot]

If we increase the time range to 15 minutes we can see more spikes:

[screenshot]

But it doesn't last, so the alert clears quickly.
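To put the behaviour above in query form (a sketch; the instance and fleet selectors are left out for brevity):

    # The raw total stays above zero forever once any validator on the node has
    # been slashed, so an alert on it would fire permanently on affected hosts.
    validator_monitor_slashed{validator="total"} > 0

    # The windowed increase is only non-zero while the 15m window still covers the
    # jump, so an alert on it resolves roughly 15 minutes after the slashing event.
    increase(validator_monitor_slashed{validator="total"}[15m]) > 0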

@jakubgs
Member

jakubgs commented Feb 21, 2024

What we use right now is:

sum by (instance,container)(delta(validator_monitor_slashed[15m])) > 0

There are at least two ways we can improve this:

sum by (instance,container)(increase(validator_monitor_slashed{validator="total"}[15m])) > 0

By using increase() and adding validator="total". But that doesn't really solve the issue of it clearing in 15 minutes.

Prometheus sends an alert every time it evaluates the rule and the condition is true, but once the condition clears the alert disappears:

Prometheus's alerting rules are good at figuring what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions.

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#sending-alert-notifications

@jakubgs
Member

jakubgs commented Feb 21, 2024

I see no clear way to make such an alert last using Alertmanager configuration:
https://prometheus.io/docs/alerting/latest/configuration/

So it seems the only way to make it last longer is doing some metric query wizardry:
https://prometheus.io/docs/prometheus/latest/querying/functions/
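One hypothetical example of such wizardry, using a subquery plus max_over_time() to hold the condition above zero for a while after the spike (the 6h/5m window sizes are arbitrary and untested):

    # Take the maximum of the 15m increase over the last 6 hours (5m resolution),
    # so the condition stays > 0 for up to 6 hours after a slashing event.
    max_over_time(
        (sum by (instance, container) (
            increase(validator_monitor_slashed{validator="total"}[15m])
        ))[6h:5m]
    ) > 0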

@jakubgs
Member

jakubgs commented Feb 21, 2024

Also, there is another option, which is not great: we could ask for an additional metric:

# HELP validator_monitor_slashed_since_start Incremented by 1 for each slashed validator during this run.
# TYPE validator_monitor_slashed_since_start gauge
validator_monitor_slashed_since_start 0.0

Not sure if this would be accepted by nimbus-eth2 developers, but the only other alternatives are:

  1. Dark magic with a Prometheus query to keep the value >0 for longer.
    • Not sure if it's doable at all, and it would probably be a horrible hack.
  2. Removing slashed validator keys from affected nodes.
    • Jacek expressed approval for running nodes with slashed validators to test that part of the code.
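For reference, if such a since-start metric were added, the alert condition itself could stay trivial (a sketch, assuming the metric behaves as described above):

    # Resets to 0 only when the node restarts, so the alert keeps firing until the
    # node is restarted or silenced, instead of clearing after the 15m window.
    validator_monitor_slashed_since_start > 0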
