-
Notifications
You must be signed in to change notification settings - Fork 621
Addition of metrics to track when self-healing started/ended #2268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: ShubhamRwt <[email protected]>
Signed-off-by: ShubhamRwt <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits.
Can you give an example output of these metrics? I'm not that familiar with dropwizard metrics. Does the dropwizard counter emit a timestamp when something is added to the internal map? I was curious how this gives us the time and type of anomaly healing being triggered.
Also, can you fix the typos in the PR description?
@@ -141,6 +146,14 @@ protected boolean removeEldestEntry(Map.Entry<String, AnomalyState> eldest) { | |||
Timer timer = dropwizardMetricRegistry.timer( | |||
MetricRegistry.name(ANOMALY_DETECTOR_SENSOR, String.format("%s-detect-to-fix-complete-timer", anomalyType.toString().toLowerCase()))); | |||
_anomalyDetectToFixCompleteTimer.put(anomalyType, timer); | |||
|
|||
Counter counter = dropwizardMetricRegistry.counter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these counter
and counter1
variables are not actually used anywhere other than the map.put
call, can you just in-line them?
Same with timer
variable above?
@@ -105,6 +108,8 @@ protected boolean removeEldestEntry(Map.Entry<String, AnomalyState> eldest) { | |||
_ongoingAnomalyDurationSumForAverageMs = 0; | |||
_numSelfHealingStarted = new AtomicLong(0L); | |||
_numSelfHealingFailedToStart = new AtomicLong(0L); | |||
_numSelfHealingStartedPerAnomaly = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to do anything in the refreshMetrics
method to reset these maps? There seems to be a numSelfHealingStarted()
call in there as part of creating a new AnomalyMetrics
instance.
I don't know if that is relevant but you should check if these maps need to be reset.
Looks like you have CI failures in the
|
Signed-off-by: ShubhamRwt <[email protected]>
I had a look but it looks to me we are adding kind of "historical" anomaly metrics, so for example how many anomalies (per type) started and ended. But what about what's going on in a specific time? How many anomaly fixes are actually running and on which detected anomalies? |
To know if self healing still running, we can have the sum(self healing started) minus sum(self healing ended) that can tell if the self healing is still in progress for a particular anomaly. |
@CCisGG Hi, If you get some time, can you have a look at this. It would be great to know your views on this. Thankyou |
I'm pretty out of bandwidth recently. @mhratson @allenxwang could you please help to take a look? |
Summary
New metrics:
Categorization
This PR resolves #2266.