Skip to content

Addition of metrics to track when self-healing started/ended #2268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ShubhamRwt
Copy link
Contributor

@ShubhamRwt ShubhamRwt commented Apr 22, 2025

Summary

  1. Why: We earlier had the metrics in self healing for knowing the timestamp at which the self-healing started but it was not classified based on the anomalies. We also didn't had any metrics that can tell us when self healing was ended
  2. What: This PR adds new metrics regarding self-healing which can show the no. of self healing started per anomaly as well as the timestamp when healing started and also we have metrics which can now tell us the self healing ended timestamp per anomaly

New metrics:

kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_broker_failure_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_disk_failure_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_goal_violation_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_maintenance_event_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_metric_anomaly_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended_for_topic_anomaly_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_broker_failure_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_disk_failure_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_goal_violation_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_maintenance_event_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_metric_anomaly_count
kafka_cruisecontrol_anomalydetector_num_of_self_healing_started_for_topic_anomaly_count

Categorization

  • new feature

This PR resolves #2266.

Signed-off-by: ShubhamRwt <[email protected]>
Signed-off-by: ShubhamRwt <[email protected]>
@ShubhamRwt
Copy link
Contributor Author

CC @ppatierno @tomncooper

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits.

Can you give an example output of these metrics? I'm not that familiar with dropwizard metrics. Does the dropwizard counter emit a timestamp when something is added to the internal map? I was curious how this gives us the time and type of anomaly healing being triggered.

Also, can you fix the typos in the PR description?

@@ -141,6 +146,14 @@ protected boolean removeEldestEntry(Map.Entry<String, AnomalyState> eldest) {
Timer timer = dropwizardMetricRegistry.timer(
MetricRegistry.name(ANOMALY_DETECTOR_SENSOR, String.format("%s-detect-to-fix-complete-timer", anomalyType.toString().toLowerCase())));
_anomalyDetectToFixCompleteTimer.put(anomalyType, timer);

Counter counter = dropwizardMetricRegistry.counter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these counter and counter1 variables are not actually used anywhere other than the map.put call, can you just in-line them?

Same with timer variable above?

@@ -105,6 +108,8 @@ protected boolean removeEldestEntry(Map.Entry<String, AnomalyState> eldest) {
_ongoingAnomalyDurationSumForAverageMs = 0;
_numSelfHealingStarted = new AtomicLong(0L);
_numSelfHealingFailedToStart = new AtomicLong(0L);
_numSelfHealingStartedPerAnomaly = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to do anything in the refreshMetrics method to reset these maps? There seems to be a numSelfHealingStarted() call in there as part of creating a new AnomalyMetrics instance.

I don't know if that is relevant but you should check if these maps need to be reset.

@tomncooper
Copy link
Contributor

Looks like you have CI failures in the AnomalyDetectorManagerTest, did they pass locally for you?

com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest > testFixDiskFailure FAILED
    java.lang.AssertionError: On mock #1 (zero indexed): 
      Expectation failure on verify:
        ScheduledExecutorService.schedule(isA(java.lang.Runnable), 0 (long), MILLISECONDS): expected: 1, actual: 0
        at org.easymock.EasyMock.getAssertionError(EasyMock.java:2230)
        at org.easymock.EasyMock.verify(EasyMock.java:2058)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixAnomaly(AnomalyDetectorManagerTest.java:507)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixDiskFailure(AnomalyDetectorManagerTest.java:238)

com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest > testShutdown PASSED

com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest > testFixSlowBroker FAILED
    java.lang.AssertionError: On mock #1 (zero indexed): 
      Expectation failure on verify:
        ScheduledExecutorService.schedule(isA(java.lang.Runnable), 0 (long), MILLISECONDS): expected: 1, actual: 0
        at org.easymock.EasyMock.getAssertionError(EasyMock.java:2230)
        at org.easymock.EasyMock.verify(EasyMock.java:2058)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixAnomaly(AnomalyDetectorManagerTest.java:507)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixSlowBroker(AnomalyDetectorManagerTest.java:244)

com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest > testFixTopicAnomaly FAILED
    java.lang.AssertionError: On mock #1 (zero indexed): 
      Expectation failure on verify:
        ScheduledExecutorService.schedule(isA(java.lang.Runnable), 0 (long), MILLISECONDS): expected: 1, actual: 0
        at org.easymock.EasyMock.getAssertionError(EasyMock.java:2230)
        at org.easymock.EasyMock.verify(EasyMock.java:2058)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixAnomaly(AnomalyDetectorManagerTest.java:507)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixTopicAnomaly(AnomalyDetectorManagerTest.java:250)

com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest > testFixGoalViolation FAILED
    java.lang.AssertionError: On mock #1 (zero indexed): 
      Expectation failure on verify:
        ScheduledExecutorService.schedule(isA(java.lang.Runnable), 0 (long), MILLISECONDS): expected: 1, actual: 0
        at org.easymock.EasyMock.getAssertionError(EasyMock.java:2230)
        at org.easymock.EasyMock.verify(EasyMock.java:2058)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixAnomaly(AnomalyDetectorManagerTest.java:507)
        at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManagerTest.testFixGoalViolation(AnomalyDetectorManagerTest.java:232)

@ppatierno
Copy link
Contributor

I had a look but it looks to me we are adding kind of "historical" anomaly metrics, so for example how many anomalies (per type) started and ended. But what about what's going on in a specific time? How many anomaly fixes are actually running and on which detected anomalies?

@ShubhamRwt
Copy link
Contributor Author

To know if self healing still running, we can have the sum(self healing started) minus sum(self healing ended) that can tell if the self healing is still in progress for a particular anomaly.

@ShubhamRwt
Copy link
Contributor Author

ShubhamRwt commented May 12, 2025

@CCisGG Hi, If you get some time, can you have a look at this. It would be great to know your views on this. Thankyou

@CCisGG
Copy link
Contributor

CCisGG commented May 12, 2025

I'm pretty out of bandwidth recently. @mhratson @allenxwang could you please help to take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Addition of metrics to track when self-healing started/ended
4 participants