Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the situation where metric collection stops when NPE occours #710

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

codings-dan
Copy link

We use Prometheus to monitor Alluxio system metrics. When some metrics are empty for various reasons, the system no longer monitors the metrics. We can catch this exception, so as not to affect the collection and monitoring of other metrics.

Signed-off-by: Yaolong Liu <[email protected]>
@fstab
Copy link
Member

fstab commented Oct 18, 2021

Thanks a lot for the PR.

My understanding is that the root cause for the Exception is a bug in how Alluxio uses Dropwizard, resulting in invalid metrics.

The question is: What is the best way to deal with bugs like these? The current implementation is making the scrape fail. Your PR would change this to silently ignoring the invalid metric and making the scrape successful for other metrics.

Both approaches have their drawbacks: Dealing with partial metrics is difficult, especially when it's silently missing, and having no metrics at all because of a bug in Alluxio is difficult as well.

What do you think about implementing the SampleNameFilter for the DropwizardExports? This should work by overriding

public List<MetricFamilySamples> collect(Predicate<String> sampleNameFilter);

Then users could explicitly skip metrics by configuring that in the exporter (HTTPServer and MetricsServlet both support the filter). That way, there is a workaround for the bug in Alluxio without silently dropping metrics.

@codings-dan
Copy link
Author

Thanks Reply,

My understanding is that, we can implement the SampleNameFilter , and then call collect(Predicate<String> sampleNameFilter) to bypass this bug explictly, I implemented the code logic, please help me review the code to see if it exists problem,thx.

In addition, I think the null pointer exception is a hidden danger of many systems, and it is not limited to Alluxio. We provide the SampleNameFilter to solve the problem, but the price is that many systems that rely on Prometheus for monitoring need to be modified to call Prometheus collect() method., that is not very user-friendly. So I think that the silent ignorance in this pr is meaningful. This modification only requires users to update their own Prometheus dependent version to solve the problem. As for the problem of silently ignoring metrics, we can alert the user when an exception occurs, such as the way of printing exception stack information in this pr, or we can continue to discuss other ways.

Looking forward for your response, thx!

@fstab
Copy link
Member

fstab commented Oct 18, 2021

The default implementation first collects all metrics, and then applies the filter. So if the NullPointerException happens during metric collection the default implementation will not work. What you need is to apply the filter first, before a sample is collected.

This can be achieved with an explicit implementation of

DropwizardExports.collect(Predicate<String> sampleNameFilter);

The goal is to prevent filtered metrics from being collected in the first place. This will require a bit of refactoring, because you need to pass the filter down to where the sample name is created and apply the filter. It's more work than just catching and ignoring the NullPointerException, but it sounds like a cleaner solution. See ClassLoadingExports for a simple example of a collect(Predicate<String> nameFilter) method.

@codings-dan
Copy link
Author

I think I probably understood your idea, but there may be a difference. We can't rewrite collect(Predicate<String> sampleNameFilter), because the situation of reporting a null pointer exception requires a method similar to collect(Predicate<T extends Metric> sampleNameFilter) , this may need to change the code of Collector.class. Or, did I not understand your suggestion
Looking forward for your response, thx!

@fstab
Copy link
Member

fstab commented Oct 19, 2021

I was thinking you add a method like this to DropwizardExports:

@Override
public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
    // TODO: Collect only metrics where nameFilter.test(name) is true
}

With such a method, DropwizardExports will be capable of filtering metrics by name before they are collected. Then you could use that to exclude the erroneous Alluxio metrics.

@codings-dan
Copy link
Author

codings-dan commented Oct 22, 2021

Sorry for the late reply,
This is indeed a solution to the problem, but if we do not know in advance which metric cause a null pointer exception, using this solution may need to modify the code again to exclude the metric, and then recompile and run the system to solve the problem. Can we choose one of the following two solutions to solve the problem.

  1. Can we combine these two solutions to allow users to choose whether to silently ignore various abnormal metrics?
  2. Implement a method
public <T> List<MetricFamilySamples> collect(Predicate<T ? extends Metric> exceptionFilter) {
    // TODO: Collect only metrics where exceptionFilter.test(name) is true
}

The user can call this method like this

Predicate<? extends Metric> exceptionFilter = (metric) -> {
      try {
       Object o =  metric.getValue()
         return true;
      } catch (Exception e) {
        return false;
      }
    }

In this way, the corresponding metrics can be ignored.
Looking forward for your response, thx!

@fstab
Copy link
Member

fstab commented Oct 24, 2021

I don't think it's a good idea to have an "implicitly ignore all erroneous metrics" option. It would be better to ignore them explicitly. If the NullPointerException does not provide enough information for the user to know what went wrong with which metric, maybe we can just throw a NullPointerException with an explicit message telling the user what was null and which metric caused that. Hopefully this will trigger the user to fix the NPE, but if that's not possible because it's in a 3rd party product then the user can ignore that metric explicitly, and at least know what's missing.

@codings-dan
Copy link
Author

So, do you think we can throw exception information directly in the collect method?

@dhoard
Copy link
Collaborator

dhoard commented Jan 7, 2022

@codings-dan I feel a combination of @fstab 's proposals would be ideal.

  1. Implement the SampleNameFilter...
@Override
public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
    // TODO: Collect only metrics where nameFilter.test(name) is true
}
  1. catch/throw an unchecked RuntimeException when performing the collect...

Note: Using a RuntimeException since the implementation of the Metric could potentially be performing some work, etc... (really depends on the implementation of the Metric, which we don't know from this code's point of view.)


Possible/example code (Not tested)

import static io.prometheus.client.SampleNameFilter.ALLOW_ALL;
    @Override
    public List<MetricFamilySamples> collect() {
        return collect(null);
    }

    @Override
    public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
        nameFilter = nameFilter == null ? ALLOW_ALL : nameFilter;
        Map<String, MetricFamilySamples> mfSamplesMap = new HashMap<String, MetricFamilySamples>();
        String type = null;
        String name = null;
        try {
            type = "Gauge";
            for (SortedMap.Entry<String, Gauge> entry : registry.getGauges(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromGauge(name, entry.getValue()));
                }
            }
            type = "Counter";
            for (SortedMap.Entry<String, Counter> entry : registry.getCounters(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromCounter(name, entry.getValue()));
                }
            }
            type = "Histogram";
            for (SortedMap.Entry<String, Histogram> entry : registry.getHistograms(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromHistogram(name, entry.getValue()));
                }
            }
            type = "Timer";
            for (SortedMap.Entry<String, Timer> entry : registry.getTimers(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromTimer(name, entry.getValue()));
                }
            }
            type = "Meter";
            for (SortedMap.Entry<String, Meter> entry : registry.getMeters(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromMeter(name, entry.getValue()));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Exception processing " + type + " " + name, e);
        }

        return new ArrayList<MetricFamilySamples>(mfSamplesMap.values());
    }

@codings-dan
Copy link
Author

@dhoard Thanks for reviewing the code and making suggestions, I'll try to fix this again based on your suggestion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants