Fix the situation where metric collection stops when NPE occurs #710

codings-dan · 2021-10-18T03:52:56Z

We use Prometheus to monitor Alluxio system metrics. When some metrics are empty (null) for various reasons, collection throws a NullPointerException and the system stops monitoring all metrics. We can catch this exception so that the collection and monitoring of the other metrics is not affected.
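
For readers following the thread, the catch-and-continue idea described above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the PR's actual diff: the class `SkipFailingMetrics`, its `collect` method, and the map-of-suppliers registry are hypothetical stand-ins for DropwizardExports and the Dropwizard registry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a collector; NOT the actual DropwizardExports code.
final class SkipFailingMetrics {

    // Reads every metric. A metric whose read throws is reported on stderr
    // and skipped, so the remaining metrics are still collected.
    static Map<String, Double> collect(Map<String, Supplier<Double>> registry) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Double>> entry : registry.entrySet()) {
            try {
                // Unboxing a null Double throws the NullPointerException
                // that this thread is about.
                double value = entry.getValue().get();
                samples.put(entry.getKey(), value);
            } catch (RuntimeException e) {
                System.err.println("Skipping metric " + entry.getKey() + ": " + e);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        Map<String, Supplier<Double>> registry = new LinkedHashMap<>();
        registry.put("good_metric", () -> 1.0);
        registry.put("broken_metric", () -> null); // simulates an invalid metric
        System.out.println(collect(registry).keySet()); // prints [good_metric]
    }
}
```

The trade-off is visible in the output: the scrape succeeds, but the broken metric disappears from the result and is only mentioned in the log.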

Signed-off-by: Yaolong Liu <[email protected]>

fstab · 2021-10-18T10:55:10Z

Thanks a lot for the PR.

My understanding is that the root cause for the Exception is a bug in how Alluxio uses Dropwizard, resulting in invalid metrics.

The question is: What is the best way to deal with bugs like these? The current implementation is making the scrape fail. Your PR would change this to silently ignoring the invalid metric and making the scrape successful for other metrics.

Both approaches have their drawbacks: Dealing with partial metrics is difficult, especially when it's silently missing, and having no metrics at all because of a bug in Alluxio is difficult as well.

What do you think about implementing the SampleNameFilter for the DropwizardExports? This should work by overriding

public List<MetricFamilySamples> collect(Predicate<String> sampleNameFilter);

Then users could explicitly skip metrics by configuring that in the exporter (HTTPServer and MetricsServlet both support the filter). That way, there is a workaround for the bug in Alluxio without silently dropping metrics.

Signed-off-by: Yaolong Liu <[email protected]>

codings-dan · 2021-10-18T15:45:53Z

Thanks for the reply.

My understanding is that we can implement the SampleNameFilter and then call collect(Predicate<String> sampleNameFilter) to bypass this bug explicitly. I have implemented this logic; please help me review the code to see whether there are any problems. Thanks.

In addition, I think null pointer exceptions are a hidden danger in many systems, not only Alluxio. Providing the SampleNameFilter solves the problem, but at a price: many systems that rely on Prometheus for monitoring would have to be modified to call the filtered collect() method, which is not very user-friendly. So I think silently ignoring the metric, as this PR does, is still worthwhile: users only need to update their Prometheus dependency version to get the fix. As for the concern about silently ignoring metrics, we can alert the user when an exception occurs, for example by printing the exception stack trace as this PR does, or we can continue to discuss other options.

Looking forward to your response, thanks!

fstab · 2021-10-18T19:24:39Z

The default implementation first collects all metrics, and then applies the filter. So if the NullPointerException happens during metric collection the default implementation will not work. What you need is to apply the filter first, before a sample is collected.

This can be achieved with an explicit implementation of

DropwizardExports.collect(Predicate<String> sampleNameFilter);

The goal is to prevent filtered metrics from being collected in the first place. This will require a bit of refactoring, because you need to pass the filter down to where the sample name is created and apply the filter. It's more work than just catching and ignoring the NullPointerException, but it sounds like a cleaner solution. See ClassLoadingExports for a simple example of a collect(Predicate<String> nameFilter) method.
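
To make the ordering issue concrete, here is a small JDK-only sketch. It does not use the real Collector API: the class `FilterOrder`, both method names, and the map-of-suppliers registry are assumptions made up for illustration. It contrasts collecting everything and filtering afterwards with applying the name filter before a value is read.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical model: metric name -> value reader. NOT the real Collector API.
final class FilterOrder {

    // "Collect everything, then filter": a broken metric throws while being
    // read, before the filter ever gets a chance to exclude it.
    static List<String> collectThenFilter(Map<String, Supplier<Double>> registry,
                                          Predicate<String> nameFilter) {
        Map<String, Double> all = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Double>> e : registry.entrySet()) {
            double value = e.getValue().get(); // NPE here if the value is null
            all.put(e.getKey(), value);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : all.entrySet()) {
            if (nameFilter.test(e.getKey())) {
                result.add(e.getKey() + " " + e.getValue());
            }
        }
        return result;
    }

    // "Filter, then collect": an excluded metric is never read at all.
    static List<String> filterThenCollect(Map<String, Supplier<Double>> registry,
                                          Predicate<String> nameFilter) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Supplier<Double>> e : registry.entrySet()) {
            if (!nameFilter.test(e.getKey())) {
                continue; // skipped before the value reader is invoked
            }
            double value = e.getValue().get();
            result.add(e.getKey() + " " + value);
        }
        return result;
    }
}
```

With a registry that contains one healthy and one null-valued metric, and a filter that excludes the null-valued one by name, only the second method returns a result; the first still fails with a NullPointerException.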

codings-dan · 2021-10-19T11:59:16Z

I think I mostly understand your idea, but there may be a gap. We can't solve this by overriding collect(Predicate<String> sampleNameFilter), because detecting a metric that throws a null pointer exception requires a method more like collect(Predicate<? extends Metric> sampleNameFilter), and that may require changing the code of Collector.class. Or did I misunderstand your suggestion?
Looking forward to your response, thanks!

fstab · 2021-10-19T21:18:51Z

I was thinking you add a method like this to DropwizardExports:

@Override
public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
    // TODO: Collect only metrics where nameFilter.test(name) is true
}

With such a method, DropwizardExports will be capable of filtering metrics by name before they are collected. Then you could use that to exclude the erroneous Alluxio metrics.

codings-dan · 2021-10-22T12:17:11Z

Sorry for the late reply.
This is indeed a solution to the problem, but if we do not know in advance which metric causes the null pointer exception, we would have to modify the code again to exclude that metric, and then recompile and restart the system. Could we choose one of the following two options instead?

1. Combine the two solutions and let users choose whether to silently ignore abnormal metrics.
2. Implement a method like this:

public List<MetricFamilySamples> collect(Predicate<Metric> exceptionFilter) {
    // TODO: Collect only metrics where exceptionFilter.test(metric) is true
}

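The user can call this method like this:

Predicate<Metric> exceptionFilter = (metric) -> {
    try {
        // Metric is a marker interface; only Gauge exposes getValue().
        if (metric instanceof Gauge) {
            ((Gauge<?>) metric).getValue();
        }
        return true;
    } catch (Exception e) {
        // Reading the value failed, so exclude this metric.
        return false;
    }
};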

In this way, the corresponding metrics can be ignored.
Looking forward for your response, thx!

fstab · 2021-10-24T10:05:29Z

I don't think it's a good idea to have an "implicitly ignore all erroneous metrics" option. It would be better to ignore them explicitly. If the NullPointerException does not provide enough information for the user to know what went wrong with which metric, maybe we can just throw a NullPointerException with an explicit message telling the user what was null and which metric caused that. Hopefully this will trigger the user to fix the NPE, but if that's not possible because it's in a 3rd party product then the user can ignore that metric explicitly, and at least know what's missing.
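
The "explicit message" idea could look roughly like the sketch below. It is a JDK-only illustration under stated assumptions: the class `DescriptiveFailure`, its `collect` method, and the map-of-suppliers stand-in for gauges are hypothetical and are not part of client_java or Dropwizard.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for gauge collection; NOT the real DropwizardExports.
final class DescriptiveFailure {

    // Still fails the whole collection on a null value, but the exception
    // message names the metric so the user knows what to fix or exclude.
    static Map<String, Double> collect(Map<String, Supplier<Object>> gauges) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Object>> entry : gauges.entrySet()) {
            String name = entry.getKey();
            Object value = entry.getValue().get();
            if (value == null) {
                throw new NullPointerException(
                        "Gauge '" + name + "' returned a null value");
            }
            if (!(value instanceof Number)) {
                throw new IllegalStateException(
                        "Gauge '" + name + "' returned a non-numeric value: " + value);
            }
            samples.put(name, ((Number) value).doubleValue());
        }
        return samples;
    }
}
```

Unlike catch-and-skip, nothing is dropped silently: the scrape still fails, but the failure tells the user which metric to repair or to exclude explicitly with a name filter.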

codings-dan · 2021-10-26T02:02:42Z

So, do you think we can throw exception information directly in the collect method?

dhoard · 2022-01-07T14:22:00Z

@codings-dan I feel a combination of @fstab's proposals would be ideal.

Implement the SampleNameFilter...

@Override
public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
    // TODO: Collect only metrics where nameFilter.test(name) is true
}

Catch any exception thrown while performing the collect and rethrow it as an unchecked RuntimeException...

Note: Using a RuntimeException because the implementation of the Metric could potentially be performing some work that fails in other ways (it really depends on the implementation of the Metric, which we don't know from this code's point of view).

Possible example code (not tested):

import static io.prometheus.client.SampleNameFilter.ALLOW_ALL;

    @Override
    public List<MetricFamilySamples> collect() {
        return collect(null);
    }

    @Override
    public List<MetricFamilySamples> collect(Predicate<String> nameFilter) {
        nameFilter = nameFilter == null ? ALLOW_ALL : nameFilter;
        Map<String, MetricFamilySamples> mfSamplesMap = new HashMap<String, MetricFamilySamples>();
        String type = null;
        String name = null;
        try {
            type = "Gauge";
            for (SortedMap.Entry<String, Gauge> entry : registry.getGauges(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromGauge(name, entry.getValue()));
                }
            }
            type = "Counter";
            for (SortedMap.Entry<String, Counter> entry : registry.getCounters(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromCounter(name, entry.getValue()));
                }
            }
            type = "Histogram";
            for (SortedMap.Entry<String, Histogram> entry : registry.getHistograms(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromHistogram(name, entry.getValue()));
                }
            }
            type = "Timer";
            for (SortedMap.Entry<String, Timer> entry : registry.getTimers(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromTimer(name, entry.getValue()));
                }
            }
            type = "Meter";
            for (SortedMap.Entry<String, Meter> entry : registry.getMeters(metricFilter).entrySet()) {
                name = entry.getKey();
                if (nameFilter.test(name)) {
                    addToMap(mfSamplesMap, fromMeter(name, entry.getValue()));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Exception processing " + type + " " + name, e);
        }

        return new ArrayList<MetricFamilySamples>(mfSamplesMap.values());
    }

codings-dan · 2022-01-10T11:16:33Z

@dhoard Thanks for reviewing the code and making suggestions. I'll try to fix this again based on your suggestions.

fix nullpointer exception

14337f7

Signed-off-by: Yaolong Liu <[email protected]>

codings-dan force-pushed the npe branch from 5e34615 to 14337f7 · October 18, 2021 03:58

add sampleNameFilter to collect metric

98293b1

Signed-off-by: Yaolong Liu <[email protected]>

codings-dan force-pushed the npe branch from ed92b5f to 98293b1 · October 18, 2021 15:40

fstab force-pushed the master branch from 3d6f699 to c83877a · January 30, 2022 21:12
