Improve Exception handling messages and stat counter #3511

dhrubo-os · 2025-02-06T21:38:07Z

Summary

ML Commons primarily relies on MLException and its derived classes:

ExecuteException
MLLimitExceededException
MLResourceNotFoundException
MLValidationException

These exceptions define log severity (reference) and contribute to stats updates in the following places:

Stats update code reference 1
Stats update code reference 2

However, these classes do not cover all exceptions used in ML Commons. The project also frequently throws OpenSearchStatusException.

This creates an inconsistency: "not found" errors raised as OpenSearchStatusException are included in failure stats, even though they should not count as failures.

Problem Statement

1. Incomplete Exception Categorization

- MLException is well-structured for handling ML-specific failures, but OpenSearchStatusException is used without proper categorization.
- This leads to misclassification of errors, particularly 404 Not Found, which should not contribute to failure metrics.

2. Inconsistent Logging Severity

- ML Commons chooses log severity based on the MLException type, but errors raised as OpenSearchStatusException do not follow the same severity rules.
- This results in inconsistent error reporting and makes debugging harder.

3. Misclassified Stats Updates

- The system updates failure stats when an exception occurs, but some errors raised as OpenSearchStatusException should be excluded.
- Example: a 404 Not Found from OpenSearchStatusException should not be counted as a failure, but it currently is.

Proposed Solution

Enhance MLExceptionUtils to:

1. Map OpenSearchStatusException properly based on HTTP status codes:
   - 404 Not Found → should not update failure stats.
   - 500 Internal Server Error → should still be counted as a failure.
2. Ensure consistent logging levels:
   - Align OpenSearchStatusException with the MLException log severity rules.
3. Route all OpenSearchStatusException handling through MLExceptionUtils:
   - Centralize exception handling for unified processing.
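To make the proposal concrete, here is a minimal sketch of the status-based classification. Every name in it (StatusExceptionPolicySketch, Category, Severity, failureCount) is hypothetical and does not exist in ML Commons. It works on a plain int status code so it runs without OpenSearch on the classpath; in real code the value would presumably come from the exception itself, e.g. e.status().getStatus(). Mapping excluded errors to WARN and counted errors to ERROR is an assumption about what "align with MLException log severity rules" would mean, and only the two status codes named above are covered.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these names exist in ML Commons.
final class StatusExceptionPolicySketch {

    /** Stand-in for a logger level. */
    enum Severity { WARN, ERROR }

    /** How an error should be logged and whether it updates failure stats. */
    enum Category {
        EXPECTED(Severity.WARN, false),  // e.g. 404: the caller asked for something that is absent
        FAILURE(Severity.ERROR, true);   // e.g. 500: a genuine system failure

        final Severity severity;
        final boolean countsAsFailure;

        Category(Severity severity, boolean countsAsFailure) {
            this.severity = severity;
            this.countsAsFailure = countsAsFailure;
        }
    }

    // Stand-in for the real failure stat counter.
    static final AtomicLong failureCount = new AtomicLong();

    /**
     * Classify an HTTP status code. Only 404 is excluded, because that is the
     * only case the issue names; whether other 4xx codes should be excluded
     * too is left open here.
     */
    static Category classify(int httpStatus) {
        if (httpStatus == 404) {
            return Category.EXPECTED;
        }
        return Category.FAILURE;
    }

    /** Single entry point: classify, then update the counter only for real failures. */
    static Category handle(int httpStatus) {
        Category category = classify(httpStatus);
        if (category.countsAsFailure) {
            failureCount.incrementAndGet();
        }
        return category;
    }

    public static void main(String[] args) {
        handle(404); // logged at WARN, not counted
        handle(500); // logged at ERROR, counted
        System.out.println("failures counted: " + failureCount.get()); // prints "failures counted: 1"
    }
}
```

Keeping the severity and the stats decision on one enum means a call site cannot log an error as expected while still counting it as a failure, which is the kind of drift the centralization step is meant to prevent.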

Expected Impact

✅ More accurate failure statistics: No longer miscounting expected errors (e.g., 404) as system failures.
✅ Consistent log severity levels: Easier debugging and monitoring.
✅ Unified exception handling: Clearer classification between OpenSearch errors and ML Commons-specific errors.

This will improve system reliability, ensure consistent failure tracking, and reduce unnecessary alerts in logs.

Would love feedback before moving forward with implementation!


pyek-bot · 2025-02-11T19:06:43Z

Hi, I can take a look at this

dhrubo-os added the bug and untriaged labels Feb 6, 2025

Zhangxunmt removed the untriaged label Feb 11, 2025

Zhangxunmt assigned dhrubo-os Feb 11, 2025

dhrubo-os assigned pyek-bot and unassigned dhrubo-os Feb 11, 2025
