Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Exception handling messages and stat counter #3511

Open
dhrubo-os opened this issue Feb 6, 2025 · 1 comment
Open

Improve Exception handling messages and stat counter #3511

dhrubo-os opened this issue Feb 6, 2025 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@dhrubo-os
Copy link
Collaborator

Summary

ML Commons primarily relies on MLException and its derived classes:

ExecuteException
MLLimitExceededException
MLResourceNotFoundException
MLValidationException

These exceptions define log severity (reference) and contribute to stats updates in the following places:

Stats update code reference 1
Stats update code reference 2

However, this does not cover all exceptions used in ML Commons. The project also frequently uses OpensearchStatusException in multiple places.

This creates an inconsistency, where "not found" exceptions from OpensearchStatusException get incorrectly included in failure stats, even though they should not count as failures.

Problem Statement
Incomplete Exception Categorization

MLException is well-structured for handling ML-specific failures, but OpensearchStatusException is used without proper categorization.
This leads to misclassification of errors, particularly 404 Not Found, which should not contribute to failure metrics.
Inconsistent Logging Severity

ML Commons logs severity based on MLException, but errors from OpensearchStatusException do not follow the same log severity rules.
This results in inconsistent error reporting and debugging challenges.
Misclassified Stats Updates

The system updates failure stats when an MLException occurs, but some errors from OpensearchStatusException should be excluded.
Example: A 404 Not Found from OpensearchStatusException should not be counted as a failure, but it currently is.
Proposed Solution
Enhance MLExceptionUtils to:

Map OpensearchStatusException properly based on HTTP status codes:
404 Not Found → Should not update failure stats.
500 Internal Server Error → Should still be counted as a failure.
Ensure consistent logging levels:
Align OpensearchStatusException with MLException log severity rules.
Refactor all OpensearchStatusException handling through MLExceptionUtils:
Centralize exception handling for unified processing.
Expected Impact
✅ More accurate failure statistics: No longer miscounting expected errors (e.g., 404) as system failures.
✅ Consistent log severity levels: Easier debugging and monitoring.
✅ Unified exception handling: Clearer classification between OpenSearch errors and ML Commons-specific errors.

This will improve system reliability, ensure consistent failure tracking, and reduce unnecessary alerts in logs.

Would love feedback before moving forward with implementation!

@dhrubo-os dhrubo-os added bug Something isn't working untriaged labels Feb 6, 2025
@pyek-bot
Copy link
Contributor

Hi, I can take a look at this

@dhrubo-os dhrubo-os assigned pyek-bot and unassigned dhrubo-os Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants