fix(BA-2370): Handle Storage Proxy connection error #5868
base: main
Pull Request Overview
This PR adds error handling for storage proxy connection failures by introducing a new exception class and catching connection errors during volume retrieval operations.
- Introduces a StorageProxyConnectionError exception class for handling storage proxy connectivity issues
- Updates the storage proxy manager client to catch and re-raise connection errors as the new exception type
- Modifies the session manager to gracefully handle connection failures by logging warnings and returning empty results
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/ai/backend/manager/errors/storage.py | Adds new StorageProxyConnectionError exception class |
src/ai/backend/manager/clients/storage_proxy/session_manager.py | Imports new exception and handles connection errors in volume fetching |
src/ai/backend/manager/clients/storage_proxy/manager_facing_client.py | Catches aiohttp.ClientConnectionError and re-raises as StorageProxyConnectionError |
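The catch-and-re-raise step described for manager_facing_client.py might look roughly like the sketch below. It is not the PR's code: TransportConnectionError is a hypothetical stand-in for aiohttp.ClientConnectionError, the exception class is a bare stand-in for the one added in errors/storage.py, and the helper name get_volumes_translated is invented for illustration. It assumes only that the client awaits a request callable and wants the original error preserved as the cause.

```python
import asyncio


class TransportConnectionError(Exception):
    """Hypothetical stand-in for aiohttp.ClientConnectionError."""


class StorageProxyConnectionError(Exception):
    """Bare stand-in for the exception class this PR introduces."""


async def get_volumes_translated(send_request):
    # Await the low-level request and translate a transport failure
    # into the domain-level exception, keeping the original as __cause__.
    try:
        return await send_request()
    except TransportConnectionError as e:
        raise StorageProxyConnectionError(f"storage proxy unreachable: {e}") from e
```

Chaining with `from e` keeps the underlying transport error visible in tracebacks while callers only need to know about the domain exception.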
)


class StorageProxyConnectionError(BackendAIError, web.HTTPClientError):
The HTTP status code should be 503 Service Unavailable (web.HTTPServiceUnavailable) instead of 400 Client Error (web.HTTPClientError) since connection failures indicate the service is temporarily unavailable, not a client request error.
Suggested change:
- class StorageProxyConnectionError(BackendAIError, web.HTTPClientError):
+ class StorageProxyConnectionError(BackendAIError, web.HTTPServiceUnavailable):
It seems like it is not a ClientError because it is a Server Connection Error.
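To make the distinction raised in this thread concrete, here is a minimal stdlib-only sketch of how the choice of base class decides whether the failure is reported as a client fault (4xx) or a server-side one (5xx). The class BackendAIErrorSketch, its status attribute, and is_client_fault are hypothetical stand-ins, not the project's BackendAIError or aiohttp's web exception classes.

```python
from http import HTTPStatus


class BackendAIErrorSketch(Exception):
    # Hypothetical base: each error carries the HTTP status it maps to.
    status: HTTPStatus = HTTPStatus.INTERNAL_SERVER_ERROR


class StorageProxyConnectionError(BackendAIErrorSketch):
    # The upstream storage proxy could not be reached, so the failure
    # is reported as 503 Service Unavailable rather than a 4xx status.
    status = HTTPStatus.SERVICE_UNAVAILABLE


def is_client_fault(exc: BackendAIErrorSketch) -> bool:
    # 4xx means the caller sent a bad request; 5xx means the server side failed.
    return 400 <= exc.status < 500
```

With this mapping, is_client_fault(StorageProxyConnectionError()) is False, which matches the reviewers' point that an unreachable storage proxy is not the requester's error.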
try:
    reply = await client.get_volumes()
except StorageProxyConnectionError:
    log.warning("Failed to connect to storage proxy (name: {})", proxy_name)
    return []
This doesn't seem like the right way to consume the error. Is there a reason you handled it this way?
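For context on what is being questioned here, the sketch below shows the effect of the log-and-return-empty pattern when volumes are collected from several proxies: an unreachable proxy contributes nothing instead of failing the whole listing. It is an illustration only. The names volumes_or_empty and list_all_volumes are hypothetical, the exception is a bare stand-in, and stdlib logging with %s is used in place of the project's {}-style logger.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class StorageProxyConnectionError(Exception):
    """Bare stand-in for the exception class this PR introduces."""


async def volumes_or_empty(proxy_name, get_volumes):
    # Swallow the connection failure: warn and report no volumes for this proxy.
    try:
        return await get_volumes()
    except StorageProxyConnectionError:
        log.warning("Failed to connect to storage proxy (name: %s)", proxy_name)
        return []


async def list_all_volumes(proxies):
    # proxies maps a proxy name to an async callable returning its volume names.
    results = await asyncio.gather(
        *(volumes_or_empty(name, fn) for name, fn in proxies.items())
    )
    return [vol for vols in results for vol in vols]
```

The trade-off is that callers can no longer tell "this proxy has no volumes" apart from "this proxy was unreachable", which may be the concern behind the question above.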
resolves #5867 (BA-2370)