Replace Endpoints with Regional Endpoints #39390
Open
tvaron3 wants to merge 194 commits into Azure:main from tvaron3:tvaron3/regionalEndpoints
+2,365
−1,187
…morenoh/azure-sdk-for-python into service_response_error_policy
…morenoh/azure-sdk-for-python into tvaron3/regionalEndpoints
…to tvaron3/regionalEndpoints
API change check: APIView has identified API level changes in this PR and created the following API reviews.
sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
Outdated
simorenoh
reviewed
Jan 24, 2025
simorenoh
reviewed
Jan 24, 2025
jeet1995
reviewed
Jan 24, 2025
Azure Pipelines successfully started running 1 pipeline(s).
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…3/azure-sdk-for-python into tvaron3/regionalEndpoints
…3/azure-sdk-for-python into tvaron3/regionalEndpoints
FabianMeiswinkel
approved these changes
Feb 5, 2025
LGTM - Thanks!
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
kushagraThapar
approved these changes
Feb 5, 2025
/check-enforcer override
Design
Currently in the SDK, every read request is retried 3 times in the same region before the SDK fails over to another region and marks the current region unavailable. This helps read requests but does not help write requests, which are not retried at all in the same region and cannot be retried in other regions for single-master accounts.
To improve write availability, the SDK will now maintain a fallback endpoint. Both the current and the fallback endpoint point to the same write region, which lets the SDK retry write requests on the fallback endpoint if the current endpoint is unavailable because of connectivity issues. The implementation details and the testing done so far are described below.
The GetDatabaseAccount call, which happens during bootstrapping and every 5 minutes afterwards, will now return the following URIs for a region:
<account-name>-<region-name>.documents.azure.com
<account-name>.documents.azure.com
The service will vary which of these endpoints it sends in the getDatabaseAccount response (the SDK makes this call every 5 minutes).
Example: for a Cosmos DB account named testAccount hosted in the West US region, the service will now send the following endpoints in a round-robin fashion:
testAccount-westus.documents.azure.com
testAccount.documents.azure.com
This lets clients point to two different VIPs for the Gateway service, which improves availability when one gateway VIP goes down.
The SDK maintains a current and a previous endpoint from the two endpoints returned by the gateway.
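A minimal sketch of how the two hostnames in the example relate to the account and region names. The helper names are hypothetical, and the rule for turning a region display name into the hostname label (lowercase, spaces removed) is an assumption inferred from the West US example rather than something the service contract states here.

```python
def regional_endpoint(account_name: str, region_name: str,
                      suffix: str = "documents.azure.com") -> str:
    """Build '<account-name>-<region-name>.<suffix>' for one region."""
    # Assumption: the region label is the display name lowercased with
    # spaces removed, e.g. "West US" -> "westus".
    region_label = region_name.replace(" ", "").lower()
    return f"{account_name}-{region_label}.{suffix}"


def global_endpoint(account_name: str,
                    suffix: str = "documents.azure.com") -> str:
    """Build '<account-name>.<suffix>', the account-level endpoint."""
    return f"{account_name}.{suffix}"


print(regional_endpoint("testAccount", "West US"))  # testAccount-westus.documents.azure.com
print(global_endpoint("testAccount"))               # testAccount.documents.azure.com
```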
SDK Updated Retry policy:
The SDK’s default retry policy has 3 in-region retries in addition to the original request. The default connection timeout is 5 seconds and the default read timeout is 65 seconds.
ServiceRequestError:
This error happens when the client tries to connect to the server but cannot. Since the SDK knows that the request has not reached the service, it will retry both read and write requests.
For write requests, the SDK will first retry 3 times on the current endpoint and then 3 times on the fallback endpoint. If the request still fails, it will mark the endpoints unavailable for write operations and retry in other regions if more write regions are available (the multi-master case); otherwise, the write request will fail.
For reads, the SDK will retry 3 times on the current endpoint, then mark the region unavailable for reads and retry in other regions.
ServiceResponseError:
This error happens when the client has already connected to the server but received an error while reading the response. In this case, the SDK will only retry read operations; retrying writes is not safe because the SDK does not know whether the write operation succeeded.
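The retry rules for the two error types can be summarized as a small decision sketch. This is illustrative only: `ErrorKind`, `is_retryable`, and `request_error_plan` are hypothetical names and are not the SDK's retry policy classes. The "fail the request" outcome for reads with no other region is an assumption, since the description states it only for writes.

```python
from enum import Enum
from typing import List


class ErrorKind(Enum):
    SERVICE_REQUEST = "ServiceRequestError"    # request never reached the service
    SERVICE_RESPONSE = "ServiceResponseError"  # connected, failed during the response


IN_REGION_RETRIES = 3  # retries per endpoint, as stated in the description


def is_retryable(is_write: bool, error: ErrorKind) -> bool:
    """Reads are always retried; writes only when the request never left the client."""
    if error is ErrorKind.SERVICE_REQUEST:
        return True
    return not is_write


def request_error_plan(is_write: bool, other_regions_available: bool) -> List[str]:
    """Ordered steps after a ServiceRequestError on the original request."""
    steps = [f"retry {IN_REGION_RETRIES}x on current endpoint"]
    if is_write:
        steps.append(f"retry {IN_REGION_RETRIES}x on fallback endpoint")
        steps.append("mark endpoints unavailable for writes")
    else:
        steps.append("mark region unavailable for reads")
    # Assumption: with no other region to try, the request fails
    # (the description says this explicitly for writes only).
    steps.append("retry in another region" if other_regions_available
                 else "fail the request")
    return steps


for step in request_error_plan(is_write=True, other_regions_available=False):
    print(step)
```

For a single-master account, the write plan ends in "fail the request" because there is no second write region to fail over to.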
Implementation
The location cache will now have a new RegionalEndpoint object that holds a current and a previous endpoint. The idea is that the previous endpoint can be used to retry in certain scenarios. A health check will now run with every 5-minute global database account refresh. The health check reaches out to the different endpoints using a global database account call because that call is quick. We also limited database account calls to 3 seconds and 1 retry. If the health check fails, the endpoint is marked unavailable.
Pseudocode of new current and previous logic
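As a rough stand-in for the current/previous idea, the sketch below models a pair of endpoints and a health check that falls back to the previous endpoint. It is not the PR's pseudocode or the SDK's `RegionalEndpoint` class: the class name, the `swap` method, and the "promote previous when current is unhealthy" rule are all assumptions made for illustration, and the probe is passed in as a plain callable instead of a real database account call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegionalEndpointSketch:
    """Hypothetical pair of endpoints that point to the same region."""
    current: str
    previous: str

    def swap(self) -> None:
        # Promote the previous endpoint to current and keep the old one around.
        self.current, self.previous = self.previous, self.current


def apply_health_check(endpoint: RegionalEndpointSketch,
                       is_healthy: Callable[[str], bool]) -> bool:
    """Return True if the region still has a usable endpoint after the check."""
    if is_healthy(endpoint.current):
        return True
    # Assumption: when the current endpoint fails the probe, try the previous one.
    if endpoint.previous != endpoint.current and is_healthy(endpoint.previous):
        endpoint.swap()
        return True
    return False  # caller would mark the endpoint unavailable


ep = RegionalEndpointSketch(
    current="testAccount-westus.documents.azure.com",
    previous="testAccount.documents.azure.com",
)
# Simulate the regional endpoint being down while the global one is reachable.
usable = apply_health_check(ep, lambda host: host == "testAccount.documents.azure.com")
print(usable, ep.current)
```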
Testing:
Testing by bringing the federations down in the staging environment:
Account Regions: East US 2, North Central US
SDK Preferred Locations: East US 2, North Central US
Note: Both read and write requests are retried 3 times by default in the current region. By default, the connection / request timeout in the Python SDK is 60 seconds.
Note: We have not yet tested with Envoy Proxy. We are setting it up locally in house to see how the new retry policies behave behind it, and we would like to complete that testing before shipping this hotfix to make sure the change works reliably for customers who use a proxy.
Bootstrapping Scenario:
Global endpoint down:
If the global endpoint is down, the initial topology call fails, and because of the default 60-second connection timeout and 3 in-region retries, it currently takes around 4 minutes before this topology call is retried on other endpoints. Customers can mitigate this by lowering the request timeout to somewhere between 5 and 10 seconds. Meanwhile, we are also reducing the timeout of the getDatabaseAccount call to 5 seconds so that the SDK can recover more quickly.
If the global endpoint is down and the gateway returns the global endpoint for the write region, the SDK will first try the global endpoint and, once that fails, will construct the regional endpoint (bootstrapping case) and try it as the fallback.
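The roughly 4 minutes quoted for the global-endpoint-down case follows from the timeout and retry count. The sketch below reproduces that arithmetic under a simplifying assumption: every attempt runs to the full timeout and any backoff delay between retries is ignored. The function name is hypothetical.

```python
def worst_case_seconds(timeout_s: int, retries: int = 3) -> int:
    """Upper bound on time spent against one unreachable endpoint."""
    attempts = 1 + retries  # original request plus in-region retries
    return timeout_s * attempts


print(worst_case_seconds(60))  # 240 seconds, about 4 minutes
print(worst_case_seconds(10))  # 40
print(worst_case_seconds(5))   # 20
```

With a 5 to 10 second timeout the same failure is detected in well under a minute, which is why lowering the timeout is the suggested mitigation.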
Regional endpoint down:
In this case, the SDK will keep going to the global endpoint and keep refreshing the location cache every 5 minutes (as usual). If it receives the regional endpoint, it will mark it unavailable after 4 attempts (1 original request + 3 retries) and keep falling back to the global endpoint.
Both endpoints healthy:
In this case, no issues were observed, and the SDK is able to load balance well.
Runtime Scenario:
Global Endpoint down:
For the GetDatabaseAccount call: because of the default 60-second request timeout and 3 retries, throughput drops to a very low value, eventually reaching almost 0 for a few minutes while the topology call has connectivity issues with the global endpoint. This can be fixed by lowering the request timeout. (We are planning to reduce the timeout of this getDatabaseAccount call to 5 seconds in the current drop of the SDK.)
For write requests: they will be retried on the fallback regional endpoint. If there is another write region in preferred_regions, writes will then go to those regions after the current region is marked unavailable.
For read requests: they will not be retried on the fallback endpoint and will go to a different region after the current region is marked unavailable for further reads.
Regional Endpoint down:
Almost no effect on throughput for either read or write requests: the global endpoint is used for the topology call, and since that is working, there is one less point of failure.