
Replace Endpoints with Regional Endpoints #39390

Open
wants to merge 194 commits into base: main

Conversation

@tvaron3 tvaron3 commented Jan 24, 2025

Design

Currently in the SDK, every read request is retried 3 times in the same region before failing over to another region and marking the current region unavailable. This helps read requests; however, it does not help write requests, which are not retried at all in the same region and cannot be retried in other regions for single-master accounts.

With this new feature to improve write request availability, the SDK will now maintain a fallback endpoint. Both the current and fallback endpoints will point to the same write region. This allows the SDK to retry write requests on the fallback endpoint in case the current endpoint is unavailable due to connectivity issues. Below are some of the implementation details and the testing we have done so far.

The GetDatabaseAccount call, which happens during bootstrapping and every 5 minutes, will now return the following URIs for a region:

<account-name>-<region-name>.documents.azure.com

<account-name>.documents.azure.com

The service will randomly send variations of these endpoints in the getDatabaseAccount call (which the SDK makes every 5 minutes).

Example: for a Cosmos DB account named testAccount hosted in the West US region, the service will now send the following endpoints in a round-robin fashion:

testAccount-westus.documents.azure.com

testAccount.documents.azure.com

This allows clients to point to two different VIPs for the Gateway service, improving availability in scenarios where one gateway VIP goes down.
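
As an illustration of how a regional endpoint can be derived from the global endpoint and a region name, here is a minimal sketch; the helper name and the region normalization are assumptions for illustration, not the exact SDK implementation:

from urllib.parse import urlparse

def build_regional_endpoint(global_endpoint: str, region: str) -> str:
    # Hypothetical helper: turns "<account>.documents.azure.com" into
    # "<account>-<region>.documents.azure.com", keeping scheme, port and path.
    parsed = urlparse(global_endpoint)
    account, _, suffix = parsed.hostname.partition(".")
    regional_host = f"{account}-{region.lower().replace(' ', '')}.{suffix}"
    if parsed.port:
        regional_host = f"{regional_host}:{parsed.port}"
    return parsed._replace(netloc=regional_host).geturl()

# build_regional_endpoint("https://testAccount.documents.azure.com:443/", "West US")
# returns "https://testaccount-westus.documents.azure.com:443/"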

The SDK maintains current and previous endpoints based on the two endpoints returned by the gateway.

Updated SDK retry policy:

The SDK's default retry policy has 3 in-region retries in addition to the original request. The default connection timeout is 5 seconds, and the default read timeout is 65 seconds.

ServiceRequestError:

This error happens when the client is trying to connect to the server but, for some reason, cannot connect. In this case, since the SDK knows the request has not reached the service, it will retry both read and write requests.

For write requests, the SDK will first retry 3 times on the current endpoint and then 3 times on the fallback endpoint. If the request still fails, it will mark the endpoints unavailable for write operations and retry in other regions if any more write regions are available (the multi-master case); otherwise, the write request will fail.

For reads, it will retry 3 times on the current endpoint, then mark the region unavailable for reads and retry in other regions.

ServiceResponseError:

This error happens when the client has already connected to the server but, for some reason, received an error in the response. In this case, the SDK will only retry read operations, since it is not safe to retry writes: the SDK does not know whether the write operation succeeded.
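
To make the retry behavior above concrete, here is a simplified, illustrative sketch of the decision; the constants and helper names are assumptions and do not correspond to the actual SDK retry policy classes:

from azure.core.exceptions import ServiceRequestError, ServiceResponseError

MAX_IN_REGION_RETRIES = 3  # in-region retries in addition to the original request

def next_action(error, is_write, retries_on_current, retries_on_fallback):
    if isinstance(error, ServiceRequestError):
        # The request never reached the service, so both reads and writes are safe to retry.
        if retries_on_current < MAX_IN_REGION_RETRIES:
            return "retry_current_endpoint"
        if is_write and retries_on_fallback < MAX_IN_REGION_RETRIES:
            return "retry_fallback_endpoint"
        # Mark the region unavailable; writes only fail over if another write region exists.
        return "mark_unavailable_and_try_next_region"
    if isinstance(error, ServiceResponseError):
        # The request may have reached the service; only reads are safe to retry.
        if is_write:
            return "fail"
        if retries_on_current < MAX_IN_REGION_RETRIES:
            return "retry_current_endpoint"
        return "mark_unavailable_and_try_next_region"
    return "fail"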

Implementation

The location cache will now have a new RegionalEndpoint object that holds a current and a previous endpoint. The idea is that the previous endpoint can be used to retry in certain scenarios. There will now be a health check on every 5-minute global database account refresh. The health check reaches out to the different endpoints using a global database account call because it is quick. We also limited database account calls to a 3-second timeout and 1 retry. If this health check fails, we mark the endpoint as unavailable.

Illustrative Python sketch of the new current and previous logic (try_request and refresh_location_cache are placeholders for SDK internals; perform_health_check is sketched below):

class RegionalEndpoint:
    # Holds the current endpoint and the previous one kept around as a fallback.
    def __init__(self, current, previous=None):
        self.current = current
        self.previous = previous

def on_request(regional, request):
    # Request in progress: try the current endpoint first.
    if try_request(regional.current, request):
        return  # success: no-op
    # Current failed: retry on the previous endpoint.
    if regional.previous is not None and try_request(regional.previous, request):
        # Previous worked: swap current and previous.
        regional.current, regional.previous = regional.previous, regional.current
        return
    # Previous also failed: refresh the location cache and take the new value.
    new_endpoint = refresh_location_cache()
    if regional.current != new_endpoint:
        regional.previous = regional.current
    regional.current = new_endpoint

def on_database_account_refresh(regional, new_endpoint, default_endpoint, initial):
    # Runs on every database account refresh (every 5 minutes).
    if initial:
        regional.current = new_endpoint
        if default_endpoint != new_endpoint:
            regional.previous = default_endpoint
    else:
        if regional.current != new_endpoint:
            regional.previous = regional.current
        regional.current = new_endpoint
    perform_health_check(regional)  # sketched below
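
And a minimal sketch of the health check step, assuming hypothetical probe_database_account and mark_endpoint_unavailable helpers standing in for the SDK's GetDatabaseAccount call and location cache bookkeeping:

def perform_health_check(regional):
    # Probe both endpoints with the lightweight database account call,
    # limited to a 3-second timeout and a single retry as described above.
    for endpoint in (regional.current, regional.previous):
        if endpoint is None:
            continue
        healthy = False
        for _ in range(2):  # one original attempt plus one retry
            try:
                probe_database_account(endpoint, timeout=3)  # hypothetical helper
                healthy = True
                break
            except Exception:
                continue
        if not healthy:
            mark_endpoint_unavailable(endpoint)  # hypothetical helper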

Testing:

Testing by bringing the federations down in the staging environment:
Account Regions: East US 2, North Central US
SDK Preferred Locations: East US 2, North Central US

Note: Both read and write requests are retried 3 times by default in the current region. By default, the connection/request timeout in the Python SDK is 60 seconds.

Note: We have not yet tested with Envoy Proxy; however, we are setting it up locally in house to see how the new retry policies react to it. We would like to test with Envoy Proxy before shipping this hotfix to make sure this code change works reliably for customers using a proxy.

Bootstrapping Scenario:

Global endpoint down:

If the global endpoint is down, the initial topology call fails, and because of the default 60-second connection timeout and 3 in-region retries, it currently takes around 4 minutes to retry this topology call on other endpoints. Customers can fix this by lowering the request timeout; it should be set somewhere between 5 and 10 seconds. Meanwhile, we are also updating the timeout of the getDatabaseAccount call to 5 seconds so that the SDK can recover more quickly.
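
As an illustration only, a customer-side configuration for a lower timeout might look like the following; the keyword argument name is an assumption and can vary by SDK version, so please check the SDK documentation rather than treating this as the confirmed API:

from azure.cosmos import CosmosClient

# Assumed keyword: connection_timeout (seconds); lowering it lets the bootstrap
# topology call fail fast and fall back to other endpoints sooner.
client = CosmosClient(
    "https://testAccount.documents.azure.com:443/",
    credential="<account-key>",
    connection_timeout=5,
)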

If the global endpoint is down and the gateway returns the global endpoint for the write region, the SDK will first try the global endpoint, and once that fails, it will construct the regional endpoint during bootstrapping and try it as the fallback.

Regional endpoint down:

In this case, the SDK will keep going to the global endpoint and keep refreshing the location cache every 5 minutes (as usual). If it receives the regional endpoint, it will mark it unavailable after 4 attempts (1 original request + 3 retries) and keep falling back to the global endpoint.

Both endpoints healthy:

In this case, no issues were observed, and the SDK is able to load balance across the endpoints well.

Runtime Scenario:

Global Endpoint down:

For the GetDatabaseAccount call - because of the default 60-second request timeout and 3 retries, throughput decreases to a very low value, eventually getting down to almost 0 for a few minutes due to connectivity issues with the global endpoint on the topology call. This can be fixed by lowering the request timeout. (We are planning to reduce the timeout of this getDatabaseAccount call to 5 seconds in the current drop of the SDK.)

For write requests -> they will be retried on the fallback regional endpoint. If there is another write region in the preferred_regions, writes will then go to those regions after the current region is marked unavailable.

For read requests -> they will not be retried on the fallback endpoint and will go to a different region after the current region is marked unavailable for further reads.

Regional Endpoint down:

Almost no effect on throughput for either read or write requests, since the global endpoint is used for the topology call, and since that is working, there is one less point of failure.

@tvaron3 tvaron3 requested review from annatisch and a team as code owners January 24, 2025 17:20
@azure-sdk

API change check

APIView has identified API level changes in this PR and created the following API reviews.

azure-cosmos


Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 commented Feb 4, 2025

/azp run python - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).


@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - Thanks!

@tvaron3 commented Feb 5, 2025

/azp run python - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@kushagraThapar

/check-enforcer override

@kushagraThapar kushagraThapar enabled auto-merge (squash) February 5, 2025 06:43