Azure stability issues in Github V2 API #5790

Open
tonistiigi opened this issue Feb 27, 2025 · 3 comments

@tonistiigi
Member

There has been an increase in errors from Azure URLs via the GitHub cache V2 API, which now uses the Azure library directly.

Examples:

#124 exporting to docker image format
#124 ERROR: failed to copy: GET https://productionresultssa1.blob.core.windows.net/actions-cache/2ce-15325802
--------------------------------------------------------------------------------
RESPONSE 503: 503 The server is busy.
ERROR CODE: ServerBusy
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>ServerBusy</Code><Message>The server is busy.
RequestId:4bd3d70f-601e-006f-4c70-879277000000
Time:2025-02-25T10:30:53.3094510Z</Message></Error>
--------------------------------------------------------------------------------

https://github.com/moby/buildkit/actions/runs/13557868316/job/37895553799

#40 exporting to client directory
#40 ERROR: failed to copy: GET https://productionresultssa1.blob.core.windows.net/actions-cache/358-27208952
--------------------------------------------------------------------------------
RESPONSE 503: 503 The server is busy.
ERROR CODE: ServerBusy
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>ServerBusy</Code><Message>The server is busy.
RequestId:543a61bf-801e-003a-53c2-8882fc000000
Time:2025-02-27T02:54:14.9578046Z</Message></Error>
--------------------------------------------------------------------------------

Then we have a similar one, but a 404 😕

#125 preparing build cache for export 75.6s done
#125 writing layer sha256:c83ffb796c379deb0f9b9940c2f63456545c1c228a2300355f3bf2ee3f0e3300 0.1s done
#125 ERROR: error writing layer blob: GET https://productionresultssa7.blob.core.windows.net/actions-cache/861-15325381
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:bbfec0b4-001e-00bf-6172-8735af000000
Time:2025-02-25T10:44:52.3953576Z</Message></Error>
--------------------------------------------------------------------------------

and this “internal” error report from a user, without many details: #5784

The 404 is unknown. It seems GitHub reports an Azure URL that doesn't actually exist (or gets lost). Note that BuildKit does handle blobs being deleted by GitHub, but here the GitHub side of the API appears to report the blob as existing at an Azure URL while that URL is not functional.
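For illustration (a sketch, not BuildKit's actual handling): the Azure error bodies shown in the logs above are XML with a `Code` element, which a client could parse to tell a retryable `ServerBusy` apart from a permanent `BlobNotFound`:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// azureError mirrors the <Error> body shown in the logs above.
type azureError struct {
	Code    string `xml:"Code"`
	Message string `xml:"Message"`
}

// retryable reports whether an Azure Storage error body names a
// condition the docs say to retry, and returns the parsed code.
func retryable(body []byte) (bool, string) {
	var e azureError
	if err := xml.Unmarshal(body, &e); err != nil {
		return false, ""
	}
	return e.Code == "ServerBusy" || e.Code == "OperationTimedOut", e.Code
}

func main() {
	busy := []byte(`<?xml version="1.0" encoding="utf-8"?><Error><Code>ServerBusy</Code><Message>The server is busy.</Message></Error>`)
	gone := []byte(`<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.</Message></Error>`)
	fmt.Println(retryable(busy)) // true ServerBusy
	fmt.Println(retryable(gone)) // false BlobNotFound
}
```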

The 503, though, is (surprisingly) mentioned in the docs as "expected".

https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets

When your application reaches the limit of what a partition can handle for your workload, Azure Storage begins to return error code 503 (Server Busy) or error code 500 (Operation Timeout) responses. If 503 errors are occurring, consider modifying your application to use an exponential backoff policy for retries.

@vangarp @amrmahdi Can you take a look and add the backoffs if this isn't something already handled by the library?

@tonistiigi tonistiigi added this to the v0.20.1 milestone Feb 27, 2025
@cpuguy83
Member

Looking through the client options, it looks like by default it will retry up to 3 times with a backoff of 4 seconds for each retry (max is 60s by default).
I suppose we could do more retries with a larger backoff?
I'm not sure if this will help since it seems like every client would be hitting the same blob containers.

Maybe we can get someone from github to chime in here.

@tonistiigi
Member Author

@cpuguy83 If you look at the logs in https://github.com/moby/buildkit/actions/runs/13557868316/job/37895553799#step:12:1110, it doesn't look like it retried for 60s before giving up. I'm not sure whether the retries would get logged, but there isn't 60s between the failure and the previous logs.

@cpuguy83
Member

Right, the 60s would be the default max backoff, but the default retries+delay would not hit that.
Looks like there are only 8 seconds between the last log line and the failure, which I would expect to be a bit longer given the default of 3 retries.
