Conversation

@joshua-holmes commented Sep 3, 2025

Problem

Like in #17472, the package manager does not retry fetching when communication to the server fails in a way that could possibly be resolved by retrying the fetch.

Solution

Automatically retry the fetch in these scenarios during http fetches:

  1. Connecting to the server fails
  2. Sending the payload to the server fails
  3. Server returns a non-200 status code

and these scenarios during git+http fetches:

  1. Discovering remote git server capabilities fails
  2. Creating a fetch stream fails

This PR does not expose the fetch config to the end user, for reasons you can read about in the next section. That is definitely worth discussing, but getting this PR out there as is doesn't seem like a bad idea to me.

Major Considerations

  • I thought about how I could expose configuration of the retry behavior (retry_count and retry_delay_ms) to the user. However, this proved difficult, so it is not part of this PR. Here are some thoughts I had, though:
    • I cannot expose config in the build.zig file, because (as far as I can tell) the build.zig file is not run until all the packages are fetched and ready to be built with the source code.
    • I do not want to include this config for each dep (dependency), because how would we handle deps of deps? We don't want a lib maintainer to determine how many times my build retries fetching the lib's deps; that's the end user's decision. Deps of deps could inherit the config, but then the end user isn't getting the control they are "promised" by having the config attached to each dep.
    • I do not want to include the config in the body of the manifest. It feels out of place, since that body isn't supposed to describe fetch config for dependencies; it's supposed to define the current package and point to its dependencies.
    • Maybe we allow config to be passed in the body of .dependencies, like below, so that .retry_count and the other props apply to all dependencies while still being part of the .dependencies object. This can easily be parsed by checking the type of each field: if the value is a struct, it's a real dep; if it's anything else, it's config, not a real dep. Fun idea, but it's not a part of this PR and shouldn't be. Here is an example of this idea:
```zig
.dependencies = .{
    .retry_count = 3,
    .retry_delay_ms = 1000,
    .zf = .{
        .url = "https://github.com/natecraddock/zf/archive/7aacbe6d155d64d15937ca95ca6c014905eb531f.tar.gz",
        .hash = "zf-0.10.3-OIRy8aiIAACLrBllz0zjxaH0aOe5oNm3KtEMyCntST-9",
        .lazy = true,
    },
    .vaxis = .{
        .url = "git+https://github.com/rockorager/libvaxis#1f41c121e8fc153d9ce8c6eb64b2bbab68ad7d23",
        .hash = "vaxis-0.1.0-BWNV_FUICQAFZnTCL11TUvnUr1Y0_ZdqtXHhd51d76Rn",
        .lazy = true,
    },
},
```
  • Would be cool to show the retry progress like this after it fails the first time. But it might be worth extending std.Progress.Node first for that, so this is also not implemented in this PR:
    Compile Build Script
    └─ [1/2] Fetch Packages
       ├─ vaxis (retry 1/2)
       └─ zf
    

Minor Considerations

  • I intentionally did not attempt to retry failures that do not seem resolvable by simply refetching, e.g. while receiving the HTTP header; looking through the possible errors there, they mostly stem from invalid responses, redirects, or write failures.
  • There are some locations where memory is now freed that wasn't being freed before. Since the user could configure Fetch to retry many times, I don't want the allocator to allocate without freeing on each refetch of the network data, which would be a memory leak (see the sketch after this list).
  • The Fetch.initResource() method needs to be refactored. But this PR is big enough, so it made sense to me to merge this, then do a refactor in a future PR.
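
For illustration, here is a minimal sketch of the retry-with-cleanup pattern described in the memory-freeing bullet above. `downloadOnce`, `maybeSpurious`, `retry_count`, and `retry_delay_ms` are hypothetical names for this sketch, not the actual Fetch.zig identifiers:

```zig
const std = @import("std");

// Placeholder for the real network read; in this sketch it just fails spuriously.
fn downloadOnce(gpa: std.mem.Allocator) ![]u8 {
    _ = gpa;
    return error.ConnectionResetByPeer;
}

// Trimmed-down stand-in for the spurious-error classification discussed later.
fn maybeSpurious(err: anyerror) bool {
    return switch (err) {
        error.ConnectionResetByPeer, error.ConnectionTimedOut => true,
        else => false,
    };
}

fn fetchWithRetry(
    gpa: std.mem.Allocator,
    retry_count: usize,
    retry_delay_ms: u64,
) ![]u8 {
    var attempt: usize = 0;
    while (true) : (attempt += 1) {
        // Each attempt must free anything it allocated before returning an
        // error (e.g. via errdefer inside downloadOnce), so a large
        // retry_count can never accumulate leaked partial downloads.
        return downloadOnce(gpa) catch |err| {
            if (attempt >= retry_count or !maybeSpurious(err)) return err;
            std.Thread.sleep(retry_delay_ms * std.time.ns_per_ms);
            continue;
        };
    }
}
```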

@joshua-holmes marked this pull request as ready for review September 3, 2025 01:06
@ehaas (Contributor) commented Sep 4, 2025

It might be useful to look at the defaults of other package managers for ideas - the cargo source has some good info about retries, including links to other resources: https://doc.rust-lang.org/nightly/nightly-rustc/src/cargo/util/network/retry.rs.html

Some notes:

  • Default retries is 3 (total of 4 attempts)
  • For HTTP failures, try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future
  • Otherwise, first delay is a randomized value between 500ms and 1500ms
  • Subsequent delays are linear backoff 3500ms, 6500ms, etc up to a max of 10 seconds
  • Classification of spurious (retryable) errors in maybe_spurious:
    • Various network-level errors like DNS failure or TCP errors or network timeouts
    • HTTP 5xx or HTTP 429

@joshua-holmes (Author) commented Sep 5, 2025

@ehaas Yeah, I also took a look at the uv python package manager and found very similar behavior. I created this PR with the intention of keeping it really simple to start with and adding complexity if requested. Better that than the reverse, and I'm new here so I don't know where the tolerances are.

I like that list of notes you jotted down though, so I'll start there.

@joshua-holmes (Author)

Has anyone heard of a need/desire for the Zig HTTP client to have retry logic built into it? If not, I'll continue implementing the retry logic in the Fetch.zig file to avoid abstracting unnecessarily, but if this is a desire, I can implement it at a lower level in std.http.Client.zig.

@joshua-holmes (Author)

These are the errors I have decided are spurious:

```zig
fn maybe_spurious(err: anyerror) bool {
    // `anyerror` because the listed errors come from more than one error
    // set (connecting and receiving the response head).
    return switch (err) {
        // FROM: ConnectTcpError
        // tcp errors
        error.ConnectionTimedOut,
        error.ConnectionResetByPeer,
        error.ConnectionRefused,
        error.NetworkUnreachable,
        error.AddressInUse,
        error.TemporaryNameServerFailure,
        error.SystemResources,
        error.OutOfMemory,

        // FROM: ReceiveHeadError
        // io errors
        error.WriteFailed,
        error.ReadFailed,

        // http errors
        error.HttpRequestTruncated,
        error.HttpConnectionClosing,
        error.HttpChunkTruncated,
        error.HttpChunkInvalid,

        // retrying a proxy is covered by this list, since proxy failures
        // return the same errors as normal failures

        // still trying to find where an equivalent EOF error might be...
        => true,

        else => false,
    };
}
```

I mostly used gitoxide (used by cargo) as a reference, plus my own judgement.

Many of the TCP errors were pretty easy to map from gitoxide, and the ones gitoxide chose as spurious made sense to me.

The HTTP errors, on the other hand, did not map well from cargo, which is really curl under the hood. Several of the errors curl handles don't correspond to how our client works; our implementation differs, so different failures are possible. So I derived what I could from cargo and curl, and used my best judgement for the rest.

As the comment in the code above shows, the only thing I still need to investigate is how to detect a Zig equivalent of the curl EOF error.

I'm working on this as often as I can, but sometimes that's as infrequently as once per week, so it's hard for me to move fast. Plus I'm learning a lot about low-level networking, which doesn't help. That's all for now.

@joshua-holmes marked this pull request as draft October 4, 2025 21:46
`ConnectTcpError` is a subset of `ConnectError`, but the 4 additional errors that `ConnectError` adds:
```zig
error{
    UnsupportedUriScheme,
    UriMissingHost,
    UriHostTooLong,
    CertificateBundleLoadFailure,
}
```
...are never returned by `Client.connect()`. So we are communicating that this method could return one of those 4 errors when, in reality, it never can. This commit fixes that.
The two error types look like this, side by side:
```zig
pub const RequestError = ConnectTcpError || error{
    UnsupportedUriScheme,
    UriMissingHost,
    UriHostTooLong,
    CertificateBundleLoadFailure,
};
pub const ConnectError = ConnectTcpError || RequestError;
```

Once evaluated, they end up containing the exact same error types, so it
is redundant to have them both. I chose to keep `RequestError` because
it is used slightly more and also so the `Client.request()` method gets
an error return type that matches the name of the method.
@joshua-holmes force-pushed the master branch 2 times, most recently from efc885c to 8b08f0b on October 5, 2025 01:16
@joshua-holmes (Author) commented Oct 5, 2025
I still need to test it, but jitter is implemented, spurious error checks are implemented, and it compiles. The only thing left to implement is reading the "Retry-After" header.

@joshua-holmes force-pushed the master branch 2 times, most recently from 2ad8daa to 7416a9d on October 19, 2025 04:59
@joshua-holmes (Author)

I implemented reading and parsing the Retry-After header! Some notes:

  • I used the MDN Retry-After docs to inform me on how to parse this.
  • The Retry-After header can contain either a positive integer indicating how many seconds to wait, or a timestamp indicating a point in time after which the client can retry. I used the MDN Date docs to inform me on how to parse the date. If parsing fails, we skip reading the header (see the sketch after this list).
  • If Retry-After gives us a really long wait time, we cap it at our current max delay, which is 10 sec. Same as Cargo does.
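
For illustration, here is a minimal sketch of that Retry-After handling; `delayFromRetryAfter` and `parseHttpDateToEpochSeconds` are hypothetical names for this sketch, not the actual Fetch.zig code:

```zig
const std = @import("std");

const max_delay_ms: u64 = 10 * std.time.ms_per_s;
const max_delay_s: u64 = max_delay_ms / std.time.ms_per_s;

// Hypothetical stand-in for an HTTP-date parser built on the
// lib/std/time/epoch.zig additions; it always fails in this sketch.
fn parseHttpDateToEpochSeconds(value: []const u8) !u64 {
    _ = value;
    return error.InvalidDate;
}

/// Returns a delay in milliseconds derived from the Retry-After value, or
/// null if the header should be ignored and the normal jittered delay used.
fn delayFromRetryAfter(value: []const u8, now_epoch_s: u64) ?u64 {
    const trimmed = std.mem.trim(u8, value, " \t");

    // Form 1: "Retry-After: <delay-seconds>"
    if (std.fmt.parseInt(u64, trimmed, 10)) |seconds| {
        return @min(seconds, max_delay_s) * std.time.ms_per_s;
    } else |_| {}

    // Form 2: "Retry-After: <http-date>"; skip the header if parsing fails
    // or the date is already in the past.
    const retry_at = parseHttpDateToEpochSeconds(trimmed) catch return null;
    if (retry_at <= now_epoch_s) return null;
    return @min(retry_at - now_epoch_s, max_delay_s) * std.time.ms_per_s;
}
```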

All I need to do now is figure out how to unit test the Fetch.zig changes. It's clear that there isn't a whole lot of testing in that file currently, probably because mocking network calls is tricky. I'll do my best, but if I can't, I'll just mock network calls in a bash script and test locally, even if it's not part of the test suite. Then I should be done and ready for review!

@joshua-holmes force-pushed the master branch 2 times, most recently from 3f749af to 91adcae on October 25, 2025 20:29
```diff
     port: u16,
     protocol: Protocol,
-) ConnectError!*Connection {
+) ConnectTcpError!*Connection {
```
@joshua-holmes (Author)

ConnectError is the same as RequestError; both are a larger collection of errors than ConnectTcpError. But this function only returns errors found in ConnectTcpError, so this change makes the possible returned error types clearer.

```diff
     };
 }
 
-pub const ConnectError = ConnectTcpError || RequestError;
```
@joshua-holmes (Author)

ConnectError is the same as RequestError. Is deleting ConnectError the right move? Or would it be better to leave it in with a doc comment saying it is deprecated and what to use instead?

@joshua-holmes (Author)

OK, it's ready for review again. Here are the key features, which are very similar to the list of features @ehaas commented earlier.

Features

  • Default retries is 3 (total of 4 attempts)
    • For some HTTP failures (500+, 429, 404), try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future. See my above comment for how I implemented that.
      • Adheres to our 10 sec max, so we will never wait longer than that, even if "Retry-After" tells us to, same as Cargo.
      • If "Retry-After" is set to a point in time in the past, we ignore the header and calculate our delay (with jitter) as normal.
    • Otherwise, first delay is a randomized value between 500ms and 1500ms
    • Subsequent delays are linear backoff: 3000ms, 6000ms, etc., up to a max of 10 seconds (sketched after this list). Note that Cargo uses 3500, 6500, etc. instead; I just felt this was simpler.
  • Classification of spurious (retryable) errors can be seen in the code linked here. This is a combination of what Cargo and uv consider spurious, plus my common sense. HTTP status codes 500+, 429, and 404 are considered spurious; others are not.
  • Extended lib/std/time/epoch.zig, which was necessary to convert date to epoch seconds
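
For illustration, here is a minimal sketch of that delay schedule (hypothetical names, not the exact Fetch.zig code):

```zig
const std = @import("std");

const max_delay_ms: u64 = 10 * std.time.ms_per_s;

/// First delay is a random value in [500, 1500) ms; later delays back off
/// linearly (3000 ms, 6000 ms, ...) and are capped at 10 seconds.
fn retryDelayMs(attempt: usize, random: std.Random) u64 {
    if (attempt == 0) return 500 + random.uintLessThan(u64, 1000);
    return @min(@as(u64, attempt) * 3000, max_delay_ms);
}

test retryDelayMs {
    var prng = std.Random.DefaultPrng.init(0);
    const random = prng.random();
    const first = retryDelayMs(0, random);
    try std.testing.expect(first >= 500 and first < 1500);
    try std.testing.expectEqual(@as(u64, 3000), retryDelayMs(1, random));
    try std.testing.expectEqual(@as(u64, 6000), retryDelayMs(2, random));
    try std.testing.expectEqual(max_delay_ms, retryDelayMs(5, random));
}
```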

Testing

I unit tested quite a bit of the new code, and the tests pass. I also tested this locally with and without network failures, logging before and after the delays to confirm the real delay matched the computed one. I also tested against a local endpoint that returns a "Retry-After" header. All works really well.

@joshua-holmes marked this pull request as ready for review October 26, 2025 04:28
@joshua-holmes requested a review from linusg October 26, 2025 04:28