Conversation

@joshua-holmes commented Sep 3, 2025

Problem

Like in #17472, the package manager does not retry fetching when communication to the server fails in a way that could possibly be resolved by retrying the fetch.

Solution

Automatically retry the fetch in these scenarios during http fetches:

  1. Connecting to the server fails
  2. Sending the payload to the server fails
  3. Server returns a non-200 status code

and these scenarios during git+http fetches:

  1. Discovering remote git server capabilities fails
  2. Creating a fetch stream fails

This PR does not expose the fetch config to the end user, for reasons you can read about in the next section. That is definitely worth discussing, but getting this PR out there as is doesn't seem like a bad idea to me.

Major Considerations

  • I thought about how I could expose configuration of the retry behavior (retry_count and retry_delay_ms) to the user. However, this proved difficult, so it is not part of this PR. Here are some thoughts I had, though:
    • I cannot expose config in the build.zig file, because (as far as I can tell) the build.zig file is not run until all the packages are fetched and ready to be built with the source code.
    • I do not want to include this config for each dep (dependency), because how would we handle deps of deps? We don't want a lib maintainer to determine how many times my build retries fetching the lib's deps; that's the end user's decision. Deps of deps could inherit the config, but then the end user isn't getting the control they are "promised" by having the config attached to each dep.
    • I do not want to include the config in the body of the manifest. It feels out of place, since that body isn't supposed to describe fetch config for dependencies; it's supposed to define the current package and point to its dependencies.
    • Maybe we allow config to be passed in the body of .dependencies, like below, so that .retry_count and the other props apply to all dependencies while still being part of the .dependencies object. This can easily be parsed by checking the type of each field: if the value is a struct, it's a real dep; if it's anything else, it's config, not a real dep. Fun idea, but it's not a part of this PR and shouldn't be. Here is an example of this idea:
```zig
.dependencies = .{
    .retry_count = 3,
    .retry_delay_ms = 1000,
    .zf = .{
        .url = "https://github.com/natecraddock/zf/archive/7aacbe6d155d64d15937ca95ca6c014905eb531f.tar.gz",
        .hash = "zf-0.10.3-OIRy8aiIAACLrBllz0zjxaH0aOe5oNm3KtEMyCntST-9",
        .lazy = true,
    },
    .vaxis = .{
        .url = "git+https://github.com/rockorager/libvaxis#1f41c121e8fc153d9ce8c6eb64b2bbab68ad7d23",
        .hash = "vaxis-0.1.0-BWNV_FUICQAFZnTCL11TUvnUr1Y0_ZdqtXHhd51d76Rn",
        .lazy = true,
    },
},
```
  • Would be cool to show the retry progress like this after it fails the first time. But it might be worth extending std.Progress.Node first for that, so this is also not implemented in this PR:
    Compile Build Script
    └─ [1/2] Fetch Packages
       ├─ vaxis (retry 1/2)
       └─ zf
    

Minor Considerations

  • I intentionally did not attempt to retry failures that do not seem resolvable by simply refetching, e.g. while receiving the HTTP header; looking through the possible errors there, they mostly stem from invalid responses, redirects, or write failures.
  • There are some locations where memory is now freed that wasn't being freed before. Since the user could configure Fetch to retry many times, I don't want the allocator to allocate without freeing on each refetch of the network data, which would be a memory leak (see the sketch after this list).
  • The Fetch.initResource() method needs to be refactored. But this PR is big enough, so it made sense to me to merge this, then do a refactor in a future PR.
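
For illustration, here is a minimal sketch of the retry-with-cleanup pattern described in the memory-freeing bullet above. `downloadOnce`, `maybeSpurious`, `retry_count`, and `retry_delay_ms` are hypothetical names for this sketch, not the actual Fetch.zig identifiers:

```zig
const std = @import("std");

// Placeholder for the real network read; in this sketch it just fails spuriously.
fn downloadOnce(gpa: std.mem.Allocator) ![]u8 {
    _ = gpa;
    return error.ConnectionResetByPeer;
}

// Trimmed-down stand-in for the spurious-error classification discussed later.
fn maybeSpurious(err: anyerror) bool {
    return switch (err) {
        error.ConnectionResetByPeer, error.ConnectionTimedOut => true,
        else => false,
    };
}

fn fetchWithRetry(
    gpa: std.mem.Allocator,
    retry_count: usize,
    retry_delay_ms: u64,
) ![]u8 {
    var attempt: usize = 0;
    while (true) : (attempt += 1) {
        // Each attempt must free anything it allocated before returning an
        // error (e.g. via errdefer inside downloadOnce), so a large
        // retry_count can never accumulate leaked partial downloads.
        return downloadOnce(gpa) catch |err| {
            if (attempt >= retry_count or !maybeSpurious(err)) return err;
            std.Thread.sleep(retry_delay_ms * std.time.ns_per_ms);
            continue;
        };
    }
}
```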

@joshua-holmes marked this pull request as ready for review September 3, 2025 01:06
@ehaas (Contributor) commented Sep 4, 2025

It might be useful to look at the defaults of other package managers for ideas - the cargo source has some good info about retries, including links to other resources: https://doc.rust-lang.org/nightly/nightly-rustc/src/cargo/util/network/retry.rs.html

Some notes:

  • Default retries is 3 (total of 4 attempts)
  • For HTTP failures, try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future
  • Otherwise, first delay is a randomized value between 500ms and 1500ms
  • Subsequent delays are linear backoff 3500ms, 6500ms, etc up to a max of 10 seconds
  • Classification of spurious (retryable) errors in maybe_spurious:
    • Various network-level errors like DNS failure or TCP errors or network timeouts
    • HTTP 5xx or HTTP 429

@joshua-holmes (Author) commented Sep 5, 2025

@ehaas Yeah, I also took a look at the uv python package manager and found very similar behavior. I created this PR with the intention of keeping it really simple to start with and adding complexity if requested. Better that than the reverse, and I'm new here so I don't know where the tolerances are.

I like that list of notes you jotted down though, so I'll start there.

@joshua-holmes (Author)

Has anyone heard of a need/desire for the Zig HTTP client to have retry logic built into it? If not, I'll continue implementing the retry logic in the Fetch.zig file to avoid abstracting unnecessarily, but if this is a desire, I can implement it at a lower level in std.http.Client.zig.

@joshua-holmes (Author)

These are the errors I have decided are spurious:

```zig
fn maybe_spurious(err: anyerror) bool {
    // `anyerror` because the listed errors come from more than one error
    // set (connecting and receiving the response head).
    return switch (err) {
        // FROM: ConnectTcpError
        // tcp errors
        error.ConnectionTimedOut,
        error.ConnectionResetByPeer,
        error.ConnectionRefused,
        error.NetworkUnreachable,
        error.AddressInUse,
        error.TemporaryNameServerFailure,
        error.SystemResources,
        error.OutOfMemory,

        // FROM: ReceiveHeadError
        // io errors
        error.WriteFailed,
        error.ReadFailed,

        // http errors
        error.HttpRequestTruncated,
        error.HttpConnectionClosing,
        error.HttpChunkTruncated,
        error.HttpChunkInvalid,

        // retrying a proxy is covered by this list, since proxy failures
        // return the same errors as normal failures

        // still trying to find where an equivalent EOF error might be...
        => true,

        else => false,
    };
}
```

I mostly used gitoxide (used by cargo) as a reference, plus my own judgement.

Many of the TCP errors were pretty easy to map from gitoxide, and the ones gitoxide chose as spurious made sense to me.

The HTTP errors, on the other hand, did not map well from cargo, which is really curl under the hood. Several of the errors curl handles don't correspond to how our client works; our implementation differs, so different failures are possible. So I derived what I could from cargo and curl, and used my best judgement for the rest.

As the comment in the code above shows, the only thing I still need to investigate is how to detect a Zig equivalent of the curl EOF error.

I'm working on this as often as I can, but sometimes that's as infrequently as once per week, so it's hard for me to move fast. Plus I'm learning a lot about low-level networking, which doesn't help. That's all for now.

@joshua-holmes marked this pull request as draft October 4, 2025 21:46
`ConnectTcpError` is a subset of `ConnectError`, but the 4 additional errors that `ConnectError` adds:
```zig
error{
    UnsupportedUriScheme,
    UriMissingHost,
    UriHostTooLong,
    CertificateBundleLoadFailure,
}
```
...are never returned by `Client.connect()`. So we are communicating that this method could return one of those 4 errors when, in reality, it never can. This commit fixes that.
The two error types look like this, side by side:
```zig
pub const RequestError = ConnectTcpError || error{
    UnsupportedUriScheme,
    UriMissingHost,
    UriHostTooLong,
    CertificateBundleLoadFailure,
};
pub const ConnectError = ConnectTcpError || RequestError;
```

Once evaluated, they end up containing the exact same error types, so it
is redundant to have them both. I chose to keep `RequestError` because
it is used slightly more and also so the `Client.request()` method gets
an error return type that matches the name of the method.
@joshua-holmes force-pushed the master branch 2 times, most recently from efc885c to 8b08f0b on October 5, 2025 01:16
@joshua-holmes (Author) commented Oct 5, 2025
I still need to test it, but jitter is implemented, spurious error checks are implemented, and it compiles. The only thing left to implement is reading the "Retry-After" header.

@joshua-holmes force-pushed the master branch 2 times, most recently from 2ad8daa to 7416a9d on October 19, 2025 04:59
@joshua-holmes (Author)

I implemented reading and parsing the Retry-After header! Some notes:

  • I used the MDN Retry-After docs to inform me on how to parse this.
  • The Retry-After header can contain either a positive integer indicating how many seconds to wait, or a timestamp indicating a point in time after which the client can retry. I used the MDN Date docs to inform me on how to parse the date. If parsing fails, we skip reading the header (see the sketch after this list).
  • If Retry-After gives us a really long wait time, we cap it at our current max delay, which is 10 sec. Same as Cargo does.
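
For illustration, here is a minimal sketch of that Retry-After handling; `delayFromRetryAfter` and `parseHttpDateToEpochSeconds` are hypothetical names for this sketch, not the actual Fetch.zig code:

```zig
const std = @import("std");

const max_delay_ms: u64 = 10 * std.time.ms_per_s;
const max_delay_s: u64 = max_delay_ms / std.time.ms_per_s;

// Hypothetical stand-in for an HTTP-date parser built on the
// lib/std/time/epoch.zig additions; it always fails in this sketch.
fn parseHttpDateToEpochSeconds(value: []const u8) !u64 {
    _ = value;
    return error.InvalidDate;
}

/// Returns a delay in milliseconds derived from the Retry-After value, or
/// null if the header should be ignored and the normal jittered delay used.
fn delayFromRetryAfter(value: []const u8, now_epoch_s: u64) ?u64 {
    const trimmed = std.mem.trim(u8, value, " \t");

    // Form 1: "Retry-After: <delay-seconds>"
    if (std.fmt.parseInt(u64, trimmed, 10)) |seconds| {
        return @min(seconds, max_delay_s) * std.time.ms_per_s;
    } else |_| {}

    // Form 2: "Retry-After: <http-date>"; skip the header if parsing fails
    // or the date is already in the past.
    const retry_at = parseHttpDateToEpochSeconds(trimmed) catch return null;
    if (retry_at <= now_epoch_s) return null;
    return @min(retry_at - now_epoch_s, max_delay_s) * std.time.ms_per_s;
}
```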

All I need to do now is figure out how to unit test the Fetch.zig changes. It's clear that there isn't a whole lot of testing in that file currently, probably because mocking network calls is tricky. I'll do my best, but if I can't, I'll just mock network calls in a bash script and test locally, even if it's not part of the test suite. Then I should be done and ready for review!

@joshua-holmes force-pushed the master branch 2 times, most recently from 3f749af to 91adcae on October 25, 2025 20:29
```diff
     port: u16,
     protocol: Protocol,
-) ConnectError!*Connection {
+) ConnectTcpError!*Connection {
```
@joshua-holmes (Author)

ConnectError is the same as RequestError; both are a larger collection of errors than ConnectTcpError. But this function only returns errors found in ConnectTcpError, so this change makes the possible returned error types clearer.

```diff
     };
 }
 
-pub const ConnectError = ConnectTcpError || RequestError;
```
@joshua-holmes (Author)

ConnectError is the same as RequestError. Is deleting ConnectError the right move? Or would it be better to leave it in with a doc comment saying it is deprecated and what to use instead?

@joshua-holmes (Author)

OK, it's ready for review again. Here are the key features, which are very similar to the list of features @ehaas commented earlier.

Features

  • Default retries is 3 (total of 4 attempts)
    • For some HTTP failures (500+, 429, 404), try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future. See my above comment for how I implemented that.
      • Adheres to our 10 sec max, so we will never wait longer than that, even if "Retry-After" tells us to, same as Cargo.
      • If "Retry-After" is set to a point in time in the past, we ignore the header and calculate our delay (with jitter) as normal.
    • Otherwise, first delay is a randomized value between 500ms and 1500ms
    • Subsequent delays are linear backoff: 3000ms, 6000ms, etc., up to a max of 10 seconds (sketched after this list). Note that Cargo uses 3500, 6500, etc. instead; I just felt this was simpler.
  • Classification of spurious (retryable) errors can be seen in the code linked here. This is a combination of what Cargo and uv consider spurious, plus my common sense. HTTP status codes 500+, 429, and 404 are considered spurious; others are not.
  • Extended lib/std/time/epoch.zig, which was necessary to convert date to epoch seconds
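
For illustration, here is a minimal sketch of that delay schedule (hypothetical names, not the exact Fetch.zig code):

```zig
const std = @import("std");

const max_delay_ms: u64 = 10 * std.time.ms_per_s;

/// First delay is a random value in [500, 1500) ms; later delays back off
/// linearly (3000 ms, 6000 ms, ...) and are capped at 10 seconds.
fn retryDelayMs(attempt: usize, random: std.Random) u64 {
    if (attempt == 0) return 500 + random.uintLessThan(u64, 1000);
    return @min(@as(u64, attempt) * 3000, max_delay_ms);
}

test retryDelayMs {
    var prng = std.Random.DefaultPrng.init(0);
    const random = prng.random();
    const first = retryDelayMs(0, random);
    try std.testing.expect(first >= 500 and first < 1500);
    try std.testing.expectEqual(@as(u64, 3000), retryDelayMs(1, random));
    try std.testing.expectEqual(@as(u64, 6000), retryDelayMs(2, random));
    try std.testing.expectEqual(max_delay_ms, retryDelayMs(5, random));
}
```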

Testing

I unit tested quite a bit of the new code, and the tests pass. I also tested this locally with and without network failures, logging before and after the delays to confirm the real delay matched the computed one. I also tested against a local endpoint that returns a "Retry-After" header. All works really well.

@joshua-holmes marked this pull request as ready for review October 26, 2025 04:28
@joshua-holmes requested a review from linusg October 26, 2025 04:28