perf(transport): auto-tune stream receive window #1868

Open
wants to merge 74 commits into
base: main

Conversation

mxinden
Collaborator

@mxinden mxinden commented May 2, 2024

Previously the stream send and receive windows had a hard limit of 1 MB. On high-latency and/or high-bandwidth connections (i.e. connections with a large bandwidth-delay product), 1 MB is not enough to saturate the available bandwidth.

Sample scenario:

delay_s = 0.05
window_bits = 1 * 1024 * 1024 * 8
bandwidth_bits_s = window_bits / delay_s
bandwidth_mbits_s = bandwidth_bits_s / 1024 / 1024 # 160.0

In other words, on a 50 ms connection a 1 MB window can at most achieve 160 Mbit/s.
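
The arithmetic above can also be sketched in Rust. This is a minimal illustration and not code from the pull request: the function name `max_throughput_bits_per_sec` is made up for this example, and "MB" is treated as MiB, as in the sample scenario.

```rust
use std::time::Duration;

/// Upper bound on throughput, in bits per second, when at most
/// `window_bytes` may be in flight per round trip of length `rtt`.
/// (Hypothetical helper, for illustration only.)
fn max_throughput_bits_per_sec(window_bytes: u64, rtt: Duration) -> f64 {
    (window_bytes * 8) as f64 / rtt.as_secs_f64()
}

fn main() {
    let rtt = Duration::from_millis(50);
    let one_mib = 1024 * 1024;
    let bits_s = max_throughput_bits_per_sec(one_mib, rtt);
    // Same unit conversion as in the sample scenario (divide by 1024 twice).
    println!("{:.1} Mbit/s", bits_s / 1024.0 / 1024.0);
}
```

With the same inputs as the sample scenario, this yields roughly 160 Mbit/s, which is the ceiling a fixed 1 MB window imposes at 50 ms.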

This commit introduces an auto-tuning algorithm for the stream receive window, increasing the window towards the bandwidth-delay product of the connection.


Fixes #733.

mxinden added 2 commits April 25, 2024 16:43
This commit adds a basic smoke test using the `test-fixture` simulator,
asserting that on a connection with unlimited bandwidth and 50ms round-trip-time
Neqo can eventually achieve > 1 Gbit/s throughput.

Showcases the potential a future stream flow-control auto-tuning algorithm can have.

See mozilla#733.
Previously the stream send and receive window had a hard limit at 1MB. On high
latency and/or high bandwidth connections, 1 MB is not enough to exhaust the
available bandwidth.

Sample scenario:

```
delay_s = 0.05
window_bits = 1 * 1024 * 1024 * 8
bandwidth_bits_s = window_bits / delay_s
bandwidth_mbits_s = bandwidth_bits_s / 1024 / 1024 # 160.0
```

In other words, on a 50 ms connection a 1 MB window can at most achieve 160
Mbit/s.

This commit introduces an auto-tuning algorithm for the stream receive window,
increasing the window towards the bandwidth-delay product of the connection.

github-actions bot commented May 7, 2024

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 922d266.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server


github-actions bot commented May 7, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

mxinden added 17 commits May 14, 2024 15:49
This commit adds a basic smoke test using the `test-fixture` simulator,
asserting the expected bandwidth on a 1 gbit link.

Given mozilla#733, the current expected bandwidth
is limited by the fixed sized stream receive buffer (1MiB).
A `Node` (e.g. a `Client`, `Server` or `TailDrop` router) can be in 3 states:

``` rust
enum NodeState {
    /// The node just produced a datagram.  It should be activated again as soon as possible.
    Active,
    /// The node is waiting.
    Waiting(Instant),
    /// The node became idle.
    Idle,
}
```

`NodeHolder::ready()` determines whether a `Node` is ready to be processed
again. When `NodeState::Waiting`, it should only be ready when `t <= now`, i.e.
the waiting time has passed, not `t >= now`.

``` rust
impl NodeHolder {
    fn ready(&self, now: Instant) -> bool {
        match self.state {
            Active => true,
            Waiting(t) => t <= now, // not >=
            Idle => false,
        }
    }
}
```

The previous behavior led to non-ready `Node`s being processed needlessly and
thus to a long test runtime when e.g. simulating a gbit
connection (mozilla#2203).
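
The corrected comparison can be checked in isolation. The following is a standalone sketch, not the `test-fixture` code itself: `ready` is written as a free function over a copy of the `NodeState` enum shown above, purely for illustration.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum NodeState {
    Active,
    Waiting(Instant),
    Idle,
}

/// Illustrative stand-in for `NodeHolder::ready()`.
fn ready(state: NodeState, now: Instant) -> bool {
    match state {
        NodeState::Active => true,
        // Ready only once the wake-up time `t` has been reached.
        NodeState::Waiting(t) => t <= now,
        NodeState::Idle => false,
    }
}

fn main() {
    let now = Instant::now();
    let later = now + Duration::from_millis(10);

    // Wake-up time still in the future: not ready.
    assert!(!ready(NodeState::Waiting(later), now));
    // Wake-up time reached or passed: ready.
    assert!(ready(NodeState::Waiting(later), later));
    assert!(ready(NodeState::Waiting(now), later));
    // The other two states do not depend on `now`.
    assert!(ready(NodeState::Active, now));
    assert!(!ready(NodeState::Idle, later));
}
```

With the inverted comparison (`t >= now`), the first assertion would fail: a node whose timer has not yet expired would be treated as ready.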

codecov bot commented Dec 29, 2024

Codecov Report

Attention: Patch coverage is 93.65079% with 24 lines in your changes missing coverage. Please review.

Project coverage is 93.34%. Comparing base (922d266) to head (4d70b34).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| neqo-transport/src/fc.rs | 87.76% | 15 Missing and 8 partials ⚠️ |
| neqo-transport/src/connection/mod.rs | 94.11% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1868    +/-   ##
========================================
  Coverage   93.33%   93.34%            
========================================
  Files         114      114            
  Lines       36896    37177   +281     
  Branches    36896    37177   +281     
========================================
+ Hits        34438    34703   +265     
- Misses       1675     1680     +5     
- Partials      783      794    +11     


@mxinden mxinden changed the title from "feat: auto-tune stream receive window" to "perf(transport): auto-tune stream receive window" Dec 31, 2024
pub const SEND_BUFFER_SIZE: usize = 0x10_0000; // 1 MiB
const MAX_SEND_BUFFER_SIZE: usize = 10 * 1024 * 1024;
Collaborator Author

@mxinden mxinden Dec 31, 2024

Previously Neqo would buffer at most 1 MB of send data. Now Neqo buffers up to 10 MB. In other words, it supports a send window of up to 10 MB, depending on the receive window updates from the receiver.

Thus, while this pull request focuses on increasing receive (download) throughput, this patch might also affect send (upload) throughput on connections with a high bandwidth-delay product.

The concrete const value is up for discussion. On a 50 ms connection a 10 MB window can achieve at most 1.6 Gbit/s.

@@ -494,10 +494,10 @@ impl TxBuffer {

/// Attempt to add some or all of the passed-in buffer to the `TxBuffer`.
pub fn send(&mut self, buf: &[u8]) -> usize {
- let can_buffer = min(SEND_BUFFER_SIZE - self.buffered(), buf.len());
+ let can_buffer = min(MAX_SEND_BUFFER_SIZE - self.buffered(), buf.len());
Collaborator Author

Note that while we increase the send buffer up to MAX_SEND_BUFFER_SIZE, it is never shrunk. My rationale:

  • The majority of streams are short-lived. In other words, even if a stream reaches a send buffer of MAX_SEND_BUFFER_SIZE, the buffer is soon deallocated.
  • For long-lived buffers reaching MAX_SEND_BUFFER_SIZE, my assumption is that MAX_SEND_BUFFER_SIZE is chosen conservatively enough that the additional allocation doesn't hurt.
  • Intuitively, any shrinking heuristic likely leads to more memory churn rather than decreasing resident memory.

Thoughts?
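
For readers unfamiliar with `TxBuffer`, the capped, grow-only behavior described above can be sketched with a toy type. `ToyTxBuffer` and its `acked` method are hypothetical and exist only for this illustration; the real `TxBuffer` in Neqo is more involved. Only the `min(MAX_SEND_BUFFER_SIZE - buffered, len)` cap is taken from the diff.

```rust
use std::cmp::min;
use std::collections::VecDeque;

const MAX_SEND_BUFFER_SIZE: usize = 10 * 1024 * 1024;

/// Toy stand-in for a send buffer (hypothetical, not Neqo's `TxBuffer`).
#[derive(Default)]
struct ToyTxBuffer {
    buf: VecDeque<u8>,
}

impl ToyTxBuffer {
    fn buffered(&self) -> usize {
        self.buf.len()
    }

    /// Accept as much of `data` as fits under the cap; return the bytes taken.
    fn send(&mut self, data: &[u8]) -> usize {
        let can_buffer = min(MAX_SEND_BUFFER_SIZE - self.buffered(), data.len());
        self.buf.extend(&data[..can_buffer]);
        can_buffer
    }

    /// Drop up to `n` acknowledged bytes from the front.
    /// `drain` removes elements but keeps the allocated capacity, so this
    /// toy buffer, too, grows on demand and never shrinks.
    fn acked(&mut self, n: usize) {
        let n = min(n, self.buf.len());
        drop(self.buf.drain(..n));
    }
}

fn main() {
    let mut tx = ToyTxBuffer::default();
    let big = vec![0u8; MAX_SEND_BUFFER_SIZE + 1024];

    // Only MAX_SEND_BUFFER_SIZE bytes are accepted; the rest is refused.
    assert_eq!(tx.send(&big), MAX_SEND_BUFFER_SIZE);
    assert_eq!(tx.send(&[1, 2, 3]), 0);

    // Acknowledged data frees room for new data.
    tx.acked(1024);
    assert_eq!(tx.send(&big[..2048]), 1024);
}
```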

Comment on lines +382 to +409
// Auto-tune max_active, i.e. the flow control window.
//
// If the sending rate ( window_bytes used / elapsed ) exceeds the rate
// allowed by the maximum flow control window and the current rtt (
// max_active / rtt ), try to increase the maximum flow control window (
// max_active ).
if let Some(max_allowed_sent_at) = self.max_allowed_sent_at {
let elapsed = now.duration_since(max_allowed_sent_at);
let window_bytes_used = self.max_active - (self.max_allowed - self.retired);

// Same as `elapsed / rtt < window_bytes_used / max_active`
// without floating point division.
if elapsed.as_micros() * u128::from(self.max_active)
< rtt.as_micros() * u128::from(window_bytes_used)
{
let prev_max_active = self.max_active;
// Try doubling the flow control window.
//
// Note that the flow control window should grow at least as
// fast as the congestion control window, in order to not
// unnecessarily limit throughput.
self.max_active = min(2 * self.max_active, MAX_RECV_WINDOW_SIZE);
qdebug!(
"Increasing max stream receive window: previous max_active: {} MiB new max_active: {} MiB last update: {:?} rtt: {rtt:?} stream_id: {}",
prev_max_active / 1024 / 1024, self.max_active / 1024 / 1024, now-self.max_allowed_sent_at.unwrap(), self.subject,
);
}
}
Collaborator Author

Auto-tuning is executed right before sending a window update.

A window update is sent in either of two cases:

  1. When WINDOW_UPDATE_FRACTION is reached, see fn should_send_flowc_update above.
  2. When the remote sends a STREAM_DATA_BLOCKED frame.
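
The tuning decision made at that point can be isolated into a small pure function. The sketch below is an illustration under assumptions, not the code in `fc.rs`: the function name `tune_window` is made up, and the value given to `MAX_RECV_WINDOW_SIZE` here is a placeholder, since the actual constant is not shown in this conversation.

```rust
use std::cmp::min;
use std::time::Duration;

// Placeholder cap for this sketch only; the real value may differ.
const MAX_RECV_WINDOW_SIZE: u64 = 10 * 1024 * 1024;

/// Return the new `max_active` given how many window bytes were consumed
/// during `elapsed`, i.e. since the previous window update was sent.
/// (Hypothetical helper mirroring the condition in the diff.)
fn tune_window(max_active: u64, window_bytes_used: u64, elapsed: Duration, rtt: Duration) -> u64 {
    // Same as `elapsed / rtt < window_bytes_used / max_active`,
    // cross-multiplied to avoid floating point division.
    if elapsed.as_micros() * u128::from(max_active)
        < rtt.as_micros() * u128::from(window_bytes_used)
    {
        // The window was consumed faster than `max_active / rtt` allows.
        min(2 * max_active, MAX_RECV_WINDOW_SIZE)
    } else {
        max_active
    }
}

fn main() {
    let mib = 1024 * 1024;
    let rtt = Duration::from_millis(50);

    // A full 1 MiB window consumed within half an RTT: double it.
    assert_eq!(tune_window(mib, mib, Duration::from_millis(25), rtt), 2 * mib);
    // The same window consumed over two RTTs: leave it alone.
    assert_eq!(tune_window(mib, mib, Duration::from_millis(100), rtt), mib);
    // Doubling never exceeds the cap.
    assert_eq!(
        tune_window(8 * mib, 8 * mib, Duration::from_millis(25), rtt),
        MAX_RECV_WINDOW_SIZE
    );
}
```

Because both sides of the comparison are widened to `u128`, the products cannot overflow for any `u64` window size and any realistic duration.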

Comment on lines +382 to +409
// Auto-tune max_active, i.e. the flow control window.
//
// If the sending rate ( window_bytes used / elapsed ) exceeds the rate
// allowed by the maximum flow control window and the current rtt (
// max_active / rtt ), try to increase the maximum flow control window (
// max_active ).
if let Some(max_allowed_sent_at) = self.max_allowed_sent_at {
let elapsed = now.duration_since(max_allowed_sent_at);
let window_bytes_used = self.max_active - (self.max_allowed - self.retired);

// Same as `elapsed / rtt < window_bytes_used / max_active`
// without floating point division.
if elapsed.as_micros() * u128::from(self.max_active)
< rtt.as_micros() * u128::from(window_bytes_used)
{
let prev_max_active = self.max_active;
// Try doubling the flow control window.
//
// Note that the flow control window should grow at least as
// fast as the congestion control window, in order to not
// unnecessarily limit throughput.
self.max_active = min(2 * self.max_active, MAX_RECV_WINDOW_SIZE);
qdebug!(
"Increasing max stream receive window: previous max_active: {} MiB new max_active: {} MiB last update: {:?} rtt: {rtt:?} stream_id: {}",
prev_max_active / 1024 / 1024, self.max_active / 1024 / 1024, now-self.max_allowed_sent_at.unwrap(), self.subject,
);
}
}
Collaborator Author

Note that this is not the exact algorithm suggested by @martinthomson in #733 (comment).

The algorithm proposed in this pull request adopts Martin's trigger mechanism, namely to increase the window based on the perceived BDP.

> Therefore, I suggest that if the rate at which self.retired increases (that is, the change in that value, divided by the time elapsed) exceeds some function of self.max_active / path.rtt,

It does not adopt the increase mechanism, i.e. to increase by the amount of retired data. Instead, the window is simply doubled.

> then we can increase self.max_active by the amount that self.retired has increased.

The rationale is documented above.

                // Try doubling the flow control window.
                //
                // Note that the flow control window should grow at least as
                // fast as the congestion control window, in order to not
                // unnecessarily limit throughput.

@mxinden mxinden marked this pull request as ready for review January 4, 2025 15:30

Successfully merging this pull request may close these issues.

Better algorithm for stream flow control
1 participant