Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GREASE incompatible with Cloudflare #206

Open
daxpedda opened this issue Jul 14, 2023 · 16 comments
Open

GREASE incompatible with Cloudflare #206

daxpedda opened this issue Jul 14, 2023 · 16 comments
Labels
needs-reproduction Problem cannot be reproduced locally with client and server

Comments

@daxpedda
Copy link
Contributor

I noticed in https://github.com/bluejekyll/trust-dns/pull/1987 that Cloudflare's HTTP/3 implementation seems to be incompatible (or there might be a bug?) with h3.

I made a "minimal" reproducible example here: https://github.com/daxpedda/doh3-test.
But to summarize:
This sends a DoH3 query over POST for www.example.com to Cloudflare's 1.1.1.1 DNS server, which responds with 400 Bad Request and a body with Invalid query..

Couple of observations I already made:

  • Normal HTTP/3 connections to Cloudflare in general work, e.g. I tried connecting to Cloudflare Workers over HTTP/3 and it worked with GREASE.
  • cURL doesn't use GREASE which might be why it worked.
  • Comparing cURL and h3 on Wireshark, they look quite different, so it might not be an issue with GREASE, but some HTTP/3 incompatibility that is triggered by GREASE.
  • This is unlikely to be a DNS issue, other DNS servers with DoH3 support, e.g. Google, don't have an issue here.

I didn't dig deeper into this as I'm not familiar with the HTTP/3 protocol and this I "easily" solved for me by just deactivating the GREASE.
I'm happy to dig deeper into it with some guidance.

Related #205.

@seanmonstar
Copy link
Member

The goal of grease is to prevent ossification of mechanisms that should allow for protocol extension. Cloudflare's implementation does try to support all kinds (that's where I learned of the kinds of grease from in the first place). It could be:

  • We are sending grease wrong.
  • We are doing something else wrong when grease is enabled.
  • (Unlikely, but I guess possible) We found a case where Cloudflare's quiche doesn't handle grease right.

@Ruben2424
Copy link
Contributor

I tried it out and got the same error as you.

In general there are 3 of grease types. One for settings-id, frame type and stream type.

My guess was that the peer received a wrong body, because we got a http response (which says "400 Invlid query") an not a connection closure with a http3 error.
The grease frame type will be sent direct after the data frame with the body. So my thought was that this specific grease type is probably causing the problem.
When I commented out the sending the grease frame type it worked (most of the time).
https://github.com/hyperium/h3/blob/dccb3cdae9d5a9d720fae5f774b53f0bd8a16019/h3/src/connection.rs#L845C15-L847C55

I also cloned and modified the quiche http3 example server a bit to log the received body. When sending the post from your "minimal" reproducible example" i received the same body with and without grease.

Also important to say, i got the error one time when sending no grease frame (with the setting to true but without this https://github.com/hyperium/h3/blob/dccb3cdae9d5a9d720fae5f774b53f0bd8a16019/h3/src/connection.rs#L845C15-L847C55 )

All of this and the fact that the google server worked lets me guess that there is a race condition which get maybe triggered by the way the cloudflare server responds. And which is more likely to happen when sending the grease frame.

But these are just guesses. If I have the time i will look at it tomorrow again to see if i can find out more.

@highlyrecommand
Copy link

highlyrecommand commented Jul 26, 2024

Description

  • This is a very serious issue and it must be fixed. I don't know the underlying cause of this problem.
  • I am building a Rust project for communicating with an HTTP API, using the hyperium/h3 crates.
  • However, when testing communication with the Cloudflare Workers API, even sending a single simple HTTP GET request returns an error HTTP/3.0 500 Internal Server Error, Worker threw exception.
  • The error message observed in Workers Logging is Exception GET, Request with a GET or HEAD method cannot have a body.
  • Initially, I thought it was an issue with Content-Length: 0, but after two days of continuously patching and debugging the code (Content-Length, send_request, StaticTable, ...), I still couldn't solve the problem.
  • Until I found the GREASE incompatible with Cloudflare #206 issue. After commenting out the if self.send_grease_frame { ... } block in h3/h3/src/connection.rs, or send_grease(false) the issue was resolved.
  • This issue is triggered even when using h3/examples/client.rs.
  • Thanks @daxpedda, @Ruben2424, @seanmonstar, ..., I hope this issue will eventually be resolved.

Patch h3/h3/src/connection.rs

    #[allow(missing_docs)]
    #[cfg_attr(feature = "tracing", instrument(skip_all, level = "trace"))]
    pub async fn finish(&mut self) -> Result<(), Error> {
        // ===== Problem =====
        if self.send_grease_frame {
            // send a grease frame once per Connection
            stream::write(&mut self.stream, Frame::Grease)
                .await
                .map_err(|e| self.maybe_conn_err(e))?;
            self.send_grease_frame = false;
        }
        // ===== Problem =====

        future::poll_fn(|cx| self.stream.poll_finish(cx))
            .await
            .map_err(|e| self.maybe_conn_err(e))
    }
}

Or send_grease(false)

    let (mut driver, mut send_request) : (h3::client::Connection<h3_quinn::Connection, bytes::Bytes>, h3::client::SendRequest<h3_quinn::OpenStreams, bytes::Bytes>) = h3::client::builder()
      .send_grease(false)
      .build(quinn_conn).await?;

Client tracing h3/examples/client.rs

2024-07-26T15:14:20.171764Z  INFO project: QUIC connection established
2024-07-26T15:14:20.172047Z  INFO project: sending request ...
2024-07-26T15:14:20.172294Z  INFO project: receiving response ...
2024-07-26T15:14:20.373937Z  INFO project: response: HTTP/3.0 500 Internal Server Error
2024-07-26T15:14:20.373971Z  INFO project: headers: {
    "date": "Fri, 26 Jul 2024 15:14:20 GMT",
    "content-type": "text/html; charset=UTF-8",
    "content-length": "4320",
    "report-to": "{\"endpoints\":[{\"url\":\"xxxxx"}],\"group\":\"cf-nel\",\"max_age\":604800}",
    "nel": "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}",
    "x-frame-options": "SAMEORIGIN",
    "referrer-policy": "same-origin",
    "cache-control": "private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0",
    "expires": "Thu, 01 Jan 1970 00:00:01 GMT",
    "server": "cloudflare",
    "cf-ray": "xxxxxxxx-LHR",
    "alt-svc": "h3=\":443\"; ma=86400",
}
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Worker threw exception | xx.xx.xx | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
<style>body{margin:0;padding:0}</style>


<!--[if gte IE 10]><!-->
<script>
  if (!navigator.cookieEnabled) {
    window.addEventListener('DOMContentLoaded', function () {
      var cookieEl = document.getElementById('cookie-alert');
      cookieEl.style.display = 'block';
    })
  }
</script>
<!--<![endif]-->


</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1>
          <span class="cf-error-type" data-translate="error">Error</span>
          <span class="cf-error-code">1101</span>
          <small class="heading-ray-id">Ray ID: xxxxxxx &bull; 2024-07-26 15:14:20 UTC</small>
        </h1>
        <h2 class="cf-subheadline" data-translate="error_desc">Worker threw exception</h2>
      </div><!-- /.header -->

      <section></section><!-- spacer -->

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?</h2>
            <p>You've requested a page on a website (xxxxxxx) that is on the <a href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a> network. An unknown error occurred while rendering the page.</p>
          </div>


          <div class="cf-column">
            <h2 data-translate="what_can_i_do">What can I do?</h2>
            <p><strong>If you are the owner of this website:</strong><br />you should <a href="https://www.cloudflare.com/login?utm_source=error_100x" target="_blank">login to Cloudflare</a> and check the error logs for xxxxxx.</p>
          </div>

        </div>
      </div><!-- /.section -->

      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
  <p class="text-13">
    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">xxxxxx</strong></span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span id="cf-footer-item-ip" class="cf-footer-item hidden sm:block sm:mb-1">
      Your IP:
      <button type="button" id="cf-footer-ip-reveal" class="cf-footer-ip-reveal-btn">Click to reveal</button>
      <span class="hidden" id="cf-footer-ip">x.x.x.x</span>
      <span class="cf-footer-separator sm:hidden">&bull;</span>
    </span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>

  </p>
  <script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
</div><!-- /.error-footer -->


    </div><!-- /#cf-error-details -->
  </div><!-- /#cf-wrapper -->

  <script>
  window._cf_translation = {};


</script>

</body>
</html>

Cloudflare Workers Logging

{
  "truncated": false,
  "outcome": "exception",
  "scriptVersion": {
    "id": "1dabed6e-929f-4a61-9bd5-2b3f1b8ae14c"
  },
  "scriptName": "xxx",
  "diagnosticsChannelEvents": [],
  "exceptions": [
    {
      "stack": "    at worker.js:7:29",
      "name": "TypeError",
      "message": "Request with a GET or HEAD method cannot have a body.",
      "timestamp": 1722006860556
    }
  ],
  "logs": [],
  "eventTimestamp": 1722006860556,
  "event": {
    "request": {
      "url": "https://xx.xx/",
      "method": "GET",
      "headers": {
        "accept-encoding": "gzip, br",
        "cf-connecting-ip": "xx.xx.xx.xx",
        "cf-ipcountry": "xx",
        "cf-ray": "xxx",
        "cf-visitor": "{\"scheme\":\"https\"}",
        "connection": "Keep-Alive",
        "content-length": "0",
        "host": "xx.xx.xx",
        "x-forwarded-proto": "https",
        "x-real-ip": "x.x.x.x"
      },
      "cf": {
        "longitude": "xxx",
        "latitude": "xx",
        "tlsCipher": "AEAD-AES256-GCM-SHA384",
        "continent": "xx",
        "asn": 0,
        "country": "xx",
        "tlsClientAuth": {
          "certIssuerDNLegacy": "",
          "certIssuerSKI": "",
          "certSubjectDNRFC2253": "",
          "certSubjectDNLegacy": "",
          "certFingerprintSHA256": "",
          "certNotBefore": "",
          "certSKI": "",
          "certSerial": "",
          "certIssuerDN": "",
          "certVerified": "NONE",
          "certNotAfter": "",
          "certSubjectDN": "",
          "certPresented": "0",
          "certRevoked": "0",
          "certIssuerSerial": "",
          "certIssuerDNRFC2253": "",
          "certFingerprintSHA1": ""
        },
        "verifiedBotCategory": "",
        "tlsExportedAuthenticator": {
          "clientFinished": "d1eeefbefffef6e8e5c7962bf697840a4ef91ede655a9f2993850797e336dd8889ba64d0a1059fcb6b39b27a1369ded3",
          "clientHandshake": "79d5b944d92b336f99c60d4c4c39495df8e5d1e941c66ed774f7009386d38b9e1bb8f72d1ef400a55268a0f66271046b",
          "serverHandshake": "3b4203f77ad3d8e0b9d744d5c30ef4103ccb0d1327709bde64258e18440556a95004621d51d314732323ff5fb0b98411",
          "serverFinished": "91de481f75125d683fbcdb68deb620fcbadd01658ccb9a997cdd60f345759e5fc1727929192084a24e7614a7e357c3e3"
        },
        "tlsVersion": "TLSv1.3",
        "colo": "LHR",
        "timezone": "xxx",
        "tlsClientHelloLength": "287",
        "edgeRequestKeepAliveStatus": 1,
        "requestPriority": "",
        "tlsClientExtensionsSha1": "QmsjtEGp7DwafxfPx22mGvomqw0=",
        "region": "xxx",
        "city": "xxx",
        "regionCode": "xx",
        "asOrganization": "xxx",
        "tlsClientRandom": "kg6ZcIEjyezNbzKnorcoEBXZRL0XuJZ+hBD2cjUKYfo=",
        "httpProtocol": "HTTP/3"
      }
    },
    "response": {
      "status": 500
    }
  },
  "id": 0
}

Cargo.toml

[package]
name = "project"
version = "0.1.0"
edition = "2021"

[patch.crates-io]
h3 = { path = "./patchs/h3/h3" }
h3-quinn = { path = "./patchs/h3/h3-quinn" }
quinn = { path = "./patchs/quinn/quinn" }
quinn-proto = { path = "./patchs/quinn/quinn-proto" }
rustls = { path = "./patchs/rustls/rustls"}

[dependencies]
bytes = { version = "1.6.0" }
futures = "0.3.30"
h3 = { version = "0.0.6", features = ["tracing"] }
h3-quinn = { version = "0.0.7", features = ["tracing"] }
http = { version = "1.1.0" }
quinn = { version = "0.11.2", default-features = false, features = ["runtime-tokio", "rustls", "ring"] }
quinn-proto = "0.11.3"
rustls = { version = "0.23.10", default-features = false, features = ["logging", "ring", "std"] }
rustls-native-certs = "0.7.1"
structopt = { version = "0.3.26" }
tokio = { version = "1.38.0", features = ["full"] }
tracing = { version = "0.1.40" }
tracing-subscriber = { version = "0.3.18", features = ["fmt", "ansi", "env-filter", "time", "tracing-log"] }

@highlyrecommand
Copy link

highlyrecommand commented Jul 27, 2024

Update

@Ruben2424 In some complex use cases, failures sometimes occur even if send_grease is false (HTTP/3.0 500 Internal Server Error, Worker threw exception, Exception GET, Request with a GET or HEAD method cannot have a body.)

Good times and bad (Sometimes it works, sometimes it fails), maybe there are data races issues.
Under research... 🧐

Complex cases

use std::{net::{IpAddr, Ipv4Addr, SocketAddr}, path::PathBuf, sync::Arc, time::Duration};
use bytes::{Buf, Bytes, BytesMut};
use http::{header::HOST, uri::PathAndQuery, HeaderValue, Request, StatusCode, Uri};
use rustls::pki_types::{CertificateDer, PrivateKeyDer};
use structopt::StructOpt;
use tokio::{fs::File, io::AsyncReadExt, sync::Mutex, time::sleep};
use tracing::{error, info, trace_span, warn};
use h3::{error::ErrorLevel, quic::BidiStream, server::RequestStream};
use h3_quinn::quinn::{self, crypto::rustls::QuicServerConfig};

#[derive(StructOpt, Debug)]
#[structopt(name = "server")]
struct Opt {
  #[structopt(
    name = "dir",
    short,
    long,
    help = "Root directory of the files to serve. \
        If omitted, server will respond OK."
  )]
  pub root: Option<PathBuf>,

  #[structopt(
    short,
    long,
    default_value = "[::1]:4433",
    help = "What address:port to listen for new connections"
  )]
  pub listen: SocketAddr,

  #[structopt(flatten)]
  pub certs: Certs,

  #[structopt(
    long,
    default_value = "https://example.com",
    help = "Uri"
  )]
  pub uri: String,
}

#[derive(StructOpt, Debug)]
pub struct Certs {
  #[structopt(
    long,
    short,
    default_value = "examples/server.cert",
    help = "Certificate for TLS. If present, `--key` is mandatory."
  )]
  pub cert: PathBuf,

  #[structopt(
    long,
    short,
    default_value = "examples/server.key",
    help = "Private key for the certificate."
  )]
  pub key: PathBuf,
}

static ALPN: &[u8] = b"h3";


async fn create_connection(uri: &str) -> Result<
  (
    h3::client::Connection<h3_quinn::Connection, bytes::Bytes>,
    h3::client::SendRequest<h3_quinn::OpenStreams, bytes::Bytes>
  ),
  Box<dyn std::error::Error>>
{
  let base_uri = uri.parse::<http::Uri>()?;

  if base_uri.scheme() != Some(&http::uri::Scheme::HTTPS) {
    return Err("uri scheme must be 'https'".into());
  }

  let auth = base_uri.authority().ok_or("uri must have a host")?.clone();
  let port = auth.port_u16().unwrap_or(443);

  let addr = tokio::net::lookup_host((auth.host(), port))
    .await?
    .next()
    .ok_or("dns found no addresses")?;

  info!("DNS lookup for {:?}: {:?}", base_uri, addr);

  // create quinn client endpoint
  // load CA certificates stored in the system
  let mut roots = rustls::RootCertStore::empty();
  match rustls_native_certs::load_native_certs() {
    Err(e) => error!("couldn't load any default trust roots: {}", e),
    Ok(certs) => {
      for cert in certs {
        if let Err(e) = roots.add(cert) {
          error!("failed to parse trust anchor: {}", e);
        }
      }
    }
  };

  let mut tls_config = rustls::ClientConfig::builder()
    .with_root_certificates(roots)
    .with_no_client_auth();

  tls_config.enable_early_data = true;
  tls_config.alpn_protocols = vec![ALPN.into()];

  let mut client_endpoint = h3_quinn::quinn::Endpoint::client("[::]:0".parse().unwrap())?;
  let client_config = quinn::ClientConfig::new(Arc::new(
    quinn::crypto::rustls::QuicClientConfig::try_from(tls_config)?,
  ));
  client_endpoint.set_default_client_config(client_config);
  let conn = client_endpoint.connect(addr, auth.host())?.await?;

  info!("QUIC connection established");

  // create h3 client
  let quinn_conn = h3_quinn::Connection::new(conn);

  let part_h3_client : Result<(
    h3::client::Connection<h3_quinn::Connection, bytes::Bytes>,
    h3::client::SendRequest<h3_quinn::OpenStreams, bytes::Bytes>
  ), h3::Error> = h3::client::builder()
    .send_grease(false)
    .build(quinn_conn).await;

  match part_h3_client {
    Ok(value) => Ok(value),
    Err(error) => Err(Box::new(error)),
  }
}

async fn create_endpoint(opt : &Opt) -> Result<h3_quinn::Endpoint, Box<dyn std::error::Error>>
{
  let result_certificate_private_key_text = utils::get_certificate_private_key_text(
    String::from("root_certificate.crt"),
    String::from("root_private_key.pem")
  );

  let (root_certificate, root_private_key) = match result_certificate_private_key_text {
      Ok((root_certificate, root_private_key)) => (root_certificate, root_private_key),
      Err(error) => return Err(error.into()),
  };

  let result_rcgen_certificate_private_key = utils::get_rcgen_certificate_private_key(
    &root_certificate,
    &root_private_key
  );

  let (root_certificate, root_private_key) = match result_rcgen_certificate_private_key {
    Ok((root_certificate, root_private_key)) => (root_certificate, root_private_key),
    Err(error) => return Err(error.into()),
  };

  // create quinn server endpoint and bind UDP socket
  // both cert and key must be DER-encoded
  let Certs { cert, key } = &opt.certs;
  let cert = CertificateDer::from(std::fs::read(cert)?);
  let key = PrivateKeyDer::try_from(std::fs::read(key)?)?;

  let mut tls_config = rustls::ServerConfig::builder()
    .with_no_client_auth()
    .with_single_cert(vec![cert], key)?;

  tls_config.max_early_data_size = u32::MAX;
  tls_config.alpn_protocols = vec![ALPN.into()];

  let server_config = quinn::ServerConfig::with_crypto(
    Arc::new(
      QuicServerConfig::try_from(tls_config)?
    )
  );

  let result_endpoint = quinn::Endpoint::server(
    server_config,
    opt.listen,
  );

  match result_endpoint {
    Ok(value) => Ok(value),
    Err(error) => Err(Box::new(error))
  }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
  tracing_subscriber::fmt()
    .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
    .with_span_events(tracing_subscriber::fmt::format::FmtSpan::FULL)
    .with_writer(std::io::stderr)
    .with_max_level(tracing::Level::TRACE)
    .with_line_number(true)
    .with_level(true)
    .init();

  // process cli arguments
  let opt = Opt::from_args();

  let root = if let Some(root) = &opt.root {
    if !root.is_dir() {
      return Err(format!("{}: is not a readable directory", root.display()).into());
    } else {
      info!("serving {}", root.display());
      Arc::new(Some(root.clone()))
    }
  } else {
    Arc::new(None)
  };

  info!("listening on {}", opt.listen);
  let endpoint = create_endpoint(&opt).await?;

  // handle incoming connections and requests
  while let Some(new_conn) = endpoint.accept().await {
    let root = root.clone();
    trace_span!("New connection being attempted");

    let (driver, send_request) = create_connection(&opt.uri).await?; // <----------
    let _driver = Arc::new(driver);

    tokio::spawn(async move {
      let conn = match new_conn.await {
        Ok(value) => value,
        Err(error) => {
          error!("accepting connection failed: {:?}", error);
          return
        },
      };

      info!("new connection established");

      let mut h3_conn = match h3::server::Connection::new(h3_quinn::Connection::new(conn)).await {
        Ok(value) => value,
        Err(error) => {
          error!("{:?}", &error);
          return
        }
      };

      loop {
        let (http_request, stream) = match h3_conn.accept().await {
          Ok(Some((req, stream))) => (req, stream),
          // indicating no more streams to be received
          Ok(None) => {
            break;
          },

          Err(err) => {
            error!("error on accept {}", err);
            match err.get_error_level() {
              ErrorLevel::ConnectionError => break,
              ErrorLevel::StreamError => continue,
            }
          },
        };

        let arc_driver = Arc::clone(&_driver);
        let cloned_send_request = send_request.clone(); // <----------
        let root = root.clone();
        info!("new request: {:#?}", http_request);

        tokio::spawn(async {
          let result = handle_request(
            arc_driver,
            cloned_send_request, // <----------
            http_request,
            stream, root
          );

          if let Err(err) = result.await {
            error!("handling request failed: {}", err);
          }
        });
      }
    });
  }

  // shut down gracefully
  // wait for connections to be closed before exiting
  endpoint.wait_idle().await;

  Ok(())
}

async fn handle_request<T>(
  arc_driver: Arc<h3::client::Connection<h3_quinn::Connection, bytes::Bytes>>,
  mut send_request: h3::client::SendRequest<h3_quinn::OpenStreams, bytes::Bytes>,
  mut http_request: Request<()>,
  mut stream: RequestStream<T, Bytes>,
  serve_root: Arc<Option<PathBuf>>,
) -> Result<(), Box<dyn std::error::Error>>
where
  T: BidiStream<Bytes>,
{
  let new_http_request = http::request::Request::builder()
    // .method("GET")
    .uri("https://xxx.worker.dev")
    // .header("Accept", "*/*")
    .body(())
    .unwrap();

  let mut _send_request = match send_request.send_request(new_http_request).await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLC: {:?}", error);
      return Err(Box::new(error))
    },
  };
  // sleep(Duration::from_secs(2));

  match _send_request.finish().await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLA: {:?}", error);
      return Err(Box::new(error))
    },
  };

  // ============================== //
  // ============================== //
  // ============================== //
  // ============================== //
  // ============================== //

  let response = match _send_request.recv_response().await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLT: {:?}", error);
      return Err(Box::new(error))
    },
  };

  if let Err(error) = stream.send_response(response).await {
    warn!("CLT: {:?}", error);
    return Err(Box::new(error))
  };

  loop {
    match _send_request.recv_data().await {
      Ok(None) => { break },
      Ok(Some(mut value)) => {
        warn!("YKYP: {:?}", value.remaining());
        let payload = value.copy_to_bytes(value.remaining());

        if let Err(error) = stream.send_data(payload).await {
          error!("{:?}", &error);
          return Err(Box::new(error))
        }
      },
      Err(error) => {
        error!("{:?}", &error);
        return Err(Box::new(error))
      }
    }
  }

  match _send_request.recv_trailers().await {
    Ok(None) => {},
    Ok(Some(trailers)) => {
      if let Err(error) = stream.send_trailers(trailers).await {
        error!("{:?}", &error);
        return Err(Box::new(error))
      }
    },
    Err(error) => {
      error!("{:?}", &error);
      return Err(Box::new(error))
    }
  }

  match stream.finish().await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLA: {:?}", error);
      return Err(Box::new(error))
    },
  };

  Ok(())
}

@Ruben2424
Copy link
Contributor

Update

@Ruben2424 In some complex use cases, failures sometimes occur even if send_grease is false (HTTP/3.0 500 Internal Server Error, Worker threw exception, Exception GET, Request with a GET or HEAD method cannot have a body.)

Good times and bad (Sometimes it works, sometimes it fails), maybe there are data races issues. Under research... 🧐

Thanks @higlyrecommand for taking a look. What exactly is the "Complex cases" example code you provided? Did you manage to reproduce it with the h3 server?

I hope to find time this weekend to take another look.

@highlyrecommand
Copy link

highlyrecommand commented Aug 3, 2024

Update

@Ruben2424 Thanks for your reply, i can reproduce it, I thought this repository was no longer maintained. :)

I am currently implementing an HTTP/3 middlebox. The principle is to redirect traffic to 127.0.0.1 via DNS while listening with a local H3 server using hyperium/h3 on 127.0.0.1:443. This is achieved by patching quinn_proto to issue self-signed certificates and read SNI field, and simultaneously establish HTTP/3 connections to upstream Cloudflare Workers.

By the way SSL_KEY_LOG_file doesn't seem to work and Wireshark can't decrypt it, which prevents me from seeing the specific content.

CLIENT_HANDSHAKE_TRAFFIC_SECRET 7b1c88f322b984be8c9c4e57f1b0233497c604bae568f82f56991ed98e7b61ad 92294830643684a986ff29584dc5cdb483dab775181ea6946d7c39d3d64bea19
SERVER_HANDSHAKE_TRAFFIC_SECRET 7b1c88f322b984be8c9c4e57f1b0233497c604bae568f82f56991ed98e7b61ad 0e48806ff42cb926cb298a09ce75333b8fbab442f81f2993135e2fe65e2c2471
CLIENT_TRAFFIC_SECRET_0 7b1c88f322b984be8c9c4e57f1b0233497c604bae568f82f56991ed98e7b61ad 9fa36ebee3422d765f77a92ccd21f442436e0e582064150c589774a10d3b6164
SERVER_TRAFFIC_SECRET_0 7b1c88f322b984be8c9c4e57f1b0233497c604bae568f82f56991ed98e7b61ad 7f46e659721f49f12669afd97f90b020e18d0f86f249657d42580c7c8c90dad5
EXPORTER_SECRET 7b1c88f322b984be8c9c4e57f1b0233497c604bae568f82f56991ed98e7b61ad 3126dc4ecbe350f81d0374c7249db11eda6244c55ed2f757f72d03acd59db9dd
CLIENT_HANDSHAKE_TRAFFIC_SECRET 7a3886ad2c4623d3a7c5a79b12af475d3c069d4afeb7db76788e387441e3439b 9d253105913950083021e49303e90b0d53bc831f7676afc5e5876ed42a38a3ea
SERVER_HANDSHAKE_TRAFFIC_SECRET 7a3886ad2c4623d3a7c5a79b12af475d3c069d4afeb7db76788e387441e3439b 8a50896b8d4c9e52771525fa56ba30dc8c2c285843a3ebd87cd783eee56415b3
CLIENT_TRAFFIC_SECRET_0 7a3886ad2c4623d3a7c5a79b12af475d3c069d4afeb7db76788e387441e3439b 8e9a5c0fcc24da50f4c63cb26b9965728b3f6a2f12841a0321e1d8b441cf715e
SERVER_TRAFFIC_SECRET_0 7a3886ad2c4623d3a7c5a79b12af475d3c069d4afeb7db76788e387441e3439b 5a6cf3102c79e19401491e4cd890f0895381d1bd2519c02388812f9d59ff5da2
EXPORTER_SECRET 7a3886ad2c4623d3a7c5a79b12af475d3c069d4afeb7db76788e387441e3439b c635c35c0aeb2729fbca203237bb47bb16c93b781cb0388df386c2447466179d
...

I.e.

Browser DNS <=> LocalDNS (E.g. google.com => 127.0.0.1)
hyperium/h3 Listen UDP Port 127.0.0.1:443
Browser H3 <=> hyperium/h3 <=> Cloudflare Workers

...

Currently, the tracing TRACE logs are extremely verbose, with a single HTTP request generating around 1000-2000 lines. This makes it difficult for me to perform simple diffs. I've collected two log samples: one using a hack method (sleep(Duration::from_nanos(1)).await / tokio::task::yield_now().await) that returns a 200 Success, and another without the hack that returns a 500 Error. I am quite certain this is not my error but rather some data race condition in hyperium/h3.

After another day of studying the logs, I've finally made some progress. It seems that during the connection process for the 500 Error, the logs show two pieces of information:

quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=0 len=49 fin=false
quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=49 len=0 fin=true

For a 200 OK, there is only one piece of information:

quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=0 len=49 fin=true

The data begins with [96, 1] and is a total of 22 bytes (e.g., [96, 1, 61, 199, 121, 211, 71, 158, 162, 163, 63, 39, 121, 165, 71, 180, 119, 35, 24, 37, 64, 3]).

@Ruben2424 I apologize for my limited expertise; I've done my best. I thought this project was no longer maintained and was about to give up, but then I saw your reply :) (Currently, I am using tokio::task::yield_now().await as a temporary fix).

Pseudo-code:

info!("listening on {}", opt.listen);
let endpoint = create_endpoint(&opt).await?;

// handle incoming connections and requests
while Endpoint.accept {
    tokio::spawn(async move {
        h3_conn = match Incoming.await {
            Ok(Connection) => h3::server::Connection::new(h3_quinn::Connection::new(value)),
            Err => ...
        }
        
        // `sleep` or `yield_now` added before create_connection has no effect
        let (endpoint, mut driver, send_request) = match create_connection("xx.com").await {
           Err => ...
        }
        // sleep(Duration::from_nanos(1)).await; // Add sleep to avoid triggering HTTP 500 Error
        // tokio::task::yield_now().await; // Or add this to avoid triggering HTTP 500 Error
        
        tokio::spawn(async move {
            tokio::task::yield_now().await; // Or add this to avoid triggering HTTP 500 Error
            warn!("KAKAC: Request");
            test_request(&mut send_request).await;
            warn!("KAKAD: End");
        });

        let _ = driver.wait_idle().await;
        client_endpoint.wait_idle().await;
    })
}

// shut down gracefully
// wait for connections to be closed before exiting
endpoint.wait_idle().await;

Ok(())

// ==================================== //

async fn test_request(
  send_request: &mut h3::client::SendRequest<h3_quinn::OpenStreams, bytes::Bytes>,
) -> Result<(), Box<dyn std::error::Error>>
{
  let new_http_request = http::request::Request::builder()
    // .method("GET")
    .uri("https://xxx.com")
    // .header("Accept", "*/*")
    .header("Mirror-Host", "example.com")
    .body(())
    .unwrap();

  let mut _send_request = match send_request.send_request(new_http_request).await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLC: {:?}", error);
      return Err(Box::new(error))
    },
  };

  match _send_request.finish().await {
    Ok(value) => value,
    Err(error) => {
      warn!("CLA: {:?}", error);
      return Err(Box::new(error))
    },
  };

  if let Ok(response) = _send_request.recv_response().await {
    println!("{:?}, {:?}", response.headers(), response.status())
  };

  Ok(())
}

Tracing Log (I tried my best to reduce the difference)

The source of the problem? [96, 1] comes from here

* patchs/quinn/quinn-proto/src/connection/mod.rs
[96, 1, ...] #L813-914
 warn!("YAMAHE: {:?}, {:?}, {:?}, {:?}", buf, now, space_id, ack_eliciting);
 2024-08-01T03:07:32.729618Z  WARN drive{id=0}:send{space=Data pn=3}: quinn_proto::connection: 813: YAMAHE: [96, 1, 141, 205, 80, 21, 225, 94, 204, 195, 143, 45, 80, 53, 225, 115, 130, 136, 252, 240, 143, 3], Instant { tv_sec: 45924, tv_nsec: 440011552 }, Data, true


* Maybe it has nothing to do with this.
Connecting::new
let driver = ConnectionDriver(conn.clone());
  ConnectionDriver.poll
  let mut keep_going = conn.drive_transmit(cx)?;
    State.drive_transmit
    self.inner.poll_transmit(now, max_datagrams, &mut self.send_buffer)
      poll_transmit.
      ...?


Error:
 WARN drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::streams::state: 545: HALLOWEEN: [96, 1, 61, 199, 121, 211, 71, 158, 162, 163, 63, 39, 121, 165, 71, 180, 119, 35, 24, 37, 64, 3]
 WARN drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::streams::state: 545: YALLOWEEN: [96, 1, 61, 199, 121, 211, 71, 158, 162, 163, 63, 39, 121, 165, 71, 180, 119, 35, 24, 37, 64, 4]

Ok:
 WARN drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::streams::state: 545: HALLOWEEN: [96, 1, 227, 88, 50, 98, 142, 253, 89, 50, 224, 65, 50, 95, 142, 229, 145, 180, 109, 206, 5, 3]


Err 500:
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=0 len=49 fin=false
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 90 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter
TRACE drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=49 len=0 fin=true
TRACE drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=4}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 42 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=




OK 200:
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::streams::state: 543: STREAM id=client bidirectional stream 0 off=0 len=49 fin=true
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=3}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 90 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter
?
TRACE drive{id=0}:send{space=Data pn=5}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=5}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=5}: quinn_proto::connection: 3279: ACK ArrayRangeSet([0..2]), Delay = 1514us
TRACE drive{id=0}:send{space=Data pn=5}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=5}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 44 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=



=======================



Err 500:
TRACE wait_idle: h3::client::connection: 361: enter
TRACE wait_idle:poll_close: h3::client::connection: 367: new
TRACE wait_idle:poll_close: h3::client::connection: 367: enter
DEBUG drive{id=0}: quinn::connection: 246: new
TRACE wait_idle:poll_close:poll_control: h3::connection: 436: new
TRACE wait_idle:poll_close:poll_control: h3::connection: 436: enter
DEBUG drive{id=0}: quinn::connection: 246: enter
TRACE wait_idle:poll_close:poll_control:poll_accept_recv: h3::connection: 347: new
TRACE wait_idle:poll_close:poll_control:poll_accept_recv: h3::connection: 347: enter
TRACE drive{id=0}: quinn_proto::connection: 2163: got Data packet (32 bytes) from [::ffff:172.67.209.69]:443 using id b90e7e2085c74c2a
TRACE wait_idle:poll_close:poll_control:poll_accept_recv:poll_accept_recv: h3_quinn: 179: new
TRACE wait_idle:poll_close:poll_control:poll_accept_recv:poll_accept_recv: h3_quinn: 179: enter


Ok 200:
TRACE wait_idle: h3::client::connection: 361: enter
TRACE wait_idle:poll_close: h3::client::connection: 367: new
TRACE wait_idle:poll_close: h3::client::connection: 367: enter
TRACE wait_idle:poll_close:poll_control: h3::connection: 436: new
TRACE wait_idle:poll_close:poll_control: h3::connection: 436: enter
TRACE wait_idle:poll_close:poll_control:poll_accept_recv: h3::connection: 347: new
TRACE wait_idle:poll_close:poll_control:poll_accept_recv: h3::connection: 347: enter
TRACE wait_idle:poll_close:poll_control:poll_accept_recv:poll_accept_recv: h3_quinn: 179: new
TRACE wait_idle:poll_close:poll_control:poll_accept_recv:poll_accept_recv: h3_quinn: 179: enter




==========================



Err 500:
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 234: PADDING * 1350
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1004: writing MTUD probe probe_size=1389
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 1389 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter




Ok 200:
TRACE drive{id=0}:send{space=Data pn=6}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=6}: quinn_proto::connection::packet_builder: 94: enter
TRACE drive{id=0}:send{space=Data pn=6}: quinn_proto::connection::packet_builder: 234: PADDING * 1350
TRACE drive{id=0}:send{space=Data pn=6}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=6}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1004: writing MTUD probe probe_size=1389
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 1389 bytes in 1 datagrams





==========================



No


Ok 200:
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter
TRACE drive{id=0}:send{space=Data pn=7}: quinn_proto::connection::packet_builder: 234: PADDING * 1381
TRACE drive{id=0}:send{space=Data pn=7}: quinn_proto::connection::packet_builder: 94: exit
TRACE drive{id=0}:send{space=Data pn=7}: quinn_proto::connection::packet_builder: 94: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 1004: writing MTUD probe probe_size=1420
TRACE drive{id=0}: quinn_proto::connection: 1011: sending 1420 bytes in 1 datagrams
DEBUG drive{id=0}: quinn::connection: 246: exit
DEBUG drive{id=0}: quinn::connection: 246: close time.busy= time.idle=
DEBUG drive{id=0}: quinn::connection: 246: new
DEBUG drive{id=0}: quinn::connection: 246: enter
TRACE drive{id=0}: quinn_proto::connection: 2163: got Data packet (1200 bytes) from [::ffff:172.67.209.69]:443 using id 27d86c4639481967
TRACE drive{id=0}:recv{space=Data pn=10}: quinn_proto::connection: 2213: new
TRACE drive{id=0}:recv{space=Data pn=10}: quinn_proto::connection: 2213: enter
TRACE drive{id=0}:recv{space=Data pn=10}:frame{ty=STREAM}: quinn_proto::connection: 2636: new
TRACE drive{id=0}:recv{space=Data pn=10}: quinn_proto::connection: 2647: got stream frame id=client bidirectional stream 0 offset=0 len=1169 fin=false
TRACE drive{id=0}:recv{space=Data pn=10}:frame{ty=STREAM}: quinn_proto::connection: 2636: enter
TRACE drive{id=0}:recv{space=Data pn=10}:frame{ty=STREAM}: quinn_proto::connection: 2636: exit
TRACE drive{id=0}:recv{space=Data pn=10}:frame{ty=STREAM}: quinn_proto::connection: 2636: close time.busy= time.idle=
TRACE drive{id=0}:recv{space=Data pn=10}: quinn_proto::connection: 2213: exit
TRACE drive{id=0}:recv{space=Data pn=10}: quinn_proto::connection: 2213: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 2163: got Data packet (649 bytes) from [::ffff:172.67.209.69]:443 using id 27d86c4639481967
TRACE drive{id=0}:recv{space=Data pn=11}: quinn_proto::connection: 2213: new
TRACE drive{id=0}:recv{space=Data pn=11}: quinn_proto::connection: 2213: enter
TRACE drive{id=0}:recv{space=Data pn=11}:frame{ty=STREAM}: quinn_proto::connection: 2636: new
TRACE drive{id=0}:recv{space=Data pn=11}: quinn_proto::connection: 2647: got stream frame id=client bidirectional stream 0 offset=1169 len=617 fin=false
TRACE drive{id=0}:recv{space=Data pn=11}:frame{ty=STREAM}: quinn_proto::connection: 2636: enter
TRACE drive{id=0}:recv{space=Data pn=11}:frame{ty=STREAM}: quinn_proto::connection: 2636: exit
TRACE drive{id=0}:recv{space=Data pn=11}:frame{ty=STREAM}: quinn_proto::connection: 2636: close time.busy= time.idle=
TRACE drive{id=0}:recv{space=Data pn=11}: quinn_proto::connection: 2213: exit
TRACE drive{id=0}:recv{space=Data pn=11}: quinn_proto::connection: 2213: close time.busy= time.idle=
TRACE drive{id=0}: quinn_proto::connection: 2163: got Data packet (34 bytes) from [::ffff:172.67.209.69]:443 using id 27d86c4639481967
TRACE drive{id=0}:recv{space=Data pn=12}: quinn_proto::connection: 2213: new
TRACE drive{id=0}:recv{space=Data pn=12}: quinn_proto::connection: 2213: enter
TRACE drive{id=0}:recv{space=Data pn=12}:frame{ty=STREAM}: quinn_proto::connection: 2636: new
TRACE drive{id=0}:recv{space=Data pn=12}: quinn_proto::connection: 2647: got stream frame id=client bidirectional stream 0 offset=1786 len=2 fin=true
TRACE drive{id=0}:recv{space=Data pn=12}:frame{ty=STREAM}: quinn_proto::connection: 2636: enter
TRACE drive{id=0}:recv{space=Data pn=12}:frame{ty=STREAM}: quinn_proto::connection: 2636: exit
TRACE drive{id=0}:recv{space=Data pn=12}:frame{ty=STREAM}: quinn_proto::connection: 2636: close time.busy= time.idle=
TRACE drive{id=0}:recv{space=Data pn=12}: quinn_proto::connection: 2213: exit
TRACE drive{id=0}:recv{space=Data pn=12}: quinn_proto::connection: 2213: close time.busy= time.idle=
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: new
TRACE drive{id=0}:send{space=Data pn=8}: quinn_proto::connection::packet_builder: 94: enter

@highlyrecommand
Copy link

highlyrecommand commented Aug 3, 2024

Update

@Ruben2424 Complex cases code is just above your reply.

part_error.log
part_success.log
full_error.log
full_success.log

@Ruben2424
Copy link
Contributor

I tried the original example with doh3 again. I captured the traffic with wireshark and the data sent in the data frame is identical when success or failure.

Fail:
Screenshot from 2024-08-04 15-17-18

I lean more and more towards thinking that this might be a bug in Cloudflare.
I dont know how i can resolve this. @seanmonstar do you have an idea?

@daxpedda
Copy link
Contributor Author

daxpedda commented Aug 4, 2024

When I was still debugging this I tried Quiche as a client as well and was unable to reproduce the issue with it.
Looking at Wireshark back then however gave me very different results between Quiche and H3/Quinn, while the data was the same, the way it was sent, e.g. opened QUIC streams and stuff like that, was very different.

I remember also trying it with cURL, which also used very different ways to establish a HTTP/3 connection (but still the same data).

So my best guess until now is that Cloudflare's DNS server doesn't like the way we handle/create QUIC streams to send the message. If Cloudflare uses Quiche for their DNS server as well, we might be able to reproduce the issue locally.

@highlyrecommand
Copy link

highlyrecommand commented Aug 4, 2024

I tried the original example with doh3 again. I captured the traffic with wireshark and the data sent in the data frame is identical when success or failure.

Fail: Screenshot from 2024-08-04 15-17-18

I lean more and more towards thinking that this might be a bug in Cloudflare. I dont know how i can resolve this. @seanmonstar do you have an idea?

No matter how many times it is retried, the log always shows:
Error (500):
stream 0 off=0 len=49 fin=false + stream 0 off=49 len=0 fin=true
Ok(200):
stream 0 off=0 len=49 fin=true

@Ruben2424 But for my case, sleep(Duration::from_nanos(1)).await or tokio::task::yield_now().await can temporarily fix this problem.
If it's not data races, then why does this work?

Emmm, I still think this is data races.
My case has nothing to do with GREASE

@Ruben2424
Copy link
Contributor

Ruben2424 commented Aug 4, 2024

What for me does not make sense is the fact that the errors are application http errors.

  • Error 500 Exception GET, Request with a GET or HEAD method cannot have a body.
  • Error 400 Invalid query

If this is a Problem on the quic or http level I would expect an error like this:
https://www.rfc-editor.org/rfc/rfc9000.html#name-transport-error-codes
https://www.rfc-editor.org/rfc/rfc9114.html#name-http-3-error-codes

Both errors indicate that there is a problem with the body.

  • 500 Problem: There is a body where none should be
  • 400 Problem: Different or longer body which leads to a incorrect DNS Query

My first thought was that there may be an issue that there is wrong data sent on the body triggered by a race condition.
My observations in wireshark for the DNS Problem showed that exactly the same data was sent (on the DATA frame) when failing or success.

No matter how many times it is retried, the log always shows:
Error (500):
stream 0 off=0 len=49 fin=false + stream 0 off=49 len=0 fin=true
Ok(200):
stream 0 off=0 len=49 fin=true

This indicates the same behavior. In any case there are 49 bytes sent on the request stream. This includes headers, body data, and potential grease frames. The only difference that i can see here that the FIN bit is set in a different quic frame. As far as I know this is is legit and not an error. But to be 100% sure that in this case there was the same data sent in both cases, we need to see what exactly the 49 bytes are.

@Ruben2424 But for my case, sleep(Duration::from_nanos(1)).await or tokio::task::yield_now().await can temporarily fix this problem.
If it's not data races, then why does this work?

I honestly dont know.

So my best guess until now is that Cloudflare's DNS server doesn't like the way we handle/create QUIC streams to send the message. If Cloudflare uses Quiche for their DNS server as well, we might be able to reproduce the issue locally.

I will try again to reproduce it localy with quiche. But in case I cannot reproduce it locally I have no idea hot to find the cause of the problem.

@highlyrecommand
Copy link

highlyrecommand commented Aug 4, 2024

@Ruben2424 Okay, got it, Thank you for your help.

I created the repository (https://github.com/highlyrecommand/h3_test),
This will cause 500 Error in China.
(if do not use sleep(Duration::from_nanos(1)).await / tokio::task::yield_now().await)

There is no problem with AWS or Azure in Hong Kong and other countries (cannot reproduce). This problem is very strange.
may be caused by Cloudflare?
may be caused by Workers application?
(but quiche cURL never got 500 Error)
may be caused by Great firewall of China IDS?
may be caused by ..?

Thanks for everyone's answers and help, I tried to solve it myself and will update here if there is any new progress.

  • workers.js
addEventListener(
  'fetch', event => {
    const worker_mirror_host = event.request.headers.get('Worker-Mirror-Host');
    const url = new URL(event.request.url);
    url.hostname = worker_mirror_host;

    const modifiedRequest = new Request(url, {
      body: event.request.body,
      headers: event.request.headers,
      method: event.request.method,
      redirect: event.request.redirect
    });

    modifiedRequest.headers.delete('Worker-Mirror-Host');
    modifiedRequest.headers.delete('CF-Connecting-IP');
    modifiedRequest.headers.delete('X-Forwarded-For');
    modifiedRequest.headers.delete('X-Real-IP');

    const request = new Request(url, modifiedRequest);
    event.respondWith(fetch(request));
  }
)

@highlyrecommand
Copy link

highlyrecommand commented Aug 4, 2024

Update

@Ruben2424 OK, I finally solved the problem with an acceptable solution (Cloudflare Workers side) (if ("GET" == request.method) || ("HEAD" == request.method);)

Workers side hack

export default {
  async fetch(request, env, ctx) {
    const get_or_head = ("GET" == request.method) || ("HEAD" == request.method);
    const worker_mirror_host = request.headers.get('Worker-Mirror-Host');

    const url = new URL(request.url);
    url.hostname = worker_mirror_host;

    const headers = new Headers(request.headers);
    headers.delete('Worker-Mirror-Host');
    headers.delete('CF-Connecting-IP');
    headers.delete('X-Forwarded-For');
    headers.delete('X-Real-IP');

    const modifiedRequest = new Request(url, {
      method: request.method,
      headers: headers,
      body: (get_or_head ? null : request.body),
      redirect: request.redirect,
    });

    return fetch(modifiedRequest);
  },
};

My guess: maybe some strange reason makes cloudflare think there is a body in the request, and the body length is 0.
(This might be due to network latency causing the send order to be incorrect? Queue?, hyperium/h3 retry mechanism? or perhaps other unknown factors).
In China ISP, hyperium/h3 500 Error does not occur every time (For example: 90 times out of 100 times 500 Error (Request with a GET or HEAD method cannot have a body), and 10 times 200 OK).

But there is no need to do this in quiche curl, and 500 Error will never be triggered.
Although this solved the problem, I still think there may be some underlying issues (data races?, or quiche has some weird issues with hyperium/h3, ..., Unknown cause, ...).

  • [-] The problem can be reproduced in mainland China (ISP / Fiber), but cannot be reproduced on AWS and Azure servers in Hong Kong or other countries (IDC / Datacenter).
  • [-] In China, this problem can be solved through sleep 1ns or yield_now.
  • [-] Or solved by Workers side hack.

Finally or Ending

I can accept this solution, thank you all for your help.

@Ruben2424
Copy link
Contributor

@Ruben2424 OK, I finally solved the problem with an acceptable solution (Cloudflare Workers side) (if ("GET" == request.method) || ("HEAD" == request.method);)

I am still curious why cloudflare receives a body when there is none sent. Would you mind printing out the body or something like that when one is received. Maybe this will help finding the underlying problem.

@highlyrecommand
Copy link

highlyrecommand commented Aug 7, 2024

Reply

@Ruben2424 No problem, I have recaptured two logs, one with 500 ERROR and one with 200 OK.

The response Body is in the log.

I replaced the china IP address and domain name with x only (hide sensitive information).
Please view the attachment below.

The execution order of function calls in the logs of 500ERROR and 200OK is very different, just because a yield_now is added.

This could be affecting the async scheduler?, the queue?, or some internal mechanism?, poll?, I only have a rudimentary understanding of asynchronous, so doing my best to figure out the problem.

And I found the strangest thing, if I put yield_now in an h3 internal function such as h3::connection::RequestStream::send_data(), it can also prevent the 500 error from being triggered.

and I cannot decrypt the traffic via SSL_KEY_LOG_FILE, if I can successfully decrypt it I will provide you the wireshark pcap file. (I don't know why, how do you do it?)

By the way

By the way, It seems that this length: 0 problem will only affect the request header (HEADER FRAME?), uploading (E.g. POST) and receiving responses/data (E.g. HTTP 206) will not be affected (perfect). I tested it yesterday using cloudflare workers to middlebox YouTube, and no failures occurred (using workers side hack), this is good news.

and quinn is very different from quiche. quinn uses inner padding (the padding content is also encrypted), while quiche uses outer padding (00000000 zero data until 1200 Bytes), The Great Firewall of China can recognize/identify quinn data packet, but not quiche.
I don't know which padding complies with the RFC standard. There may be many internal differences, (just an example).

Thanks

Thank you again for replying to me. 🙏

Attachments

err500_20240807_capture.log
ok200_20240807_capture.log

@hargut
Copy link

hargut commented Sep 12, 2024

As per the logs it looks like http/500 contains a request header content-lenght: 0, and http/200 does not contain this request header. This header likely triggers the body not being null but eventually "" in the workers request.body.

It would be interesting to see from the client logs if this header is part of the sent h3 request.

@Ruben2424 Ruben2424 added the needs-reproduction Problem cannot be reproduced locally with client and server label Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-reproduction Problem cannot be reproduced locally with client and server
Projects
None yet
Development

No branches or pull requests

5 participants