
More fine-grained timeout handling #1041

Closed

otrosien opened this issue Apr 24, 2019 · 22 comments

@otrosien
Member

otrosien commented Apr 24, 2019

Is your feature request related to a problem? Please describe.

Our backend is expected to fail with a timeout after 450ms, but under massive traffic it starts experiencing garbage-collection pauses etc., so ideally Skipper in front of my application should react and send a 504 Gateway Timeout to the client. Unfortunately only a global backend timeout is configurable in Skipper, and we're sharing the Skipper ingress with other applications.

Describe the solution you would like

Ideally the timeout would be configurable per route, or per backend...

Describe alternatives you've considered (optional)

As a work-around we could invest in hardening the backend to ensure it fails faster, instead of trying to handle more connections and getting into GC issues.

Another alternative could be a dedicated skipper-ingress for our application that hard-codes our desired timeout setting.

Additional context (optional)

Why is failing fast important here? If we're not responding within the expected timeout, we burden our upstream clients with our increased latency, causing them to handle more concurrent connections and making it harder for them to trigger their circuit breakers.

Would you like to work on it?

Maybe.

@szuecs
Member

szuecs commented Apr 24, 2019

Thanks for the issue. I think we could start working on it by making timeouts configurable per backend (group).

@AlexanderYastrebov
Member

AlexanderYastrebov commented Oct 22, 2020

I will post a comment instead of a new issue; it is about an edge case (an infinitely slow backend) but is relevant to the subject.

Problem

TL;DR: There is no "response read timeout" available, so skipper hangs for as long as the backend hangs, until the client cancels the request.

Skipper has the following backend-related timeout configurations (look for *timeoutBackend*):

skipper/config/config.go

Lines 178 to 183 in 5a92e9f

TimeoutBackend               time.Duration `yaml:"timeout-backend"`
KeepaliveBackend             time.Duration `yaml:"keepalive-backend"`
EnableDualstackBackend       bool          `yaml:"enable-dualstack-backend"`
TlsHandshakeTimeoutBackend   time.Duration `yaml:"tls-timeout-backend"`
ResponseHeaderTimeoutBackend time.Duration `yaml:"response-header-timeout-backend"`
ExpectContinueTimeoutBackend time.Duration `yaml:"expect-continue-timeout-backend"`

skipper/config/config.go

Lines 386 to 391 in 5a92e9f

timeoutBackendUsage = "sets the TCP client connection timeout for backend connections"
keepaliveBackendUsage = "sets the keepalive for backend connections"
enableDualstackBackendUsage = "enables DualStack for backend connections"
tlsHandshakeTimeoutBackendUsage = "sets the TLS handshake timeout for backend connections"
responseHeaderTimeoutBackendUsage = "sets the HTTP response header timeout for backend connections"
expectContinueTimeoutBackendUsage = "sets the HTTP expect continue timeout for backend connections"

For backend requests skipper uses http.Transport:

skipper/proxy/proxy.go

Lines 623 to 637 in 5a92e9f

tr := &http.Transport{
	DialContext: newSkipperDialer(net.Dialer{
		Timeout:   p.Timeout,
		KeepAlive: p.KeepAlive,
		DualStack: p.DualStack,
	}).DialContext,
	TLSHandshakeTimeout:   p.TLSHandshakeTimeout,
	ResponseHeaderTimeout: p.ResponseHeaderTimeout,
	ExpectContinueTimeout: p.ExpectContinueTimeout,
	MaxIdleConns:          p.MaxIdleConns,
	MaxIdleConnsPerHost:   p.IdleConnectionsPerHost,
	IdleConnTimeout:       p.CloseIdleConnsPeriod,
	DisableKeepAlives:     p.DisableHTTPKeepalives,
	Proxy:                 proxyFromHeader,
}

  • TimeoutBackend is used to configure net.Dialer timeout:
    // Timeout is the maximum amount of time a dial will wait for
    // a connect to complete. If Deadline is also set, it may fail
    // earlier.
    //
    // The default is no timeout.
    //
    // When using TCP and dialing a host name with multiple IP
    // addresses, the timeout may be divided between them.
    //
    // With or without a timeout, the operating system may impose
    // its own earlier timeout. For instance, TCP timeouts are
    // often around 3 minutes.
    Timeout time.Duration
  • ResponseHeaderTimeoutBackend is used to configure http.Transport timeout:
    // ResponseHeaderTimeout, if non-zero, specifies the amount of
    // time to wait for a server's response headers after fully
    // writing the request (including its body, if any). This
    // time does not include the time to read the response body.
    ResponseHeaderTimeout time.Duration // Go 1.1

Since skipper uses http.Transport and not http.Client there is no "response read timeout" available such that http.Client has:

    // Timeout specifies a time limit for requests made by this
    // Client. The timeout includes connection time, any
    // redirects, and reading the response body. The timer remains
    // running after Get, Head, Post, or Do return and will
    // interrupt reading of the Response.Body.
    //
    // A Timeout of zero means no timeout.
    //
    // The Client cancels requests to the underlying Transport
    // as if the Request's Context ended.
    //
    // For compatibility, the Client will also use the deprecated
    // CancelRequest method on Transport if found. New
    // RoundTripper implementations should use the Request's Context
    // for cancellation instead of implementing CancelRequest.
    Timeout time.Duration // Go 1.3

If the backend has started to respond, has sent the response status, and then hangs,
skipper will forward the response status to the client and hang as well (in copyStream):

skipper/proxy/proxy.go

Lines 1225 to 1227 in 5a92e9f

ctx.responseWriter.WriteHeader(ctx.response.StatusCode)
ctx.responseWriter.Flush()
err := copyStream(ctx.responseWriter, ctx.response.Body, p.tracing, ctx.proxySpan)

until client cancels its request.

Consider a misbehaving backend that writes the status and then hangs:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {

		w.WriteHeader(418)

		f := w.(http.Flusher)
		f.Flush()

		time.Sleep(10 * time.Hour)
	})

	fmt.Printf("Starting server at port 8080\n")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

$ go run ~/tmp/slow.go
Starting server at port 8080

and skipper with all timeouts set to 1s:

$ bin/skipper -inline-routes='* -> "http://localhost:8080"' -read-timeout-server=1s -read-header-timeout-server=1s -write-timeout-server=1s -timeout-backend=1s -response-header-timeout-backend=1s
[APP]INFO[0000] Expose metrics in codahale format            
[APP]INFO[0000] support listener on :9911                    
[APP]INFO[0000] proxy listener on :9090                      
[APP]INFO[0000] TLS settings not found, defaulting to HTTP   
[APP]INFO[0000] route settings, reset, route: : * -> "http://localhost:8080" 
[APP]INFO[0000] route settings received                      
[APP]INFO[0000] route settings applied 

and client:

$ curl -v localhost:9090/slow
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9090 (#0)
> GET /slow HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 418 I'm a teapot
< Date: Thu, 22 Oct 2020 10:45:56 GMT
< Server: Skipper
< Transfer-Encoding: chunked
< 

The client receives the status from skipper and hangs forever; none of skipper's configured timeouts apply (all of them cover the connection or the first bytes received/sent, not the duration of the whole request-response roundtrip).

When client cancels request (^C), the following appears in skipper logs:

[APP]ERRO[0053] error while copying the response stream: context canceled 
127.0.0.1 - - [22/Oct/2020:12:45:56 +0200] "GET /slow HTTP/1.1" 418 0 "-" "curl/7.58.0" 47893 localhost:9090 - -

which comes from copyStream:

skipper/proxy/proxy.go

Lines 1227 to 1230 in 5a92e9f

err := copyStream(ctx.responseWriter, ctx.response.Body, p.tracing, ctx.proxySpan)
if err != nil {
	p.metrics.IncErrorsStreaming(ctx.route.Id)
	p.log.Errorf("error while copying the response stream: %v", err)

Solution

The backend request should configure a timeout similar to the following (note: this is a demo and the context is not cancelled properly):

@@ -883,6 +881,11 @@ func (p *Proxy) makeBackendRequest(ctx *context) (*http.Response, *proxyError) {
                return nil, &proxyError{handled: true}
        }
 
+       tc, _ := stdlibcontext.WithTimeout(req.Context(), 5*time.Second)
+       //defer _()  // releases resources if slowOperation completes before timeout elapses
+
+       req = req.WithContext(tc)
+
        bag := ctx.StateBag()
        spanName, ok := bag[tracingfilter.OpenTracingProxySpanKey].(string)
        if !ok {

In the above setup copyStream will unblock after 5s:

[APP]ERRO[0009] error while copying the response stream: context deadline exceeded 
127.0.0.1 - - [22/Oct/2020:12:51:41 +0200] "GET /slow HTTP/1.1" 418 0 "-" "curl/7.58.0" 5000 localhost:9090 - -

There could be a global timeout setting for all requests and a per-route setting (i.e. the subject of this ticket).

Open questions

  • How would global "read timeout" setting play with infinite responses (streaming/long polling)?
  • Backend request timeout context must be cancelled after successful response (happy path, nothing hangs) i.e. after copyStream but currently there are no means to do that. It would be nice to implement generic hook for response write completion - it could be used to clean request timeout context as well as to measure real backend latency that includes response write - useful for Endpoint metadata #1559 and loadbalancer algorithms #557.
  • How request timeout context plays with transparent retry (should it be the same timeout per attempt or both attempts must fit into timeout)?

Links

AlexanderYastrebov added a commit to AlexanderYastrebov/skipper that referenced this issue Oct 23, 2020
* Adds `backendTimeout` filter to configure route backend timeout
* Proxy sets up request context with configured timeout and responds
with 504 status on timeout (note: if response streaming has already
started it will be terminated, client will receive backend status and
truncated response body).

See zalando#1041
AlexanderYastrebov added a commit that referenced this issue Dec 15, 2020
aryszka pushed a commit that referenced this issue Dec 17, 2020
@szuecs
Member

szuecs commented Dec 21, 2020

The backendTimeout filter (https://opensource.zalando.com/skipper/reference/filters/#backendtimeout) can be configured per route.

@szuecs szuecs closed this as completed Dec 21, 2020
@AlexanderYastrebov
Member

AlexanderYastrebov commented May 7, 2021

Addition to the #1041 (comment)

The client receives status from skipper and hangs forever, none of the skipper configured timeouts are applicable

It turns out that -write-timeout-server applies if the server sends a body. Consider a server that sends pings indefinitely:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {

		w.WriteHeader(418)

		f := w.(http.Flusher)
		f.Flush()

		ticker := time.NewTicker(5 * time.Second)
		for range ticker.C {
			if _, err := fmt.Fprintf(w, "ping\n\n"); err != nil {
				fmt.Printf("Done\n")
				return
			}
			f.Flush()
		}
	})

	fmt.Printf("Starting server at port 8080\n")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

$ go run ~/tmp/slowping.go
Starting server at port 8080

and Skipper with all timeouts set to 1s:

$ bin/skipper -inline-routes='* -> "http://localhost:8080"' -read-timeout-server=1s -read-header-timeout-server=1s -write-timeout-server=1s -timeout-backend=1s -response-header-timeout-backend=1s

and the client:

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9090 (#0)
> GET /slow HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 418 I'm a teapot
< Date: Fri, 07 May 2021 10:39:47 GMT
< Server: Skipper
< Transfer-Encoding: chunked
< 
* transfer closed with outstanding read data remaining
* stopped the pause stream!
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining

The client receives the status from Skipper, and then Skipper closes the connection:

[APP]ERRO[0139] error while copying the response stream: context canceled 
127.0.0.1 - - [07/May/2021:12:40:54 +0200] "GET /slow HTTP/1.1" 418 6 "-" "curl/7.58.0" 5002 localhost:9090 - -

If -write-timeout-server is set to 10s, the client receives the first ping and then Skipper closes the connection:

$ bin/skipper -inline-routes='* -> "http://localhost:8080"' -read-timeout-server=1s -read-header-timeout-server=1s -write-timeout-server=10s -timeout-backend=1s -response-header-timeout-backend=1s
[APP]INFO[0000] Expose metrics in codahale format            
[APP]INFO[0000] support listener on :9911                    
[APP]INFO[0000] proxy listener on :9090                      
[APP]INFO[0000] TLS settings not found, defaulting to HTTP   
[APP]INFO[0000] route settings, reset, route: : * -> "http://localhost:8080" 
[APP]INFO[0000] route settings received                      
[APP]INFO[0000] route settings applied                       
[APP]ERRO[0013] error while copying the response stream: context canceled 
127.0.0.1 - - [07/May/2021:12:42:17 +0200] "GET /slow HTTP/1.1" 418 12 "-" "curl/7.58.0" 10003 localhost:9090 - -
$ curl -v localhost:9090/slow
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9090 (#0)
> GET /slow HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 418 I'm a teapot
< Date: Fri, 07 May 2021 10:42:17 GMT
< Server: Skipper
< Transfer-Encoding: chunked
< 
ping

* transfer closed with outstanding read data remaining
* stopped the pause stream!
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining

backendTimeout

The backendTimeout filter implemented in #1566 can only reduce the timeout relative to -write-timeout-server:

$ bin/skipper -inline-routes='* -> backendTimeout("5s") -> "http://localhost:8080"' -read-timeout-server=1s -read-header-timeout-server=1s -write-timeout-server=10s -timeout-backend=1s -response-header-timeout-backend=1s
[APP]INFO[0000] Expose metrics in codahale format            
[APP]INFO[0000] support listener on :9911                    
[APP]INFO[0000] proxy listener on :9090                      
[APP]INFO[0000] TLS settings not found, defaulting to HTTP   
[APP]INFO[0000] route settings, reset, route: : * -> backendTimeout("5s") -> "http://localhost:8080" 
[APP]INFO[0000] route settings received                      
[APP]INFO[0000] route settings applied                       
[APP]ERRO[0009] error while copying the response stream: context deadline exceeded 
127.0.0.1 - - [07/May/2021:12:48:58 +0200] "GET /slow HTTP/1.1" 418 0 "-" "curl/7.58.0" 5000 localhost:9090 - -
$ curl -v localhost:9090/slow
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9090 (#0)
> GET /slow HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 418 I'm a teapot
< Date: Fri, 07 May 2021 10:48:58 GMT
< Server: Skipper
< Transfer-Encoding: chunked
< 
* Connection #0 to host localhost left intact

If it is configured higher than -write-timeout-server, then -write-timeout-server applies:

$ bin/skipper -inline-routes='* -> backendTimeout("60s") -> "http://localhost:8080"' -read-timeout-server=1s -read-header-timeout-server=1s -write-timeout-server=10s -timeout-backend=1s -response-header-timeout-backend=1s
[APP]INFO[0000] Expose metrics in codahale format            
[APP]INFO[0000] support listener on :9911                    
[APP]INFO[0000] proxy listener on :9090                      
[APP]INFO[0000] TLS settings not found, defaulting to HTTP   
[APP]INFO[0000] route settings, reset, route: : * -> backendTimeout("60s") -> "http://localhost:8080" 
[APP]INFO[0000] route settings received                      
[APP]INFO[0000] route settings applied                       
[APP]ERRO[0014] error while copying the response stream: context canceled 
127.0.0.1 - - [07/May/2021:12:50:38 +0200] "GET /slow HTTP/1.1" 418 12 "-" "curl/7.58.0" 10003 localhost:9090 - -
$ curl -v localhost:9090/slow
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9090 (#0)
> GET /slow HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 418 I'm a teapot
< Date: Fri, 07 May 2021 10:50:38 GMT
< Server: Skipper
< Transfer-Encoding: chunked
< 
ping

* transfer closed with outstanding read data remaining
* stopped the pause stream!
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining

I.e. the backendTimeout filter may only be used to reduce -write-timeout-server (60s by default):

defaultWriteTimeoutServer = 60 * time.Second

@SergeiKalachev

Hello.

I have another use case where I need to configure timeouts per route/backend.
The original discussion started here.
Here is some copy-paste from the thread that explains the problem:

In our app I'm setting up an alternative to web-sockets which is SSE.
Locally everything works smoothly and as expected but in production my connections do not remain open (but they should). The connections are closed each minute. I tried running the local instance of skipper and was able to reproduce the problem.

I run:

./bin/skipper -address :3003 -routes-file=example.eskip -experimental-upgrade -insecure

example.eskip:

r: * -> "https://localhost:3000"

Then I call my backend through skipper at http://localhost:3003.
I see the browser establishes the connection; it remains open for only one minute and then closes. The logs from skipper:

[APP]ERRO[5034] error while copying the response stream: context canceled 
::1 - - [06/May/2021:16:09:05 +0200] "GET /events/assignments HTTP/1.1" 200 13 "http://localhost:3003/" "Mozilla/5.0 (Macintosh; 

Could you please help me to configure SSE feature properly?

and

Ok, I found the guy, it is -write-timeout-server.
My guess is it's skipper, because when I call my local BE directly the connection remains open. When I enable skipper my connection breaks after a minute. When I enable skipper with the flag -write-timeout-server=3m it remains open.

With the help of @AlexanderYastrebov I figured out the timeout I need to adjust: it is -write-timeout-server. If I set it to, let's say, 5m then the connection remains open for 5 minutes.
@AlexanderYastrebov also confirmed (see #1041 (comment)) that it's -write-timeout-server.

My question is: is it possible to create a new skipper filter, or adjust the existing backend-timeout one, to allow changing the value of -write-timeout-server?

@szuecs @aryszka folks what do you think?

@szuecs
Member

szuecs commented May 8, 2021

As far as I remember it's not easy to change it from filters. I am back at the end of May and can check it then.
@AlexanderYastrebov maybe you can check if we have the queue listener accessible from the proxy. If so we could add a func to change it via FilterContext, though I'm not sure it makes a lot of sense. The problem is that we would have to increase it to the max (for all routes and all connections) and rely on backendTimeout for earlier timeouts. Off the top of my head, this could open a Slowloris attack vector if a body is sent slowly on purpose.

@jrusso1020

jrusso1020 commented May 19, 2021

Took me forever to find this issue, but I believe we are encountering the same problem with an SSE stream behind skipper. Full issue describing it in the Backstage repo: backstage/backstage#5535

I also think it would be great to create a new skipper filter, or adjust the existing backend-timeout one, to allow changing the value of -write-timeout-server for a given route as opposed to globally, since as stated there are security concerns with setting -write-timeout-server too high and opening an attack vector for resource exhaustion.

@szuecs
Member

szuecs commented May 20, 2021

@jrusso1020 I will check the referenced issue, thanks for pointing it out.
The -write-timeout-server can't be changed per route. It's based on the listener socket, which does not know about routes (L4 vs L7 in the OSI model).

@jrusso1020

@szuecs gotcha. Just curious: are there any other recommendations on how to configure skipper to handle long-lived (over 1 minute) SSE streams, besides increasing -write-timeout-server?

@szuecs
Member

szuecs commented May 20, 2021

It depends on where the timeout hits us. The problem is that Go does not expose something we can use for this kind of decision. Maybe just decide that some duration like 15m or 1h is fine as a timeout?
You could also set it to 0 and disable it. Then you can use the backendTimeout() filter to add route-based configurations.

@jrusso1020

jrusso1020 commented May 27, 2021

thanks for opening up #1782 @szuecs! We are still observing some issues with SSE; let me know if you'd prefer I open a new issue or move this conversation to #1782.

We've increased -write-timeout-server to 300s (5 minutes), which is well above how long the stream takes, normally around 1-2 minutes.

However, we are still observing the issue, getting error while copying the response stream: context canceled. I've also noticed that this occurs even in some "successful" cases: the stream can take 1.1 to 1.2 minutes and we receive all the logs/messages looking successful, but error while copying the response stream: context canceled will still appear. For anything under 1 minute we don't see it.

I happened upon issue #1748, which seems rather similar, and they solved it by setting -response-header-timeout-backend as well as -write-timeout-server. We tried that too, but are still encountering the issue.

In the logs you can see the error while copying the response stream: context canceled appears a minute (or thereabouts) after the eventstream was started (the eventstream endpoint in the logs is the SSE stream).

[screenshot of skipper logs]

Our skipper image registry.opensource.zalan.do/teapot/skipper:v0.13.51
Our config

      skipper
      -kubernetes
      -kubernetes-in-cluster
      -address=:9999
      -proxy-preserve-host
      -serve-host-metrics
      -metrics-flavour=prometheus
      -enable-ratelimits
      -experimental-upgrade
      -enable-route-lifo-metrics
      -metrics-exp-decay-sample
      -kubernetes-https-redirect=true
      -lb-healthcheck-interval=3s
      -enable-connection-metrics
      -reverse-source-predicate
      -kubernetes-ingress-class=skipper-ingress-internal
      -write-timeout-server=300s
      -response-header-timeout-backend=300s

@AlexanderYastrebov
Member

@jrusso1020 Hi, if your Skipper is behind an AWS load balancer you may need to increase its idle timeout as well: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html (also mentioned in #1748 (comment))

@jrusso1020

Just to close the loop here: we are still seeing the error while copying the response stream: context canceled message in our logs.

But from a user perspective we are no longer seeing the SSE streams in the browser being reset.

@szuecs
Member

szuecs commented Jun 2, 2021

@jrusso1020 so do you retry from the client, or do you configure the idle timeout correctly, such that the ALB opens a new connection automatically?

@jrusso1020

@szuecs
We added the below to the ingress

-write-timeout-server=300s
-response-header-timeout-backend=300s

and the below to the kube-ingress-aws-controller controller

--idle-connection-timeout=300s

No additional code changes on our applications

So increasing the idle timeout means that the ALB will automatically open up new connections when the ingress closes them, as long as it's within that idle connection timeout and the client hasn't closed the connection?

@AlexanderYastrebov
Member

IMO the LB idle timeout should be above Skipper's timeouts. From Skipper's point of view, the load balancer is the client that closes the connection after 5 minutes. If that happens just before a Skipper timeout, it manifests as a context canceled error in the logs.

@szuecs
Member

szuecs commented Jun 2, 2021

@jrusso1020 also set Skipper's idle timeout 2s higher than the ALB idle timeout. I once investigated a rarely occurring issue that returned 500s, and it turned out to be a race condition between idle timeouts and an in-flight connect.

@jrusso1020

@szuecs can you clarify which "idle timeout" you recommend setting 2s higher than the ALB idle timeout? (Which flag/deployment should I set it for?)

@szuecs
Member

szuecs commented Jun 2, 2021

Skipper's idle timeout should be 2s bigger than the ALB idle timeout. Including @AlexanderYastrebov's answer, for example skipper:

-write-timeout-server=300s
-response-header-timeout-backend=300s
-idle-timeout-server=304s

and the below to the kube-ingress-aws-controller controller

--idle-connection-timeout=302s


@szuecs
Member

szuecs commented Sep 23, 2022

Let's reopen to stay aware of a possible change: we might be able to enhance the current timeouts in the future, once the implementation of the proposal is available (not sure that proposal is actually usable for us).

@szuecs szuecs reopened this Sep 23, 2022
@szuecs
Member

szuecs commented Jun 8, 2023

We added https://opensource.zalando.com/skipper/reference/filters/#readtimeout and https://opensource.zalando.com/skipper/reference/filters/#writetimeout a while ago, which you can set per route from a filter via http.ResponseController.
I think it's fine to close this now.

@szuecs szuecs closed this as completed Jun 8, 2023