
Question: What happens if a spegel node is down? #440

Closed
guettli opened this issue Apr 17, 2024 · 6 comments · Fixed by #443
Labels
enhancement New feature or request

Comments

@guettli
Contributor

guettli commented Apr 17, 2024

Describe the problem to be solved

If a spegel node goes down, then the images stored on it will be unreachable.

Does spegel distribute the images across several hosts, so that an outage of one node has no impact?

I read the docs, forgive me if I was blind, but I found nothing about redundant storage of images.

Proposed solution to the problem

A new section in the documentation explaining what happens when a spegel node is down would be great.

If the answer is "there is no redundancy; if one node goes down, then those images need to be fetched from upstream again", that is also fine (and should be part of the docs).

@guettli guettli added the enhancement New feature or request label Apr 17, 2024
@danielloader

From my understanding you're correct: if a node goes down (spot instance reclaimed, node goes offline, the daemonset loses a pod), then spegel will just pull via the spegel pod on the same node, and then advertise that the image is available to the other instances once you initiate a pull on another node.

@bittrance
Contributor

bittrance commented Apr 17, 2024

UPDATE: note that a better version follows below.

Indeed, spegel does not currently do any proactive replication of pulled images, see #375. And as danielloader says, the consequence of peer failure is relatively benign. Here is an attempt at a FAQ answer for this question:

In the interval between a spegel peer failing (e.g. node death) and the consensus algorithm agreeing that the peer is dead, other spegel peers may try to forward requests to the failed peer, delaying the response to the pulling client. In benign scenarios, this delay is the length of an intra-cluster round trip, likely <1ms. Of course, there are less benign scenarios (e.g. inter-node packet loss) where no replies will come back and spegel's forwarder will eventually time out before moving on to the next available instance. Spegel does not specify the various options (primarily timeout and dialOptions) to its internal containerd client and depends on defaults as set in https://github.com/containerd/containerd/blob/8317959018015f6a1756ec8cd08be1093fd630a2/client/client.go#L87. Similarly, spegel depends on libp2p's default algorithm and options for detecting dead peers. The exact length of the window between failure and consensus is too dependent on failure mode to state with confidence, but I would expect it to be 1-60s in 95% of cases.

Please note that a client is likely to request several layers in parallel, and spegel will try to spread its forwards across the peers that announce a particular layer, so the benign scenario is unlikely to impact pod startup time. Only when multiple spegel instances fail simultaneously, or when an image dominated by one large layer is affected, is pod startup time materially increased.

spegel's documentation currently does not have a detailed text description of the pull flow. @guettli, do you feel we should have this level of detail in the README, or would you have found it if we put it in the FAQ?

@bittrance
Contributor

bittrance commented Apr 17, 2024

Sorry, the above description is slightly confused. Actually, one spegel peer forwarding to another spegel peer will use a httputil.ReverseProxy (not the containerd.Client as the text above implies) which uses a http.DefaultTransport (see https://cs.opensource.google/go/go/+/master:src/net/http/transport.go;l=43) and will time out accordingly. The scenario above may of course also occur if spegel cannot talk to its local containerd domain socket, but that failure is likely to be instant, unless containerd misbehaves in some inspired way. Better version:

In the interval between a spegel peer failing (e.g. node death) and the other peers deciding that it is dead, other spegel peers may try to forward requests to the failed peer, delaying the response to the pulling client. In benign scenarios, this delay is the length of an intra-cluster round trip (the HTTP request and an ICMP unreachable response), likely <1ms. Of course, there are less benign scenarios (e.g. inter-node packet loss) where no replies will come back and spegel's forwarder will eventually time out before moving on to the next available peer. Spegel uses the standard library's httputil.ReverseProxy to forward requests, which in turn depends on DefaultTransport to decide how long to wait before giving up. Similarly, spegel depends on libp2p's default algorithm and options for detecting dead peers. The exact length of the window between failure and eviction can vary, but the max TTL for a resolved peer is currently 10 minutes, so that should be the upper bound.

Please note that a client is likely to request several layers in parallel, and spegel will try to spread its forwards across the peers that announce a particular layer, so the benign scenario is unlikely to impact pod startup time. Only when multiple spegel instances fail simultaneously, or when an image dominated by one large layer is affected, is pod startup time materially increased.

@guettli
Contributor Author

guettli commented Apr 17, 2024

@bittrance it would be great to have this in the FAQ. Thank you!

@phillebaba
Member

I will have a look at #443 tomorrow, but overall what @bittrance stated is true.

I have been looking at future solutions for doing preemptive distribution of images to make sure that replication is >1. This will most likely be a feature in Spegel in the future, but I don't know when, or how it will look. There are a lot of aspects to take into account when building these features, and I want to hit as many use cases with as small changes as possible.

@guettli
Contributor Author

guettli commented Apr 18, 2024

@phillebaba great to hear your plans. At the moment this question was mostly about missing documentation. Thank you.
