
Removal of certs in containerd configuration causing 500 error (mirror resolve retries exhausted for key) #413

Open
IndhumithaR opened this issue Apr 4, 2024 · 33 comments · Fixed by #424
Labels
bug Something isn't working

Comments

@IndhumithaR

Spegel version

v0.0.19

Kubernetes distribution

EKS

Kubernetes version

v1.27.0

CNI

VPC CNI

Describe the bug

We are using a JFrog Artifactory registry as our registry, with TLS client certificate authentication for resolving and accessing our registry endpoints. While running a pod we hit a 500 error (mirror resolve retries exhausted for key); it gave this error:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  29m                   default-scheduler  Successfully assigned large-pod-1 to ip-10-190-18-201.eu-west-1.compute.internal
  Normal   Pulling    27m (x4 over 29m)     kubelet            Pulling image "xxx"
  Warning  Failed     27m (x4 over 29m)     kubelet            Failed to pull image "xxx": rpc error: code = Unknown desc = failed to pull and unpack image "xxx": failed to resolve reference "xxx": unexpected status from HEAD request to http://10.190.18.201:30020/v2/xxx 500 Internal Server Error
  Warning  Failed     27m (x4 over 29m)     kubelet            Error: ErrImagePull
  Warning  Failed     27m (x6 over 29m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m6s (x109 over 29m)  kubelet            Back-off pulling image "xxx"

Digging deeper, we saw that our client.cert and client.key files are not loaded into the configuration:

configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}                                                            │
│ configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}               │
│ configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}                     

Only hosts.toml seems to be loaded by Spegel, so the node cannot access Artifactory and pull.

first.log

Sharing the Spegel log file for reference.
Can you please ensure that all the files in the cert configuration are copied into the Spegel configuration?

@IndhumithaR IndhumithaR added the bug Something isn't working label Apr 4, 2024
@phillebaba
Member

@IndhumithaR I understand the problem you are in, as Spegel will override any existing configuration. I agree that this should be fixed.

Could you share an example hosts.toml that you are using, which can be used in a unit test?

@badaldavda8

badaldavda8 commented Apr 5, 2024

Hello @phillebaba, thanks for the quick turnaround on this.

Actually we don't have a hosts.toml file of our own. It gets created automatically with hosts on 30020 and 30021, using the local IPv4 of the VM.

hosts.toml

[host."https://<LOCAL_VM_IPV4>:30020"]
  capabilities = ["pull", "resolve"]

[host."https://<LOCAL_VM_IPV4>:30021"]
  capabilities = ["pull", "resolve"]

While researching we did find a way to work around the issue by creating a hosts.toml file and adding the client field within the host configuration, but we are facing issues with it. When adding just
client = ["../../client.cert", "../../client.key"]

we get an error -
failed to load X509 key pair: tls: failed to find certificate PEM data in certificate input, but did find a private key; PEM inputs may have been switched

When we add it as
client = [["../../client.cert", "../../client.key"],[""]]
OR
client = [["../../client.cert", "../../client.key"],["",""]]
OR
client = [["../../client.cert"], ["../../client.key"]]

then we get
xxx": dial tcp 10.190.18.201:30020: connect: no route to host
which means it still does not connect.

But it would be far better if the whole mirror configuration were copied the way we add it in the certs.d directory, instead of just a hosts.toml being created.

@IndhumithaR
Author

IndhumithaR commented Apr 5, 2024

Hi @phillebaba, as Badal said, we worked around it by creating one hosts.toml file and adding the client field within the host configuration.

Here is the sample hosts.toml for your information:


mkdir -p /etc/containerd/certs.d/xxx/
cat > /etc/containerd/certs.d/xxx/hosts.toml << EOL
server = 'https://xxx'

[host]
[host.'http://${ipv4}:30020']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]

[host.'http://${ipv4}:30021']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
EOL

@badaldavda8

Also, instead of creating a new hosts.toml file, I think it would be better to copy the existing one if it's already present. Is that currently the case?

@phillebaba
Member

Yes I think the solution is to modify existing toml files instead of replacing them.

The current implementation is slightly optimistic when it comes to writing mirror configuration. To avoid any issues I thought it was best to completely replace any existing configuration. I do, however, see the use case for appending to existing configuration.

A good solution would be to add a new configuration option so users can decide whether to keep the existing behavior or use the new one. We can default it to the current behavior, and you can then enable appending to the mirror configuration. Does that sound good?

@IndhumithaR
Author

Yes, I think it should work. We can try this approach.

@badaldavda8

Yes I think the solution is to modify existing toml files instead of replacing them.

The current implementation is slightly optimistic when it comes to writing mirror configuration. To avoid any issues I thought it was best to completely replace any existing configuration. I do, however, see the use case for appending to existing configuration.

A good solution would be to add a new configuration option so users can decide whether to keep the existing behavior or use the new one. We can default it to the current behavior, and you can then enable appending to the mirror configuration. Does that sound good?

Yes, that would be great. Also, please not only add the hosts.toml files but also whatever other files are in the configuration, like client.key and client.cert, and load them. It would also be great to add a log statement for this, the way we currently see:

configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}                                                            │
│ configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}               │
│ configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}                     

we should have it printed as "added containerd mirror configuration" with the paths of all the files, so that when Spegel starts we know which mirrors and which files are loaded and added to the configuration. For this, if a hosts.toml file is present, overwrite it; if it's not present, create it.

@phillebaba
Member

@badaldavda8 @IndhumithaR

I have created #424 that should resolve this issue.

For clarification, the change adds a new flag --append-mirrors which, when enabled, will append to existing mirror configurations. Given the existing mirror configuration for docker.io on the node:

server = 'https://registry-1.docker.io'

[host]

[host.'http://example.com:30020']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

[host.'http://example.com:30021']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

It will result in the following configuration after Spegel has run. In this example we are just adding a single mirror located at localhost:5000.

server = 'https://registry-1.docker.io'

[host]

[host.'http://127.0.0.1:5000']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30020']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30021']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

Does this match the needs you have to solve the problem?

@IndhumithaR
Author

IndhumithaR commented Apr 8, 2024

@badaldavda8 @IndhumithaR

I have created #424 that should resolve this issue.

For clarification, the change adds a new flag --append-mirrors which, when enabled, will append to existing mirror configurations. Given the existing mirror configuration for docker.io on the node:

server = 'https://registry-1.docker.io'

[host]

[host.'http://example.com:30020']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

[host.'http://example.com:30021']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

It will result in the following configuration after Spegel has run. In this example we are just adding a single mirror located at localhost:5000.

server = 'https://registry-1.docker.io'

[host]

[host.'http://127.0.0.1:5000']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30020']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30021']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

Does this match the needs you have to solve the problem?

Hi @phillebaba, yes, this would solve our problem. It would also be better if we could still have all our files like client.cert and client.key in the config path instead of moving them to the backup path.

It would be great if you could give us an approximate ETA for fixing this issue.

@phillebaba
Member

So the PR should be ready to merge now, and I could probably cut a release if this is a blocker for you.

I am however thinking about the best way of keeping the certificates in the configuration directory. The question is how far to take this as we would need to determine which files should be kept. For this to work we would need some method of determining which directories are host configurations and which are not.

Would it be possible for you to place the certificates in a different directory like /etc/certs/client.pem like they do in the documentation, rather than placing them in /etc/containerd/certs.d?

@badaldavda8

Hello @phillebaba,

Thanks for adding the flag. The problem we are facing is that we are not sure exactly how to add these certs in the hosts.toml file and make it work. We were just attempting a workaround using hosts.toml.

It would be great if the whole directory were preserved, including /etc/containerd/certs.d/xyz.com/client.cert and /etc/containerd/certs.d/xyz.com/client.key, because this is how we have had it working until now.

We did try to add it via the client field as shared above, even without Spegel, and it failed. Do you have any direction on that? The containerd documentation is not clear on how to add client.key and client.cert for TLS authentication for each mirror.

We are still trying to add multiple configurations in containerd config.toml and hosts.toml. Will keep you updated on the issue as well.

@phillebaba
Member

One solution could be to copy over all of the file contents from the existing directories to allow registry-specific certs. I would rather avoid this if possible; if we were to add it, I would probably add an additional flag for this behavior.

I had a look back at your previous examples and saw that you are using client cert pairs for authentication. Note that the Containerd documentation has different syntax for this compared to using a PEM file, so it's important that you configure them as key pairs:

client = [["/etc/certs/client.cert", "/etc/certs/client.key"]]

https://github.com/containerd/containerd/blob/main/docs/hosts.md#client-field

In the documentation they use absolute paths for both files, which seems to work fine even when the files are located in a different directory.
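For illustration, a full hosts.toml using the key pair syntax could look roughly like this (the registry URL and cert paths are placeholders, not a verified configuration):

server = 'https://registry.example.com'

[host]
[host.'https://registry.example.com']
capabilities = ['pull', 'resolve']
client = [['/etc/certs/client.cert', '/etc/certs/client.key']]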

In theory you do not have to depend on Spegel to configure mirrors for you. As you are running EKS it is totally possible to set up all of the mirror configuration in your custom bootstrap.sh file and then disable Spegel's mirror configuration.

I will go ahead and merge #424 and cut a release for that. If you need more features to persist other files in the certs.d directory, let me know. I will keep this issue open until things are resolved.

@phillebaba
Member

@badaldavda8 @IndhumithaR I have cut v0.0.21 now, which contains an option to enable merging. As long as you are not placing certificates in the certs.d directory I think this will work for you.

@badaldavda8

We will test this and get back to you!

@IndhumithaR
Author

Hi @phillebaba,
We tried using v0.0.21 with --append-mirrors set to true.
Even then, we noticed that our existing hosts.toml is moved to /etc/containerd/certs.d/_backup and a new hosts.toml is created in /etc/containerd/certs.d/.

Not sure what's wrong.

@phillebaba
Member

That is the expected behavior. It is not really possible to skip backing up the mirror configuration. We need the original registries to be able to update the registry configuration at a later date; without the backup we won't know what the original state is.

You should see that the new host files are basically the original merged with the new configuration for Spegel, which is what we discussed.

The mirror configuration in Spegel is mostly helper logic meant to simplify operations for the majority of users. As I suggested before, if your use case does not fit this solution you could disable Spegel's mirror configuration and rely on your own method instead.

Just to make sure: the new hosts.toml files are correct, right? With the merged content?
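If it helps to verify, you can inspect the generated file directly on a node, for example (the registry directory name depends on your registry):

cat /etc/containerd/certs.d/<registry>/hosts.toml

and compare it with what was moved under /etc/containerd/certs.d/_backup.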

@IndhumithaR
Author

Hi @phillebaba
We see that the hosts.toml is getting overwritten. It is not getting merged.

@IndhumithaR
Author

That is the expected behavior. It is not really possible to skip backing up the mirror configuration. We need the original registries to be able to update the registry configuration at a later date; without the backup we won't know what the original state is.

You should see that the new host files are basically the original merged with the new configuration for Spegel, which is what we discussed.

The mirror configuration in Spegel is mostly helper logic meant to simplify operations for the majority of users. As I suggested before, if your use case does not fit this solution you could disable Spegel's mirror configuration and rely on your own method instead.

Just to make sure: the new hosts.toml files are correct, right? With the merged content?

How do you suggest we disable Spegel's mirror configuration? Is it by setting spegel.containerdMirrorAdd=false?

@phillebaba
Member

Yes, if you set that to false you will disable the whole init container which writes the mirror configuration.
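With the Helm chart that would be something along these lines (the chart reference and namespace are placeholders for whatever you already use):

helm upgrade --install spegel <spegel-chart> \
  --namespace spegel \
  --set spegel.containerdMirrorAdd=false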

@IndhumithaR
Author

Hi @phillebaba,
We tried disabling Spegel's mirror configuration.

Now we are able to pull Docker images, but we often get the error below, and only after multiple retries does the image get pulled.

code = Canceled desc = failed to pull and unpack image "xxx": failed to commit snapshot extract-975686076-HBR8 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled

Do you know why we are getting this error?

@IndhumithaR
Author

IndhumithaR commented Apr 22, 2024

Hi @phillebaba,

We deployed a pod to pull a Docker image; a new node was created and tried to pull the image. After several tries, the image was finally pulled in 10 minutes. Below are the pod events.

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  50m                default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {karpenter.sh/disruption: disrupting}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Normal   Nominated         50m                karpenter          Pod should schedule on: nodeclaim/cpu-node-pool-m74hm
  Normal   Scheduled         48m                default-scheduler  Successfully assigned dtp/large-pod-3 to ip-10-190-46-51.eu-west-1.compute.internal
  Warning  Failed            45m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": context canceled
  Warning  Failed            43m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to commit snapshot extract-787971535-l7S7 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
  Warning  Failed            39m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
  Warning  Failed            39m (x3 over 45m)  kubelet            Error: ErrImagePull
  Normal   BackOff           39m (x4 over 45m)  kubelet            Back-off pulling image "xxx:1.0"
  Warning  Failed            39m (x4 over 45m)  kubelet            Error: ImagePullBackOff
  Normal   Pulling           39m (x4 over 48m)  kubelet            Pulling image "xxx:1.0"
  Normal   Pulled            28m                kubelet            Successfully pulled image "xxx:1.0" in 10m23.459933513s (10m23.459942873s including waiting)
  Normal   Created           28m                kubelet            Created container large-container1
  Normal   Started           28m                kubelet            Started container large-container1

Spegel log for the node:

│ {"level":"info","ts":1713763231.1640038,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}                                                                                                                  │
│ {"level":"info","ts":1713763231.164356,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713763231.1648052,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.39.116/tcp/5001/p2p/12D3KooWMryfM2RbusSZQbcK64LpRR6udkjKaavpyFBGFr3nxk9x"}                                     │
│ {"level":"info","ts":1713763231.1651144,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}                                                                            │
│ {"level":"info","ts":1713763771.1649404,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713764245.5022466,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713764245.6634855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713764245.7156382,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e │
│ {"level":"info","ts":1713764311.16439,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                  │
│ {"level":"info","ts":1713764851.1644804,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765391.164477,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713765931.1649988,"caller":"state/state.go:30","msg":"running scheduled image state update"}    
 

We deployed another pod which got assigned to a new node. This pod also tried to pull the same image.
Below are its events:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  21m                default-scheduler  Successfully assigned dtp/large-pod-2 to ip-10-190-41-236.eu-west-1.compute.internal
  Warning  Failed     18m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": context canceled
  Warning  Failed     16m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
  Warning  Failed     13m (x3 over 18m)  kubelet            Error: ErrImagePull
  Warning  Failed     13m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": failed to commit snapshot extract-116295700--zVv sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
  Normal   BackOff    12m (x5 over 18m)  kubelet            Back-off pulling image "xxx:1.0"
  Warning  Failed     12m (x5 over 18m)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    12m (x4 over 21m)  kubelet            Pulling image "xxx:1.0"
  Normal   Pulled     114s               kubelet            Successfully pulled image "xxx:1.0" in 10m22.187702651s (10m22.187717239s including waiting)
  Normal   Created    114s               kubelet            Created container large-container1
  Normal   Started    114s               kubelet            Started container large-container1

Spegel log for 2nd node:

│ {"level":"info","ts":1713763055.7157514,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}                                                                                                                  │
│ {"level":"info","ts":1713763055.716569,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713763055.7168446,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.36.182/tcp/5001/p2p/12D3KooWJJt6drajVNQfjWedNueQ8c9VszokqhjSTQcVdC9w6RAL"}                                     │
│ {"level":"info","ts":1713763055.716939,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}                                                                             │
│ {"level":"info","ts":1713763595.716495,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713764135.7163444,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713764675.7167358,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765215.7165866,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765755.716464,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713766106.3790374,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713766106.4227855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713766106.4387383,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e │
│ {"level":"info","ts":1713766295.716398,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                

We see no improvement in image pull time between the first node and the second node. According to our understanding of Spegel, the first node will pull from Artifactory and the second node will pull the image from the first node, so we expect some improvement in the image pull time. In practice it took the same time.

Even when we look at the Spegel logs, nothing indicates that the image was pulled from the first node.
We need your help to properly understand this.

FYI, while installing Spegel we set spegel.appendMirrors=true and spegel.containerdMirrorAdd=false. We also set spegel.registries to our private registries.

We also tried increasing spegel.mirrorResolveRetries=4 and spegel.mirrorResolveTimeout="25s", but it didn't show any improvement.
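Concretely, the values above correspond to Helm flags roughly like these (the chart reference is a placeholder and the registry name is redacted):

helm upgrade --install spegel <spegel-chart> --namespace spegel \
  --set spegel.appendMirrors=true \
  --set spegel.containerdMirrorAdd=false \
  --set 'spegel.registries={https://xxx}' \
  --set spegel.mirrorResolveRetries=4 \
  --set spegel.mirrorResolveTimeout=25s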

Can you please explain the expected behavior or anything that we have missed out?
Thanks

@badaldavda8

Hi @phillebaba and the spegel team,

Thank you for the excellent work on the spegel project. We've been trying to use it in our application, but we're facing an issue with the TLS configuration while parsing the hosts.toml file.

Here are the details:

  1. We've provided the correct paths for the client.cert and client.key files in the hosts.toml file.
  2. We've verified that the certificate and private key files are in the correct PEM format, starting with -----BEGIN CERTIFICATE----- and -----BEGIN PRIVATE KEY-----, respectively.
  3. We've independently tested the tls.X509KeyPair function with the same certificate and private key files, and it works as expected.
  4. However, when running the spegel library, we're encountering the following error: failed to load X509 key pair: tls: found a certificate rather than a key in the PEM for the private key.
  5. This error seems to be occurring in the hosts.go file, specifically at line 224, where the tls.X509KeyPair function is called.
  6. Based on the Go standard library code (https://go.dev/src/crypto/tls/tls.go?s=5662:5729 - line 282), this error should only occur if the private key file contains "CERTIFICATE" instead of a valid private key.
  7. We've double-checked, and the private key file we're providing does not contain a certificate. It has the following format:
-----BEGIN PRIVATE KEY-----
<private_key_content>
-----END PRIVATE KEY-----

We're unsure why the spegel library is encountering this error when the standard Go TLS library works correctly with the same files.

We would greatly appreciate if you could investigate this issue and provide guidance on how we can resolve it. We've been trying to resolve this for the past two weeks without success, and we're eager to use the spegel project in our application.

Thank you in advance for your assistance.

Best regards,
Badal

@badaldavda8

Also, it would help if you could add the following to the debug log:

  1. hosts.toml file configuration that gets passed
  2. client.key and client.cert blocks getting passed
  3. The capability to preserve our hosts file. As shared by Indhumitha, it's currently getting overwritten rather than appended, even after adding the flag.

@phillebaba
Member

phillebaba commented Apr 24, 2024

I am taking a wild guess here, but I think the issue now is that you are passing the client certs to the wrong configuration and that the ordering might be wrong. As far as I understand the mirror configuration, ordering matters, which could be a reason why you are not seeing any performance gains even when things seem to work. Hosts are tried from top to bottom.

So when you are writing the mirror configuration on your own it should look something like this. The first two entries are for Spegel: one routes to the local port and the other is a fallback to a random Spegel instance in the cluster (not actually a requirement, but nice to have). After that comes the configuration for your Artifactory registry with the client cert authentication.

server = 'https://artifactory-registry-url'

[host]
[host.'http://127.0.0.1:30020']
capabilities = ['pull', 'resolve']

[host.'http://127.0.0.1:30021']
capabilities = ['pull', 'resolve']

[host.'https://artifactory-registry-url']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

I realize that I probably misled you with my earlier examples. That is my fault; I did not really think through the situation. I will probably add some examples to the docs to explain how the mirror configuration needs to look.

Does this make sense or have I gotten more confused about the actual issue?

@phillebaba
Member

FYI I have not tested this, but I assume this is the expected configuration for Containerd. Basically we are modifying the registry configuration to use client certs only for the Artifactory registry. As far as I can read the Containerd documentation, I cannot find another way to solve this.

@IndhumithaR
Author

@phillebaba, thanks for your response.

This is the hosts.toml we are currently using.

server = 'https://artifact.xxx.com'
client = [['/etc/certs/artifact.xxx.com/client.cert','/etc/certs/artifact.xxx.com/client.pem']]
 
[host]
[host.'http://10.190.63.87:30020']
capabilities = ['pull', 'resolve']
 
 
[host.'http://10.190.63.87:30021']
capabilities = ['pull', 'resolve']

With this toml file, we see a 200 response on the node which already has the image.
This is the log:

{"level":"info","ts":1713970486.4934852,"caller":"registry/registry.go:133","msg":"","path":"/v2/xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200,"method":"GET","latency":164.578264002,"ip":"10.190.46.158"  }

As per the Spegel documentation, we should see log entries with handler=mirror on the node which already has the Docker image. This handler=mirror is not seen in our log, but the IP address belongs to the new node which is trying to pull the image.

In the logs of the new node trying to pull the image, we are not seeing any 200 response.
This is the log of the new node trying to pull the image from another node:

configuration {"level":"info","ts":1713970359.27705,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https:/ │
│ /ghcr.io","path":"/etc/containerd/certs.d/ghcr.io/hosts.toml"}                                                                                         │
│ configuration {"level":"info","ts":1713970359.2771664,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://quay.io","path":"/etc/containerd/certs.d/quay.io/hosts.toml"}                                                                                       │
│ configuration {"level":"info","ts":1713970359.277267,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https: │
│ //mcr.microsoft.com","path":"/etc/containerd/certs.d/mcr.microsoft.com/hosts.toml"}                                                                    │
│ configuration {"level":"info","ts":1713970359.2773669,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://public.ecr.aws","path":"/etc/containerd/certs.d/public.ecr.aws/hosts.toml"}                                                                         │
│ registry {"level":"info","ts":1713971054.7486622,"caller":"state/state.go:30","msg":"running scheduled image state update"}                            │
│ registry {"level":"info","ts":1713971097.4811194,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}         │
│ registry {"level":"info","ts":1713971097.5458224,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"UPDATE"}         │
│ registry {"level":"info","ts":1713971097.569677,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}                     │
│ configuration {"level":"info","ts":1713970359.2774603,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://gcr.io","path":"/etc/containerd/certs.d/gcr.io/hosts.toml"}                                                                                         │
│ configuration {"level":"info","ts":1713970359.277548,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https: │
│ //registry.k8s.io","path":"/etc/containerd/certs.d/registry.k8s.io/hosts.toml"}                                                                        │
│ configuration {"level":"info","ts":1713970359.2776377,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://k8s.gcr.io","path":"/etc/containerd/certs.d/k8s.gcr.io/hosts.toml"}                                                                                 │
│ configuration {"level":"info","ts":1713970359.2777264,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://lscr.io","path":"/etc/containerd/certs.d/lscr.io/hosts.toml"}                                                                                       │
│ configuration {"level":"info","ts":1713970359.2778192,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}                                     │
│ configuration {"level":"info","ts":1713970359.2778912,"caller":"build/main.go:88","msg":"gracefully shutdown"}                                         │
│ Stream closed EOF for spegel/spegel-z6fbx (configuration)

We tried this with v0.0.21 and v0.0.22.
We are not sure where we are going wrong, but we still feel that the new node is not pulling the image from the node that already has it; every time it pulls from Artifactory instead.
Can you guide us on where this is going wrong and what has to be done to proceed further?

Thanks.

@phillebaba
Member

Sorry about the logs; I was doing some refactoring and accidentally removed the handler field from the logs, so it is no longer easy to see whether it's the mirror or the serve handler that is running. I will work on getting that back in.

I set up a similar solution with client certificates and got things working without any issues. I pulled the first image from the original registry, then moved on to another node and was able to pull the image from the first node this time.

Looking at your logs, it looks like the image pull is occurring before the actual mirror configuration is added. Maybe that is the cause of your problems, i.e. the Pod is created too quickly? Could you verify that Spegel has started properly before you try pulling the image.
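One quick way to check, assuming the default DaemonSet name and namespace from the chart, is to wait for the rollout before scheduling the workload:

kubectl --namespace spegel rollout status daemonset spegel --timeout=120s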

@badaldavda8

I see,

Our setup uses Karpenter to spin up nodes. As soon as a node comes up, the pod gets scheduled on it. I think at that point the Spegel pod might not yet be up, which might have caused the issue.

Is there any way we can create a dependency so that the Spegel pod or DaemonSet is up before our pod runs?

@phillebaba
Member

That is a good question, I do not have the answer off the top of my head.

Could we close this issue and open another one with this question? I think it would be valuable to document, but it is off topic for the issue at hand.

@IndhumithaR
Author

Hi @phillebaba

For the previous question regarding the daemonset, we will create a new issue.

But for testing purposes, we tried assigning pods to nodes on which Spegel had already started.

Log of first node: Node 1: [ip-10-190-58-49]

│ registry {"time":"2024-05-09T02:50:57.010817724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200, │
│ "method":"GET","latency":"1m51.167792276s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:51:16.015208831Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200, │
│ "method":"GET","latency":"2m15.050096046s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:57:58.533665689Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │
│ registry {"time":"2024-05-09T03:02:04.80476148Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"msg │
│ ":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e962 │
│ 7ae07f4f7e","type":"CREATE"}                                                                                                                                                           │
│ registry {"time":"2024-05-09T03:02:05.041862364Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"ms │
│ g":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e96 │
│ 27ae07f4f7e","type":"UPDATE"}                                                                                                                                                          │
│ registry {"time":"2024-05-09T03:02:05.044542155Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"ms │
│ g":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e │
│ ","type":"CREATE"}                                                                                                                                                                     │
│ registry {"time":"2024-05-09T03:06:58.53309718Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg │
│ ":"running scheduled image state update"}                                                                                                                                              │
│ registry {"time":"2024-05-09T03:15:58.533623363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │

Here we are able to see that node 2 (10.190.61.15) is pulling from this node, node 1.

Log of node 2: [ip-10.190.61.15]

│ registry {"time":"2024-05-09T02:50:40.116834275Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:f8cc231105a81f50e16fc6bf48258cc297f555f2995bd15643c74efebe5b8d78","status":200, │
│ "method":"GET","latency":"1m21.349075008s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:50:56.910979572Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200, │
│ "method":"GET","latency":"1m51.062246479s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:51:15.960404766Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200, │
│ "method":"GET","latency":"2m14.993870993s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:53:41.042683724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │
│ registry {"time":"2024-05-09T03:02:41.043052583Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}  

Based on the log of node 2, we are not sure if it is pulling the image from node 1 or from Artifactory.
We don't see any considerable decrease in image pull time.

Can you help us understand what's happening based on the logs?

Thanks.

@phillebaba
Member

Looking at the logs I can tell you that things are working. I can see in the logs for node 2 that the requests start there and then show up on node 1, so things are being pulled through Spegel.

Interesting that you are not seeing any improvement in image pull time. Just to make sure, you are running v0.0.22, right? The perceived performance depends on a lot of things, especially the disk performance of the nodes in the cluster.

@IndhumithaR
Author

IndhumithaR commented May 10, 2024

Looking at the logs I can tell you that things are working. I can see in the logs for node 2 that the requests start there and then show up on node 1, so things are being pulled through Spegel.

Interesting that you are not seeing any improvement in image pull time. Just to make sure, you are running v0.0.22, right? The perceived performance depends on a lot of things, especially the disk performance of the nodes in the cluster.

Oh OK, yes, we are using v0.0.22.
We are using AWS EC2 m5.xlarge nodes.


Image pull time on node 1 is 15 minutes and on node 2 is 13 minutes.
Approximately how much improvement in image pull time can we expect?
Is there anything we can try to improve it?

@phillebaba
Member

I would suggest you check out the benchmarks that I have documented. These should give you an idea of the expected performance.

https://github.com/spegel-org/spegel/blob/main/docs/BENCHMARK.md

Obviously this is subject to change depending on your infrastructure and the registry you are comparing with. I suggest you move the benchmark images to your own registry and run the benchmarks yourself to see what types of results you will see.
