Removal of certs in containerd configuration causing 500 error (mirror resolve retries exhausted for key) #413
Comments
@IndhumithaR I understand the problem that you are in, as Spegel will override any configuration. I do agree that this should be fixed. Could you share an example hosts.toml that you are using, which can be used in a unit test? |
Hello @phillebaba thanks for quick turnaround on this. Actually we don't have a hosts.toml file. It gets created automatically with hosts on 30020 and 30021 with local ipv4 of the VM. hosts.toml
But while researching we did find an interesting way to workaround the issue above by actually creating one hosts.toml file and adding client flag within the host configuration. But we are facing issues with it. When just adding we get an error - When we add it as then we get But it would be far better if the whole configuration of the mirror gets copied the way we add it in certs.d configuration instead of just hosts.toml getting created. |
Hi @phillebaba, As Badal said we did a workaround by creating one hosts.toml file and adding client flag within the host configuration. Here is the sample Hosts.toml for your information,
|
Also, I think that instead of creating a new hosts.toml file, it would be better to copy the existing one if it is already present. Is that currently the case? |
Yes I think the solution is to modify existing toml files instead of replacing them. The current implementation is slightly optimistic when it comes to writing mirror configuration. To avoid any issues I thought it was best to completely replace any configuration. I do see however the use case to append to existing configuration. A good solution would be to add a new configuration so users can decide whether to keep the existing behavior or use the new one. We can then default it to the current behavior and that way you can enable appending to mirror configuration. Does that sound good? |
Yes, I think it should work. We can try this approach. |
Yes. That would be great. Also, not just add hosts.toml files but whatever files are in configuration like client.key and client.cert to the configuration and load it. Also it would be great that we add a print statement in logs for this and the way we are saying
we should have it printed as added containerd mirror configuration with all the paths of all the files so that when spegel starts we know what all mirrors and what all files are loaded and added to the configuration. In this if hosts.toml file is present then overwrite. But if its not present, create it. |
I have created #424 that should resolve this issue. For clarification of the change it adds a new flag

server = 'https://registry-1.docker.io'
[host]
[host.'http://example.com:30020']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
[host.'http://example.com:30021']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

It will result in the following configuration after Spegel has run. In this example we are just adding a single mirror located at

server = 'https://registry-1.docker.io'
[host]
[host.'http://127.0.0.1:5000']
capabilities = ['pull', 'resolve']
[host.'http://example.com:30020']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']
[host.'http://example.com:30021']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

Does this match the needs you have to solve the problem? |
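For readers following along, the append behavior described in this comment can be sketched roughly as below. This is only an illustration in Python, not Spegel's actual implementation: the function name `merge_mirror_hosts` is hypothetical, and the rule for a URL that appears on both sides is an assumption.

```python
def merge_mirror_hosts(existing_hosts, mirror_urls):
    """Return a new host table: mirror entries first, then the existing hosts."""
    merged = {}
    for url in mirror_urls:
        # Mirror entries only get the capabilities shown in the example above.
        merged[url] = {"capabilities": ["pull", "resolve"]}
    for url, settings in existing_hosts.items():
        # Assumption: an existing entry with the same URL as a mirror is skipped.
        if url not in merged:
            merged[url] = dict(settings)
    return merged


client_files = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
existing = {
    "http://example.com:30020": {"capabilities": ["pull", "resolve"], "client": client_files},
    "http://example.com:30021": {"capabilities": ["pull", "resolve"], "client": client_files},
}
merged = merge_mirror_hosts(existing, ["http://127.0.0.1:5000"])
for url in merged:
    print(url)
```

Python dictionaries keep insertion order, so the mirror ends up ahead of the two existing hosts while their `client` settings are carried over unchanged.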
Hi @phillebaba, Yes, this would solve our problem. It would also be better if all our files like client.cert and client.key stayed in the config path instead of being moved to the backup path. It would be great if you could give us an approximate ETA for fixing this issue. |
So the PR should be ready to merge now, and I could probably cut a release if this is a blocker for you. I am however thinking about the best way of keeping the certificates in the configuration directory. The question is how far to take this as we would need to determine which files should be kept. For this to work we would need some method of determining which directories are host configurations and which are not. Would it be possible for you to place the certificates in a different directory like |
Hello @phillebaba, Thanks for adding the flag. But the problem that we are facing is that we are not sure exactly how do we add these certs in hosts.toml file and make it work. We were just attempting a workaround for that using hosts.toml. It would be great if the whole directory is loaded with We did try to add it as client as shared above even without spegel and it failed. Do you have any direction for that since containerd documentation is not clear on how do we add client.key and client.cert for tls authentication for each mirror. We are still trying to add multiple configurations in containerd config.toml and hosts.toml. Will keep you updated on the issue as well. |
One solution could be to copy over all of the file contents from the existing directories, to allow registry-specific certs. I would rather avoid this if possible. If we were to add this I would probably add an additional flag for this behavior. I had a look back at your previous examples and saw that you are using client cert pairs for authentication. Note that the Containerd documentation has a different syntax for this compared to using a pem file. So it's important that you configure them as key pairs.
https://github.com/containerd/containerd/blob/main/docs/hosts.md#client-field In the documentation they use absolute paths for both files, which seems to work fine even when the files are located in a different directory. In theory you do not have to depend on Spegel to configure mirrors for you. As you are running EKS it is totally possible to just set up all of the mirror configuration in your custom bootstrap.sh file and then just disable Spegel mirror configuration. I will go ahead and merge #424 and cut a release for that. If you need more features to persist other files in the certs.d directory let me know. Will keep this issue open until things are resolved. |
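To make the syntax difference concrete, here is a small Python sketch of how the `client` field appears to be interpreted, based on my reading of the linked Containerd documentation. The helper `client_entries` is hypothetical and only mimics the documented shapes: a single string is one PEM file, a list of strings is several PEM files, and a list of two-element lists is a set of cert/key pairs.

```python
def client_entries(client):
    """Normalize a `client` value into (cert, key) tuples; key is None for a standalone PEM."""
    if isinstance(client, str):
        # client = '/etc/certs/client.pem'
        return [(client, None)]
    entries = []
    for item in client:
        if isinstance(item, str):
            # client = ['a.pem', 'b.pem'] means two independent PEM files.
            entries.append((item, None))
        else:
            # client = [['client.cert', 'client.key']] is a cert/key pair.
            if len(item) != 2:
                raise ValueError(f"invalid client pair: {item!r}")
            cert, key = item
            entries.append((cert, key or None))
    return entries


flat = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
paired = [["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]]
print(client_entries(flat))
print(client_entries(paired))
```

Under this reading, the flat two-element list used earlier in the thread is treated as two separate PEM files instead of one certificate with its key, which may explain why it failed even without Spegel.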
@badaldavda8 @IndhumithaR I have cut v0.0.21 now which contains an option to enable merging. As long as you are not placing certificates in certs.d directory I think this would work for you. |
We will test this and get back to you! |
Hi @phillebaba, Not sure, What's wrong. |
That is the expected behavior. It is not really possible to skip backing up the mirror configuration. We need the original registries to be able to update the registry configuration at a later date. Without the backup we won't know what the original state is. You should see that the new host files are basically the original merged with the new configuration for Spegel, which is what we discussed. The mirror configuration in Spegel is mostly just helper logic for the purpose of simplifying operations for the majority of users. As I suggested before, if your use case does not fit in with this solution you could disable Spegel mirror configuration and rely on your own method instead. Just to make sure, the new hosts.toml files are correct, right? With the merged content. |
Hi @phillebaba |
How do you suggest that we disable spegel mirror configuration? Is it by using spegel.containerdMirrorAdd=false? |
Yes if you set that to false you will disable the whole init container which writes mirror configuration. |
Hi @phillebaba, Now we are able to pull docker images but getting the below error often and after multiple retries the image is getting pulled.
Do you know why we are getting this error? |
Hi @phillebaba, We deployed a pod to pull a docker image, a new node is created and tried to pull the image. After several tries, finally the image is pulled in 10 mins. The below is the pod log.
Spegel log for the node:
We deployed another pod which gets assigned to a new node. This pod also tries to pull the same image.
Spegel log for 2nd node:
We see no improvement in image pull time between the first node and the second node. According to our understanding of Spegel, the first node will pull from artifactory and the second node will pull the image from the first node. In this case, we expect some improvement in the image pull time, but it actually took the same time. Even when we look at the Spegel logs, they don't say anything like it pulled the image from the first node. FYI, while installing Spegel, we set spegel.appendMirrors=true and set spegel.containerdMirrorAdd=false. We also set spegel.registries to our private registries. We also tried increasing spegel.mirrorResolveRetries=4 and spegel.mirrorResolveTimeout="25s", but it doesn't show any improvement. Can you please explain the expected behavior or anything that we have missed? |
Hi @phillebaba and the spegel team, Thank you for the excellent work on the spegel project. We've been trying to use it in our application, but we're facing an issue with the TLS configuration while parsing the Here are the details:
We're unsure why the spegel library is encountering this error when the standard Go TLS library works correctly with the same files. We would greatly appreciate if you could investigate this issue and provide guidance on how we can resolve it. We've been trying to resolve this for the past two weeks without success, and we're eager to use the spegel project in our application. Thank you in advance for your assistance. Best regards, |
Also, if you can assist us by adding following in the debug log -
|
Now I am taking a wild guess but I think that the issue now is that you are passing the client certs to the wrong configuration and that the ordering might be wrong. As far as I understand the mirror configuration, ordering matters, which could be a reason why you are not seeing any performance gains when things seem to work. The ordering should be top to bottom. So when you are configuring the mirror configuration on your own it should look something like this. The first two entries are for Spegel. One will route to the local port and the other is a fallback to a random Spegel instance in the cluster. It isn't actually a requirement but a nice to have. Then after that comes the configuration for your artifactory registry with the client cert authentication.

server = 'https://artifactory-registry-url'
[host]
[host.'http://127.0.0.1:30020']
capabilities = ['pull', 'resolve']
[host.'http://127.0.0.1:30021']
capabilities = ['pull', 'resolve']
[host.'https://artifactory-registry-url']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

I realize that I probably misled you in my examples. That is my fault, I did not really think through the situation. I will probably add some examples to the docs to explain what mirror configuration needs to look like. Does this make sense or have I gotten more confused about the actual issue? |
FYI I have not tested this but I assume this is the expected configuration for Containerd. Basically we are modifying the registry configuration to use client certs only for the artifactory registry. As far as I can read the Containerd documentation I cannot find another way to solve this. |
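If the mirror configuration is written by a custom bootstrap script, the ordering described above could be generated with something like the following Python sketch. It is untested against Containerd: `render_hosts_toml` and `toml_value` are hypothetical helpers, values are assumed to contain no single quotes, and the client entry is written in the nested key-pair form from the Containerd documentation linked earlier in the thread, which is an assumption about what this setup needs.

```python
def toml_value(value):
    """Render a string or (nested) list as a TOML literal; assumes no single quotes in strings."""
    if isinstance(value, str):
        return "'" + value + "'"
    return "[" + ", ".join(toml_value(v) for v in value) + "]"


def render_hosts_toml(server, hosts):
    """Render hosts.toml text, keeping the host entries in the order given."""
    lines = [f"server = {toml_value(server)}", "[host]"]
    for url, settings in hosts:
        lines.append(f"[host.{toml_value(url)}]")
        for key, value in settings.items():
            lines.append(f"{key} = {toml_value(value)}")
    return "\n".join(lines) + "\n"


caps = ["pull", "resolve"]
hosts = [
    # Spegel entries first: local port, then the fallback port.
    ("http://127.0.0.1:30020", {"capabilities": caps}),
    ("http://127.0.0.1:30021", {"capabilities": caps}),
    # The upstream registry last, with the client cert/key pair.
    ("https://artifactory-registry-url", {
        "client": [["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]],
        "capabilities": caps,
    }),
]
text = render_hosts_toml("https://artifactory-registry-url", hosts)
print(text)
```

Passing the hosts as a list of tuples makes the top-to-bottom order explicit, so the Spegel entries are always tried before the upstream registry.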
@phillebaba, Thanks for your response. This is the hosts.toml we are currently using.
With this toml file, We see response of 200 in node which already has the image.
As per the Spegel documentation, we see the logs having handler=mirror in the node which already has the docker image. In the logs of the new node trying to pull the image, we are not seeing any 200 response.
We tried this with version 0.0.21 and v 0.0.22. Thanks. |
Sorry about the logs, I was doing some refactoring and accidentally removed the handler from the logs. So it is no longer easy to see if it's the mirror or the serve handler that is running. Will work on getting that back in. I set up a similar solution with client certificates and got things working without any issue. I pulled the first image from the original registry. Then I moved on to another node and was able to pull the image from the other node this time. Looking at your logs it looks like the image pull is occurring before the actual mirror configuration is added? Maybe that is the cause of your problems, that the Pod is created too quickly? Could you verify that Spegel has started properly before you try pulling the image? |
I see, The setup that we have is using Karpenter to spin up nodes. Now as soon as the nodes come up the pod starts scheduling in that node. I think that at that point even spegel pod might not be up due to which this might have caused the issue. Is there any way that we can create a dependency that spegel pod or the daemonset should be up before our pod runs? |
That is a good question, I do not have the answer off the top of my head. Could we close this issue and open another one with this question? I think it would be something valuable to document but it is off topic to the issue at hand. |
Hi @phillebaba For the previous question regarding the daemonset, we will create a new issue. But for testing purpose, we tried assigning pods to the nodes in which spegel is already started. Log of first node: Node 1: [ip-10-190-58-49]
Here we are able to see that node 2 (10.190.61.15) is trying to pull from node 1. Log of node 2: [ip-10.190.61.15]
Based on the log of node 2, we are not sure if it is pulling the image from node 1 or from artifactory. Can you help us understand what's happening based on the logs? Thanks. |
Looking at the logs I can tell you that things are working. I can see in the logs for node 2 that the request starts there and then they are present in node 1. So things are getting pulled from Spegel. Interesting that you are not seeing any improvement in image pull time. Just to make sure you are running v0.0.22 right? The perceived performance depends on a lot of things, especially the disk performance of the nodes used in the cluster. |
Oh Ok, Yes we are using v0.0.22. Image pull time in node 1 is 15 minutes and in node2 is 13 minutes. |
I would suggest you check out the benchmarks that I have documented. These should give you an idea of the expected performance. https://github.com/spegel-org/spegel/blob/main/docs/BENCHMARK.md Obviously this is subject to change depending on your infrastructure and the registry you are comparing with. I suggest you move the benchmark images to your own registry and run the benchmarks yourself to see what types of results you will see. |
Spegel version
v0.0.19
Kubernetes distribution
EKS
Kubernetes version
v1.27.0
CNI
VPC CNI
Describe the bug
We are using JFrog Artifactory as our registry, with TLS certificate authentication for resolving and accessing our registry endpoints. We were facing a 500 error with "mirror resolve retries exhausted for key". While running a pod, it gave this error
On digging deeper we saw that our client.cert and client.key files are not loaded in the configuration;
only hosts.toml seems to be loaded by Spegel. Therefore it is not able to access Artifactory and pull.
first.log
Sharing the spegel log file for reference.
Can you please ensure that all the files in the cert configuration are copied into the Spegel configuration?