CachedImage not created #452

Open · bpoland opened this issue Dec 11, 2024 · 11 comments
bpoland commented Dec 11, 2024

Hey there, I have a weird situation where one of my images was getting rewritten but no CachedImage or Repository was being created for it automatically. I have looked through the controller logs for hints but am not seeing anything obvious. The CachedImage and Repository are being created for other images in the same repo, just not this one.
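Specifically, I've been going through the logs roughly like this (the namespace and deployment name here are placeholders for my install):

kubectl -n kuik-system logs deploy/kube-image-keeper-controllers | grep -i cachedimage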

The only thing I could think of is that the problematic image name and tag are longer than the others, so I tried manually creating the CachedImage and Repository to see if I got any errors, but it worked fine and now my Pods are starting.

Has anyone seen anything like this before? Are there specific log lines I can look at in the controller log to try to troubleshoot? Thank you!

plaffitt self-assigned this Dec 12, 2024

plaffitt commented:

Hello,

I would need more information in order to help you. Could you please provide:

  • Relevant logs (specifically the logs of the pod controller, which is in charge of creating CachedImages). You can enable debug logging by setting .controllers.verbosity: DEBUG to get more details (see the values snippet after this list).
  • A minimal reproducible example of a Pod producing this behavior.
  • Any custom configuration in your values.yaml.
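For the verbosity setting, that means merging something like this into your values.yaml and re-running your usual helm upgrade (same key as in the first point, just shown in yaml form):

controllers:
  verbosity: DEBUG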

Thanks for reporting!


bpoland commented Dec 12, 2024

Ah thank you, I will enable the higher verbosity and try to reproduce in a test cluster in the next few days.


bpoland commented Feb 7, 2025

Hi, just want to circle back here, sorry for the delay! This continues to happen to us occasionally, but when it does, enabling debug logs restarts the controllers, and as soon as they restart they notice that a CachedImage should exist and create it. I guess we could try leaving debug mode on for a while to catch it in the act, but that might be too noisy.

The latest time it was just a simple nginx image that was failing. When this happens, the pod goes into ImagePullBackOff because the rewrite adds the localhost:7439 prefix, but the local registry doesn't have the image and returns a 404 :(

  Normal   Pulling    24s (x2 over 42s)  kubelet            Pulling image "localhost:7439/public.ecr.aws/nginx/nginx:stable-alpine"
  Warning  Failed     24s (x2 over 42s)  kubelet            Failed to pull image "localhost:7439/public.ecr.aws/nginx/nginx:stable-alpine": reading manifest stable-alpine in localhost:7439/public.ecr.aws/nginx/nginx: StatusCode: 404, ""
  Warning  Failed     24s (x2 over 42s)  kubelet            Error: ErrImagePull
  Normal   BackOff    12s (x2 over 42s)  kubelet            Back-off pulling image "localhost:7439/public.ecr.aws/nginx/nginx:stable-alpine"
  Warning  Failed     12s (x2 over 42s)  kubelet            Error: ImagePullBackOff
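When this happens I also check whether kuik has any record of the image at all; both of these come back empty for the affected image until I create the resources by hand (the resource names below are the kuik CRDs, and the grep is just the image name):

kubectl get cachedimages | grep nginx
kubectl get repositories | grep nginx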

In terms of custom stuff in our helm values, I don't think there's anything too crazy:

cachedImagesExpiryDelay: 7

controllers:
  webhook:
    acceptedImages:
      - "docker.io/.*"
      - "public.ecr.aws/.*"
      ... some private repos
    ignorePullPolicyAlways: false
  resources:
    requests:
      cpu: "500m"
      memory: "500Mi"
    limits:
      cpu: "2"
      memory: "1Gi"

proxy:
  hostNetwork: true

registry:
  garbageCollection:
    schedule: "0 0 * * *" # daily
  imagePullSecrets:
    - name: regcred
  replicas: 3
  pdb:
    create: true

minio:
  enabled: true
  global:
    imagePullSecrets:
      - name: regcred
  metrics:
    enabled: true
  persistence:
    storageClass: high-io
    size: 50Gi
  statefulset:
    zones: 1

plaffitt commented:

Hello,

Something that would help a lot is the yaml of a pod that has this issue (a container whose image was rewritten but with the corresponding CachedImage not created). With this I may not even need logs. What could be happening is that some annotations that should be added by our controller are missing, and thus the CachedImage is not created even though the image is rewritten.
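The annotations are part of the pod yaml, but if it's easier you can also dump just them with something like:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations}'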

I'm also wondering: did you change controllers.webhook.acceptedImages or ignorePullPolicyAlways shortly before this happened? I wonder if that could be related. For instance, the image gets rewritten, but then it is removed from acceptedImages, so the annotation is removed while the image is not reverted to its original value, and then the CachedImage expires.


bpoland commented Feb 19, 2025

Something that would help a lot is the yaml of a pod that has this issue (a container whose image was rewritten but with the corresponding CachedImage not created). With this I may not even need logs. What could be happening is that some annotations that should be added by our controller are missing, and thus the CachedImage is not created even though the image is rewritten.

aha thanks, I will grab this if I see the issue happen again!

I'm also wondering: did you change controllers.webhook.acceptedImages or ignorePullPolicyAlways shortly before this happened? I wonder if that could be related. For instance, the image gets rewritten, but then it is removed from acceptedImages, so the annotation is removed while the image is not reverted to its original value, and then the CachedImage expires.

That is an interesting thought, but no, we hadn't changed either of those when the issue happened. ECR Public had been in the accepted images for weeks before this latest problem pulling the nginx image from ECR Public (that was the first time we had tried to pull that image).

Thanks for the response, will keep you updated if this happens again and grab that pod yaml!


bpoland commented Feb 24, 2025

Hi we just had this come up again. I was able to get the DEBUG controller logs: https://gist.github.com/bpoland/604a1ed998e12a4b57e5d3a7dd4b0891

I also grabbed the pod yaml when the issue was happening: https://gist.github.com/bpoland/093f78dad60410597d362dc261ea9861

I redacted both, but the image that was not being cached is redacted-image-repo-problem:tag. This time it's a private image that requires credentials, not one from ECR Public.

This time, restarting the kuik controllers was not enough to get things working again. I had to manually create the Repository and CachedImage resources myself, but then kuik kicked in and pulled the image (and a new pod was able to use the pulled image without issues).
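For reference, the CachedImage I created by hand looked roughly like this (I based it on the CachedImages kuik had already created for other images, so double-check the apiVersion and field names against the CRDs in your own cluster; the Repository was done the same way):

apiVersion: kuik.enix.io/v1alpha1
kind: CachedImage
metadata:
  name: docker.io-library-redacted-image-repo-problem-tag
spec:
  sourceImage: redacted-image-repo-problem:tag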

Please let me know if that's helpful, thank you!


plaffitt commented Mar 4, 2025

Hi, I've tried creating this pod in my dev cluster, but it successfully created the required images. If you try to re-create this pod (with the redacted images or not, it shouldn't matter), are you able to reproduce the issue?


bpoland commented Mar 4, 2025

Hi, I've tried creating this pod in my dev cluster, but it successfully created the required images. If you try to re-create this pod (with the redacted images or not, it shouldn't matter), are you able to reproduce the issue?

Yes, trying to create multiple pods with the same new image has the same problem until I manually create the Repository and CachedImage resources, or sometimes restarting the kuik controllers is enough to get it fixed.

What other info can I grab the next time this happens to help troubleshoot? Thank you!


plaffitt commented Mar 5, 2025

Sorry, I think I wasn't clear enough. What I meant was to run the following:

kubectl create ns redacted-namespace-1
curl https://gist.githubusercontent.com/bpoland/093f78dad60410597d362dc261ea9861/raw/6142f34d8ad71969dc8261ee1f23a0bfddcddcf7/badpod.yaml | kubectl apply -f -

Do the 3 images docker.io-library-redacted-image-repo-{agent,gradle,problem}-tag get created?
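A quick way to check would be something like:

kubectl get cachedimages | grep redacted-image-repo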

This is to understand whether the issue really is with the spec of the pod (which seems to be the case, since restarting the kuik controllers didn't fix it). In my case, when I apply the above yaml as-is, the images get created. If in your setup they are not, it may come down to a difference between our setups (different k8s version, kuik version, cloud provider, or something else). By the way, what versions of kuik and k8s are you using? Is there anything specific to your setup that could cause this?


bpoland commented Mar 5, 2025

Ah okay, I will try that the next time we see the issue. My guess is that other similar pods using the same new image will not trigger the CachedImage resource to be created, but I will let you know.

We are on k8s 1.29 (soon to be 1.30) and kuik 1.12.0. The only potentially weird configuration is that we specify acceptedImages instead of caching everything, but I think you saw that above?


plaffitt commented Mar 7, 2025

Ok, thanks. Yeah, I saw the acceptedImages option, but I don't think the issue comes from there, otherwise you would hit it 100% of the time. IMO it's more likely some weird race condition. Let me know if you find something interesting!
