Investigate why the registry cache cannot pull blobs from ECR #259

Open
ialidzhikov opened this issue Sep 19, 2024 · 3 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug

Comments

@ialidzhikov
Member

How to categorize this issue?

/area quality
/kind bug

What happened:
The registry cache for some reason cannot pull blobs from ECR (at least from public.ecr.aws).

What you expected to happen:
The registry cache to pull images from ECR.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Shoot with cache for upstream public.ecr.aws

  2. Create a Pod that uses an image from the upstream, for example public.ecr.aws/nginx/nginx:1.23.0

  3. Observe that the registry-cache fails to pull the blobs

Logs:

time="2024-09-19T14:23:24.143601968Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=46e13e3b-667d-44d4-bae2-6166c692b88a http.request.method=GET http.request.remoteaddr="10.3.0.1:2562" http.request.uri="/v2/nginx/nginx/blobs/sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=263.540084ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469" vars.name=nginx/nginx version=3.0.0-beta.1
time="2024-09-19T14:23:24.143602427Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=5e9c9254-da0f-48dd-be58-c31a66f8cec2 http.request.method=GET http.request.remoteaddr="10.3.0.1:39548" http.request.uri="/v2/nginx/nginx/blobs/sha256:778ddef5c8e3dfac8ba7265cbd22065f975b42a467e899f753d6d42d1b069da4?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=265.722708ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:778ddef5c8e3dfac8ba7265cbd22065f975b42a467e899f753d6d42d1b069da4" vars.name=nginx/nginx version=3.0.0-beta.1
time="2024-09-19T14:23:24.147671677Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=6ab4f416-338e-4714-9112-ad7ceb18cf70 http.request.method=GET http.request.remoteaddr="10.3.0.1:43359" http.request.uri="/v2/nginx/nginx/blobs/sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=255.817666ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67" vars.name=nginx/nginx version=3.0.0-beta.1
10.3.0.1 - - [19/Sep/2024:14:23:23 +0000] "GET /v2/nginx/nginx/blobs/sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67?ns=public.ecr.aws HTTP/1.1" 500 84 "" "containerd/v1.7.18"
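To replay one of the failing requests outside containerd, the request URL can be rebuilt from the fields in the logs above. This is only a sketch: the `blob_url` helper is hypothetical, and the `http://` scheme is an assumption (the logs show the host, port and path, but not the scheme).

```python
from urllib.parse import urlencode


def blob_url(cache_host: str, name: str, digest: str, upstream: str) -> str:
    """Build a blob URL in the shape seen in http.request.uri in the logs.

    The http:// scheme is an assumption; adjust it to match the cache endpoint.
    """
    query = urlencode({"ns": upstream})
    return f"http://{cache_host}/v2/{name}/blobs/{digest}?{query}"


# Values copied from the first log line of the report.
url = blob_url(
    "10.4.4.82:5000",
    "nginx/nginx",
    "sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469",
    "public.ecr.aws",
)
print(url)
```

The resulting URL could then be requested from inside the cluster network to check whether the 500 response is reproducible independently of containerd.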

Anything else we need to know?:
Similar upstream issue: distribution/distribution#4383

Credits to @dimitar-kostadinov for this finding

Environment:

  • Gardener version (if relevant):
  • Extension version: v0.10.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-prow gardener-prow bot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug labels Sep 19, 2024
@ialidzhikov ialidzhikov changed the title Investigate the registry cache cannot pull blobs from ECR Investigate why the registry cache cannot pull blobs from ECR Sep 19, 2024
@oliver-goetz
Member

I just stumbled over this issue 😄
Does this occur on all shoots or on AWS shoots only? If it is the latter, it might be related to some AWS credential helpers and the permissions of the VMs or their service accounts.

I can remember similar issues for gcsweb on prow on GCP when the VMs had a GCP service-account. The application was aware of the service account because of the GCP metadata service and tried to use it. The service account did not have permissions to access the storage buckets (it was a storage bucket in the gcsweb case) at all. Even though the bucket was public, gcsweb could not access it.
I could imagine that something similar could happen for other applications on other hyperscalers too.

It is just an idea which came to my mind. I did not investigate the registry-cache case at all yet.

@dimitar-kostadinov
Contributor

The issue occurs on all shoots, even in the local setup.
What we observe is that image indexes and manifests are successfully cached, but the image layers download fails with http.response.status=500.
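One way to check this pattern in the registry logs is to split each key=value line into fields and classify the request path. The sketch below is a minimal illustration, not part of the registry-cache tooling: `parse_logfmt` and `request_kind` are hypothetical helpers, and the sample is a shortened copy of the first log line from the report.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def request_kind(uri: str) -> str:
    """Classify a registry API path as a manifest or blob request."""
    path = uri.split("?", 1)[0]
    if "/manifests/" in path:
        return "manifest"
    if "/blobs/" in path:
        return "blob"
    return "other"


# Shortened copy of the first log line from the report (other fields omitted).
sample = (
    'time="2024-09-19T14:23:24.143601968Z" level=error '
    'msg="response completed with error" err.detail="unauthorized: " '
    'http.request.uri="/v2/nginx/nginx/blobs/sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469?ns=public.ecr.aws" '
    'http.response.status=500'
)
entry = parse_logfmt(sample)
print(request_kind(entry["http.request.uri"]), entry["http.response.status"])
```

Grouping the parsed entries by request kind and status would show whether only blob requests return 500 while manifest requests succeed.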

@erfanw

erfanw commented Oct 13, 2024

I found some weirdness about public ECR:

https://docs.aws.amazon.com/AmazonECR/latest/public/public-registry-auth.html
According to the authentication documentation for public ECR, Amazon ECR Public supports the [Docker Registry HTTP API](https://docs.docker.com/registry/spec/api/), with the exception of the tags API. However, you must provide an authorization token with every HTTP request.

When I tried, for example, curl -u AWS:<ecr-public-password> https://public.ecr.aws/v2, it didn't work ({"errors":[{"code":"DENIED","message":"Your Authorization Token is invalid."}]}). So, according to the document above, a token has to be injected into every request when using the HTTP API.

I think this is the same reason why the distribution registry (I used 3.0.0-beta.1) is not working with public ECR.

However, when it comes to private ECR, the same curl -u AWS:<ecr-private-password> https://aws_account_id.dkr.ecr.region.amazonaws.com/v2 does work, and distribution registry 3.0.0-beta.1 can work as a pull-through cache for private ECR.

I can only conclude that this is a limitation with public ECR.
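For illustration, the difference between the two request styles can be sketched as follows. `curl -u` sends HTTP Basic credentials, whereas a token "with every HTTP request" is typically sent as a Bearer header. The helper names and placeholder values are hypothetical, and this thread does not confirm that the header scheme is the actual cause of the failure.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Authorization header value that `curl -u username:password` sends."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Authorization header value for token-based (Bearer) authentication."""
    return "Bearer " + token


# Placeholder values only; these are not real credentials.
print(basic_auth_header("AWS", "example-password"))
print(bearer_auth_header("example-token"))
```

Comparing the headers a client actually sends to the public and the private endpoint may help narrow down where the two diverge.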
