Adding docker image caching#4860
Conversation
Greptile SummaryThis PR restructures the CI pipeline to build Docker images once, push them to ECR with a commit-SHA tag, and have all test jobs pull from ECR — eliminating redundant multi-gigabyte rebuilds. It also switches layer caching from GitHub Actions cache to ECR registry caching with per-branch keys. The architectural intent is sound and the code logic is well-structured. However, the implementation has two blocking issues that will prevent the new workflow from functioning:
Additionally, on the Confidence Score: 1/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant GH as GitHub Actions
participant ECR as ECR Registry
participant Runner as Self-hosted Runner (GPU)
GH->>Runner: Trigger build-base-image job
Runner->>ECR: (auth missing) docker buildx build --push
Note over Runner,ECR: ECR_CACHE_URL undefined → malformed tag
ECR-->>Runner: ❌ auth error / invalid ref
GH->>Runner: Trigger build-curobo-image job (parallel)
Runner->>ECR: (auth missing) docker buildx build --push
ECR-->>Runner: ❌ auth error / invalid ref
Note over GH,Runner: If build jobs succeeded...
GH->>Runner: test-isaaclab-tasks (needs build-base-image)
Runner->>ECR: docker pull <ecr-image-tag>
Note over Runner,ECR: No ECR login → pull fails
ECR-->>Runner: ❌ auth error
GH->>Runner: test-isaaclab-tasks-2 (parallel)
GH->>Runner: test-general (parallel)
GH->>Runner: test-curobo (needs build-curobo-image)
GH->>Runner: combine-results (needs all test jobs)
Last reviewed commit: 8e162ae |
| ECR_IMAGE_TAG="${ECR_CACHE_URL}:pr-${{ github.event.pull_request.number }}-${SHA}" | ||
| else | ||
| ECR_IMAGE_TAG="${ECR_CACHE_URL}:${BRANCH}-${SHA}" | ||
| fi | ||
| echo "ecr_image_tag=${ECR_IMAGE_TAG}" >> $GITHUB_OUTPUT | ||
| echo "ecr_cache_tag=${ECR_CACHE_URL}:cache-base-${BRANCH}" >> $GITHUB_OUTPUT | ||
| echo "ecr_cache_fallback=${ECR_CACHE_URL}:cache-base-develop" >> $GITHUB_OUTPUT |
There was a problem hiding this comment.
ECR_CACHE_URL is never defined
The shell variable ECR_CACHE_URL is used to construct every ECR tag in this step, but it is never declared anywhere in this workflow — not in the top-level env: block, not as a job-level env, and not as a secret/variable mapping like ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }} or ECR_CACHE_URL: ${{ vars.ECR_CACHE_URL }}.
As a result, at runtime ECR_CACHE_URL expands to the empty string, producing malformed tags such as :pr-123-abc123 instead of <registry>.dkr.ecr.<region>.amazonaws.com/<repo>:pr-123-abc123. docker buildx build --push will fail with an invalid reference error, and docker pull in the test jobs will fail too.
The fix is to expose the value through a secret or repository variable, e.g. add to the env: block at the top of the workflow:
ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }}
or
ECR_CACHE_URL: ${{ vars.ECR_CACHE_URL }}
The same omission exists in .github/workflows/postmerge-ci.yml (lines 97–98).
| ECR_CACHE_WRITE="${ECR_CACHE_URL}:cache-base-${SAFE_BRANCH_NAME}" | ||
| ECR_CACHE_READ_FALLBACK="${ECR_CACHE_URL}:cache-base-develop" |
There was a problem hiding this comment.
ECR_CACHE_URL is never defined
ECR_CACHE_URL is used here to construct the cache tags but is never declared in the workflow's env: block or as a secret/variable mapping. At runtime the variable will be empty, producing invalid registry references like :cache-base-develop (missing the host/repository prefix). The --cache-from and --cache-to flags will be passed malformed ref= values, which will either silently no-op or cause docker buildx build to fail.
Add the following to the top-level env: section of this workflow:
ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }}cbd2f2a to
18a7db9
Compare
18a7db9 to
d705220
Compare
|
It seems to fail at getting image from nvcr.io even though I can pull when ssh-d into the machine. Weird... |
|
Merged with #4877. |
Description
Adding Docker image caching to CI