Skip to content

Adding docker image caching#4860

Closed
nv-apoddubny wants to merge 13 commits intoisaac-sim:developfrom
nv-apoddubny:apoddubny/docker-ci
Closed

Adding docker image caching#4860
nv-apoddubny wants to merge 13 commits intoisaac-sim:developfrom
nv-apoddubny:apoddubny/docker-ci

Conversation

@nv-apoddubny
Copy link
Collaborator

Description

Adding Docker image caching to CI

@nv-apoddubny nv-apoddubny requested a review from myurasov-nv March 6, 2026 22:46
@nv-apoddubny nv-apoddubny marked this pull request as draft March 6, 2026 22:47
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR restructures the CI pipeline to build Docker images once, push them to ECR with a commit-SHA tag, and have all test jobs pull from ECR — eliminating redundant multi-gigabyte rebuilds. It also switches layer caching from GitHub Actions cache to ECR registry caching with per-branch keys. The architectural intent is sound and the code logic is well-structured.

However, the implementation has two blocking issues that will prevent the new workflow from functioning:

  • ECR_CACHE_URL is undefined in both build.yml and postmerge-ci.yml. The variable is used throughout both files to construct every ECR image/cache tag, but it is never declared in any env: block or mapped from a secret/variable. At runtime the variable will be empty, producing malformed tags (e.g. :pr-123-abc instead of <account>.dkr.ecr.<region>.amazonaws.com/<repo>:pr-123-abc). Every docker push and docker pull in the pipeline will fail with an invalid reference error.
  • No ECR authentication is performed in any of the new build or test jobs. Pushing to and pulling from a private ECR registry requires an explicit login step (e.g. aws ecr get-login-password | docker login) that is absent from build-base-image, build-curobo-image, and all four test jobs in build.yml.

Additionally, on the develop branch, ecr_cache_tag and ecr_cache_fallback resolve to the same value, resulting in a redundant --cache-from argument (minor, style-level).

Confidence Score: 1/5

  • Not safe to merge — two critical runtime blockers (missing ECR_CACHE_URL definition and missing ECR authentication) will cause every job in both workflows to fail.
  • The undefined ECR_CACHE_URL variable means all computed image/cache tags are malformed, and the absent ECR login steps mean push and pull operations will be rejected by the registry. Both issues are blocking: the pipeline cannot build, push, or pull any images until they are resolved.
  • .github/workflows/build.yml and .github/workflows/postmerge-ci.yml both need ECR_CACHE_URL defined and ECR login steps added to every job that interacts with the registry.

Important Files Changed

Filename Overview
.github/actions/docker-build/action.yml Refactored docker build composite action to support ECR registry caching and push mode; adds push, ecr-cache-tags, and ecr-write-cache-tag inputs and conditionally assembles --cache-from/--cache-to/--push flags. Logic is sound but relies on ECR_CACHE_URL being defined in the calling workflow (which it currently is not).
.github/workflows/build.yml Restructured CI pipeline into distinct build and test phases; introduces build-base-image and build-curobo-image jobs that push to ECR, with test jobs pulling pre-built images. Two critical issues: ECR_CACHE_URL is never defined (causing all ECR tags to be malformed), and there is no ECR authentication step in any of the new jobs.
.github/workflows/postmerge-ci.yml Replaced GHA cache with ECR registry cache for --cache-from/--cache-to in the post-merge build; ECR_CACHE_URL is referenced but never declared in the workflow env, making all constructed cache references invalid at runtime.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions
    participant ECR as ECR Registry
    participant Runner as Self-hosted Runner (GPU)

    GH->>Runner: Trigger build-base-image job
    Runner->>ECR: (auth missing) docker buildx build --push
    Note over Runner,ECR: ECR_CACHE_URL undefined → malformed tag
    ECR-->>Runner: ❌ auth error / invalid ref

    GH->>Runner: Trigger build-curobo-image job (parallel)
    Runner->>ECR: (auth missing) docker buildx build --push
    ECR-->>Runner: ❌ auth error / invalid ref

    Note over GH,Runner: If build jobs succeeded...
    GH->>Runner: test-isaaclab-tasks (needs build-base-image)
    Runner->>ECR: docker pull <ecr-image-tag>
    Note over Runner,ECR: No ECR login → pull fails
    ECR-->>Runner: ❌ auth error

    GH->>Runner: test-isaaclab-tasks-2 (parallel)
    GH->>Runner: test-general (parallel)
    GH->>Runner: test-curobo (needs build-curobo-image)

    GH->>Runner: combine-results (needs all test jobs)
Loading

Last reviewed commit: 8e162ae

Comment on lines +59 to +65
ECR_IMAGE_TAG="${ECR_CACHE_URL}:pr-${{ github.event.pull_request.number }}-${SHA}"
else
ECR_IMAGE_TAG="${ECR_CACHE_URL}:${BRANCH}-${SHA}"
fi
echo "ecr_image_tag=${ECR_IMAGE_TAG}" >> $GITHUB_OUTPUT
echo "ecr_cache_tag=${ECR_CACHE_URL}:cache-base-${BRANCH}" >> $GITHUB_OUTPUT
echo "ecr_cache_fallback=${ECR_CACHE_URL}:cache-base-develop" >> $GITHUB_OUTPUT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECR_CACHE_URL is never defined

The shell variable ECR_CACHE_URL is used to construct every ECR tag in this step, but it is never declared anywhere in this workflow — not in the top-level env: block, not as a job-level env, and not as a secret/variable mapping like ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }} or ECR_CACHE_URL: ${{ vars.ECR_CACHE_URL }}.

As a result, at runtime ECR_CACHE_URL expands to the empty string, producing malformed tags such as :pr-123-abc123 instead of <registry>.dkr.ecr.<region>.amazonaws.com/<repo>:pr-123-abc123. docker buildx build --push will fail with an invalid reference error, and docker pull in the test jobs will fail too.

The fix is to expose the value through a secret or repository variable, e.g. add to the env: block at the top of the workflow:

ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }}

or

ECR_CACHE_URL: ${{ vars.ECR_CACHE_URL }}

The same omission exists in .github/workflows/postmerge-ci.yml (lines 97–98).

Comment on lines +97 to +98
ECR_CACHE_WRITE="${ECR_CACHE_URL}:cache-base-${SAFE_BRANCH_NAME}"
ECR_CACHE_READ_FALLBACK="${ECR_CACHE_URL}:cache-base-develop"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECR_CACHE_URL is never defined

ECR_CACHE_URL is used here to construct the cache tags but is never declared in the workflow's env: block or as a secret/variable mapping. At runtime the variable will be empty, producing invalid registry references like :cache-base-develop (missing the host/repository prefix). The --cache-from and --cache-to flags will be passed malformed ref= values, which will either silently no-op or cause docker buildx build to fail.

Add the following to the top-level env: section of this workflow:

ECR_CACHE_URL: ${{ secrets.ECR_CACHE_URL }}

@myurasov-nv myurasov-nv force-pushed the apoddubny/docker-ci branch from cbd2f2a to 18a7db9 Compare March 7, 2026 05:42
@myurasov-nv myurasov-nv force-pushed the apoddubny/docker-ci branch from 18a7db9 to d705220 Compare March 7, 2026 05:43
@myurasov-nv
Copy link
Collaborator

It seems to fail at getting image from nvcr.io even though I can pull when ssh-d into the machine. Weird...

@myurasov-nv
Copy link
Collaborator

Merged with #4877.

@myurasov-nv myurasov-nv closed this Mar 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants