ci: Multi-tenancy for tests and garbage collection #9179

Merged
merged 2 commits into from
May 17, 2024

ko3n1g
Collaborator

@ko3n1g ko3n1g commented May 13, 2024

What does this PR do?

Makes the CI pipeline more robust by creating an isolated output directory per workflow run (keyed by github.run_id) and cleaning it up afterwards.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the CI label May 13, 2024
@ko3n1g ko3n1g changed the base branch from ko3n1g/ci/add-dockerfile to main May 13, 2024 15:54
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 13, 2024
@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from f34e2d7 to aae74a2 Compare May 13, 2024 16:12
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 13, 2024
@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from aae74a2 to 16f4454 Compare May 13, 2024 17:10
@ko3n1g ko3n1g removed the Run CICD label May 13, 2024
@ko3n1g ko3n1g changed the base branch from main to ko3n1g/ci/add-dockerfile May 13, 2024 17:10
@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from 16f4454 to c6dc27a Compare May 13, 2024 17:19
@ko3n1g ko3n1g requested a review from pablo-garay May 13, 2024 17:19
@ko3n1g ko3n1g marked this pull request as ready for review May 13, 2024 17:19
@ko3n1g
Collaborator Author

ko3n1g commented May 13, 2024

Hey @pablo-garay, this is a proposal for how we could make our CI pipeline a bit more stable. We could isolate the folders that are being written to and garbage-collect them with if: always() steps.

The only caveat I can see is that templating our tests might become a bit trickier, but I don't think it would be a blocker. See #9177 for reference.
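As a rough illustration of the isolation idea, the sketch below derives a per-run output directory, writes into it, and removes it again. It runs against a throwaway temporary directory rather than the shared TestData path. The `run_dir` helper, the `model.nemo` file name, and the local fallback ID are assumptions made for the example; only `GITHUB_RUN_ID` (the environment-variable form of `github.run_id`) comes from GitHub Actions itself.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: build "<base>/<run id>" so concurrent runs never share a directory.
# GITHUB_RUN_ID is provided by GitHub Actions; the fallback only exists for local experiments.
run_dir() {
  local base="$1"
  local run_id="${GITHUB_RUN_ID:-local-$$}"
  printf '%s/%s\n' "$base" "$run_id"
}

# Throwaway base directory standing in for the shared test-data location.
BASE="$(mktemp -d)"
OUT="$(run_dir "$BASE")"

# Isolate: this run writes only below its own subdirectory.
mkdir -p "$OUT"
echo "checkpoint" > "$OUT/model.nemo"   # placeholder for the real job output

# Garbage-collect: in the workflow this would live in a separate `if: always()` step.
rm -rf "$OUT"
```

In the workflow itself, the create and remove halves are split across two steps so that the removal still runs when the test step fails.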

@@ -270,10 +270,15 @@ jobs:
    - name: Checkout repository
      uses: actions/checkout@v4
    - run: |
        mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }};
Collaborator

You can just make it be:

        mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }};
        python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \
        --input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \
        --output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; 
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/

Collaborator

No need for another Cleanup step. Do it all in a single script.

Collaborator Author

@ko3n1g ko3n1g May 13, 2024

Yes, that would help us keep a more unified test case template, but we would then need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation.

Not a big issue. What do you prefer?
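For comparison, a single-script variant could lean on a shell EXIT trap so that cleanup still runs when the script itself fails. This is a different technique from the `if: always()` step used in this PR, and it may not cover the cancellation case described above if the process is killed before the trap can run. The `convert` and `cleanup` function names, the temporary base directory, and the simulated failure are all assumptions made for the sketch.

```shell
#!/usr/bin/env bash
set -eu

BASE="$(mktemp -d)"                      # stand-in for the shared test-data directory
OUT="$BASE/${GITHUB_RUN_ID:-local-$$}"   # per-run output directory

# Remove only this run's directory.
cleanup() {
  rm -rf "$OUT"
}

# The function body is a subshell, so the EXIT trap fires when that subshell
# ends, whether it ends normally or through an error exit.
convert() (
  trap cleanup EXIT
  mkdir -p "$OUT"
  echo "converted" > "$OUT/model.nemo"   # placeholder for the real converter call
  exit 3                                 # simulate a failing conversion
)

status=0
convert || status=$?
echo "convert exited with status $status"
```

The failure status is still reported to the caller, while the per-run directory is gone by the time the script continues.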

Collaborator

"we would need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation."

Ah, good point. Let's do it as you have it here then. A scheduled workflow would have the same issue, since another workflow could be running that test at the same time, and it would be a synchronization mess.

Leave it as it is now then :) and please make the analogous change for all the other tests prefixed with "L2_Community_LLM_Checkpoints_tests_".

I.e. all the tests that are affected by this:
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights
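A sketch of how one cleanup script might cover several such directories while leaving other runs' outputs alone. The `cleanup_run_dirs` helper and the per-run subdirectory layout under each base directory are assumptions for illustration; the demo uses a temporary directory instead of the real TestData paths.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: remove one run's subdirectory from each given base directory.
# The ${var:?} expansions abort instead of expanding to an empty path.
cleanup_run_dirs() {
  local run_id="$1"
  shift
  local base
  for base in "$@"; do
    rm -rf "${base:?}/${run_id:?}"
  done
}

# Demo layout: two concurrent runs (111 and 222) sharing the same base directories.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/falcon-ci-hf/111" "$ROOT/falcon-ci-hf/222" "$ROOT/starcoder-ci-hf/111"

# Clean up run 111 only; run 222 keeps its outputs.
cleanup_run_dirs 111 "$ROOT/falcon-ci-hf" "$ROOT/starcoder-ci-hf"
```

Because each removal is scoped to a run ID, a job that is still running under a different ID is not affected.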

@pablo-garay
Collaborator

Fantastic proposal

We need this for all tests whose names start with the prefix L2_Community_LLM_Checkpoints_tests_.
Could you please make the proposed change and add it for the other tests? :)

@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from bee87e7 to ff4014d Compare May 15, 2024 20:13
@ko3n1g ko3n1g changed the base branch from ko3n1g/ci/add-dockerfile to main May 15, 2024 20:13
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 15, 2024
@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from 24f2f3c to 226245e Compare May 16, 2024 07:27
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 16, 2024
@ko3n1g ko3n1g requested a review from pablo-garay May 16, 2024 07:50
      if: "always()"
      run: |
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo;
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/
    - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
      if: "failure()"

Collaborator

We just need to do this for all the tests that I mentioned to you offline.

@ko3n1g ko3n1g force-pushed the ko3n1g/ci/multi-tenancy-for-tests branch from 32950d4 to 3b0377c Compare May 17, 2024 10:40
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 17, 2024
      run: |
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo;
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/
    - name: Cleanup
Collaborator

Not sure why there are two cleanups.
Will merge; we can do a follow-up PR to sort this out if needed.

@pablo-garay pablo-garay merged commit ce1612d into main May 17, 2024
132 checks passed
@pablo-garay pablo-garay deleted the ko3n1g/ci/multi-tenancy-for-tests branch May 17, 2024 15:34

3 participants