ci: Multi-tenancy for tests and garbage collection #9179

Merged
merged 2 commits into from
May 17, 2024
25 changes: 23 additions & 2 deletions .github/workflows/cicd-main.yml
@@ -225,6 +225,10 @@ jobs:
--input_name_or_path=/home/TestData/nlp/megatron_llama/llama-ci-hf-tiny \
--output_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
--precision=16
- name: Cleanup
if: "always()"
run: |
rm -rf /home/TestData/nlp/megatron_llama/model_weights
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

@@ -251,6 +255,10 @@ jobs:
--output_path=/home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo \
--precision=16
rm -f /home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo
- name: Cleanup
if: "always()"
run: |
rm -rf /home/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

@@ -272,10 +280,19 @@ jobs:
- name: Checkout repository
uses: actions/checkout@v4
- run: |
mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }};
Collaborator

You can just make it be:

        mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }};
        python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \
        --input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \
        --output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; 
        rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/

Collaborator

No need for another Cleanup step. Do it all in one script.

Collaborator Author

@ko3n1g, May 13, 2024

Yes, that would help us keep a more unified test case template, but we would then need a different (maybe scheduled?) workflow that cleans up directories left behind when a workflow is cancelled.

Not a big issue. What do you prefer?
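The trade-off in this thread can be sketched in plain shell. The helper below is hypothetical and not part of the PR: `run_in_scoped_dir`, its arguments, and the convention of passing the scratch directory as the last argument are assumptions made for illustration. It gives each run its own directory (standing in for `${{ github.run_id }}`) and removes it whether the command succeeds or fails. It does not cover cancellation: if the shell itself is killed, the `rm -rf` never runs, which is the gap that a separate step with `if: "always()"` is meant to close.

```shell
# Hypothetical helper (not from the PR): run a command against a run-scoped
# scratch directory and remove that directory afterwards.
# Usage: run_in_scoped_dir BASE_DIR RUN_ID COMMAND [ARGS...]
# The scratch directory path is appended as the last argument of COMMAND.
run_in_scoped_dir() {
  local base_dir="$1"
  local run_id="$2"
  shift 2
  local work_dir="${base_dir}/${run_id}"
  local status=0

  # One directory per run, so concurrent runs do not overwrite each other.
  mkdir -p "${work_dir}"

  # Record the exit code instead of aborting, so cleanup still happens on failure.
  "$@" "${work_dir}" || status=$?

  # Cleanup runs on success and on failure, but NOT if this shell is killed
  # (for example, when the job is cancelled mid-command).
  rm -rf "${work_dir}"

  return "${status}"
}
```

A call such as `run_in_scoped_dir /tmp/ci-scratch 12345 my_converter` would run `my_converter /tmp/ci-scratch/12345` and then delete that directory; `my_converter` and the paths here are placeholders.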

Collaborator

> we would need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation.

Ah, good point. Let's do it as you have it here, then. A scheduled workflow would have the same issue, since another workflow could be running that test at the same time, and it would be a synchronization mess.

Leave it as it is now, then :) And please make the analogous change for all the other tests prefixed "L2_Community_LLM_Checkpoints_tests_".

I.e. all the tests that are affected by this:
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights
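The repeated `rm` commands above all follow one pattern: remove the `model_weights` directory under a checkpoint directory. A minimal sketch of that pattern as a reusable function follows; `cleanup_model_weights` is a hypothetical name introduced here, not something defined in the PR or in the NeMo repository.

```shell
# Hypothetical helper (not from the PR): remove the leftover model_weights
# directory from each checkpoint directory passed as an argument.
cleanup_model_weights() {
  local ckpt_dir
  for ckpt_dir in "$@"; do
    # rm -rf does not fail if the directory is already gone, so the
    # function is safe to run repeatedly or after a partial run.
    rm -rf "${ckpt_dir}/model_weights"
  done
}
```

Called with the checkpoint directories listed in the comment (for example `cleanup_model_weights /mnt/datadrive/TestData/nlp/megatron_llama`), it removes only the `model_weights` subdirectory and leaves other files in each directory untouched.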

python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \
--input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \
--output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf
rm -f /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo
--output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}
- name: Cleanup
if: "always()"
run: |
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo;
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/
- name: Cleanup
Collaborator

Not sure why there are two cleanups.
Will merge; we can do a follow-up PR to sort this out if/as needed.

if: "always()"
run: |
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

Collaborator

We just need to do this for all the tests that I mentioned to you offline.

@@ -301,6 +318,10 @@ jobs:
--input_name_or_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf \
--output_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo
rm -f /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo
- name: Cleanup
if: "always()"
run: |
rm -rf /home/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"
