ci: Multi-tenancy for tests and garbage collection #9179
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -225,6 +225,10 @@ jobs: | |
--input_name_or_path=/home/TestData/nlp/megatron_llama/llama-ci-hf-tiny \ | ||
--output_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ | ||
--precision=16 | ||
- name: Cleanup | ||
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_llama/model_weights | ||
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" | ||
if: "failure()" | ||
|
||
|
@@ -251,6 +255,10 @@ jobs: | |
--output_path=/home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo \ | ||
--precision=16 | ||
rm -f /home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo | ||
- name: Cleanup | ||
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights | ||
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" | ||
if: "failure()" | ||
|
||
|
@@ -272,10 +280,19 @@ jobs: | |
- name: Checkout repository | ||
uses: actions/checkout@v4 | ||
- run: | | ||
mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}; | ||
python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \ | ||
--input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \ | ||
--output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf | ||
rm -f /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo | ||
--output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }} | ||
- name: Cleanup | ||
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/ | ||
- name: Cleanup | ||
Not sure why there are two cleanups. |
||
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights | ||
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" | ||
if: "failure()" | ||
|
||
We just need to do this for all the tests that I mentioned to you offline. |
||
|
@@ -301,6 +318,10 @@ jobs: | |
--input_name_or_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf \ | ||
--output_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo | ||
rm -f /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo | ||
- name: Cleanup | ||
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights | ||
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" | ||
if: "failure()" | ||
|
||
|
You can just make it be:
No need for another Cleanup step. Make it all in just one script
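For illustration, a minimal shell sketch of what "all in just one script" could look like, assuming an `EXIT` trap is used for the cleanup. The `scratch` path and the `weights.bin` file are hypothetical stand-ins, not paths from this PR. Note that a trap does not run if the process is killed outright (for example with SIGKILL), so this alone would not cover every cancellation case.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for a model_weights directory.
scratch="$(mktemp -d)/model_weights"
mkdir -p "$scratch"

# Run the step in a subshell so its EXIT trap fires when the subshell ends,
# whether the body succeeds or fails.
status=0
(
  trap 'rm -rf "$scratch"' EXIT
  touch "$scratch/weights.bin"   # stand-in for the conversion output
  exit 1                         # simulate a failing conversion step
) || status=$?

echo "step exit status: $status"
```

The subshell keeps the failure status available to the caller while the trap still removes the directory.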
Yes, that would help us keep a more unified test case template, but we would need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation.
Not a big issue; what do you prefer?
Ah, good point. Let's do as you have it here then. A scheduled workflow would have the same issue, since another workflow could be running that test at the same time, and it would be a synchronization mess.
Leave it as it is now then :) and please make the analogous change for all the other tests prefixed "L2_Community_LLM_Checkpoints_tests_"
I.e. all the tests that are affected by this:
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights
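The per-run isolation this PR applies to the StarCoder test (an output directory keyed by `${{ github.run_id }}`) can be sketched in plain shell as below. This is only an illustration under stated assumptions: `BASE` is a temporary directory standing in for the shared TestData path, and both run IDs and the `model.nemo` file name are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins: in the workflow, BASE would be the shared TestData
# directory and RUN_ID would come from ${{ github.run_id }}.
BASE="$(mktemp -d)"
RUN_ID="12345"
OTHER_RUN_ID="67890"

run_dir="$BASE/$RUN_ID"
other_dir="$BASE/$OTHER_RUN_ID"

# Each run writes only into its own subdirectory.
mkdir -p "$run_dir" "$other_dir"
touch "$run_dir/model.nemo" "$other_dir/model.nemo"

# Cleanup removes this run's directory and nothing else.
# ${run_dir:?} aborts if the variable is empty, guarding against "rm -rf /".
cleanup() {
  rm -rf "${run_dir:?}"
}
cleanup
```

Because the cleanup targets only the directory named after its own run, a concurrent run's output is left untouched, which is the synchronization concern raised above.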