Fix BERT pretraining trn1 example #435

Merged 1 commit on Feb 17, 2024

@5cp (Contributor) commented Feb 17, 2024

What does this PR do?

Fixes three issues with the BERT pretraining example for trn1:

  • adds missing Volcano queue
  • pins Neuron version to 2.16.1 to avoid CCOM timeouts
  • adjusts torchx entrypoint / args to avoid 'filename too long' errors

Motivation

A user reported an issue with this example.

More

Tested 2-node jobs on both trn1.32xl and trn1n.32xl instances.

For Moderators

  • E2E test successfully completed before merge?

Additional Notes

@vara-bonthu vara-bonthu merged commit 56bacb1 into awslabs:main Feb 17, 2024
20 of 22 checks passed
2 participants