How to set instance recreation times count (exceeded maximum number of attempts error on start up)? #1440

Closed
OLSecret opened this issue Feb 5, 2024 · 1 comment

Comments


OLSecret commented Feb 5, 2024

Summary / Background

I want to run an 8xA100 instance on EC2. Because this instance type is in high demand, I don't expect capacity to become available within 3 retries, or even within 100. I want the runner to keep retrying, as often as once per second, for X hours until the instance is ready (like a bot).

error example:

{"level":"info","message":"iterative_cml_runner.runner: Creating..."}
{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 10s"}
{"level":"error","message":"terraform error: Error: Failed creating the machine: Not able to decode: operation error EC2: RunInstances, exceeded maximum number of attempts, 3, https response error StatusCode: 500, RequestID: 78dbfe11, api error InsufficientInstanceCapacity: We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c."}
{"level":"info","message":"::error::Terraform exited with code 1."}

Scope

I would like an option to set the number of retry attempts to X, or to retry indefinitely, when starting an instance. Is there a hidden option for this, or at least a way to set the retry count to something like 99999999?

OLSecret added the epic (Collection of sub-issues) label Feb 5, 2024
OLSecret changed the title from "How to set instance recreation times count?" to "How to set instance recreation times count (exceeded maximum number of attempts error on start up)?" Feb 5, 2024
0x2b3bfa0 removed the epic (Collection of sub-issues) label May 10, 2024
0x2b3bfa0 (Member) commented

We don't provide any inbuilt mechanism to do that, but you can always retry at the shell level.

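for attempt in {1..100}; do
  # Stop retrying as soon as the runner command succeeds
  cml runner ... && break
  # Brief pause between attempts
  sleep 1
done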
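
Since the original request was to retry for X hours rather than for a fixed number of attempts, a time-bounded variant of the same shell-level idea might look like the sketch below. The `retry_for` helper is hypothetical (it is not part of CML); it assumes the wrapped command exits non-zero when instance creation fails, as in the log above where Terraform exited with code 1.

```shell
#!/usr/bin/env bash
# retry_for is a hypothetical helper, not a CML feature.
# Usage: retry_for <budget-in-seconds> <pause-in-seconds> <command> [args...]
# Re-runs the command until it exits 0 or the time budget runs out.
retry_for() {
  local deadline interval
  deadline=$(( $(date +%s) + $1 ))
  interval=$2
  shift 2
  until "$@"; do
    # The command failed: give up if the time budget is exhausted
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "retry_for: giving up, time budget exhausted" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With this in place, something like `retry_for $((6 * 3600)) 1 cml runner ...` would keep trying roughly once per second for up to six hours, with the runner arguments left exactly as you already pass them. The deadline is only checked between attempts, so a single long-running attempt can overshoot the budget.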
