Summary / Background
I want to run an 8xA100 instance on EC2. Because it is in high demand, I do not expect capacity to become available within 3 retries, or even within 100. I want it to retry every second for X hours until the instance is ready (like a bot).
Error example:
{"level":"info","message":"iterative_cml_runner.runner: Creating..."}
{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 10s"}
{"level":"error","message":"terraform error: Error: Failed creating the machine: Not able to decode: operation error EC2: RunInstances, exceeded maximum number of attempts, 3, https response error StatusCode: 500, RequestID: 78dbfe11, api error InsufficientInstanceCapacity: We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c."}
{"level":"info","message":"::error::Terraform exited with code 1."}
Scope
I would like an option to set the number of retry attempts to X, or to retry indefinitely, when starting an instance. Is there a hidden option for this, or at least a way to set the retry count to 99999999?
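Until such an option is confirmed, one possible workaround is an outer retry loop around whatever command launches the runner. The sketch below is only an illustration of that idea, not a built-in feature: `retry_until_deadline` and `make_attempt` are hypothetical helper names, and the launch command itself is left for the caller to supply. It also cycles through availability zones, since the error message above suggests us-west-2b and us-west-2c may have capacity.

```python
import itertools
import subprocess
import time
from typing import Callable, Iterable, List, Optional


def retry_until_deadline(
    attempt: Callable[[str], bool],
    zones: Iterable[str],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call attempt(zone) repeatedly, cycling through zones, until it
    returns True or timeout_s seconds have elapsed.

    Returns the zone that succeeded, or None if the deadline passed
    (or if no zones were given).
    """
    deadline = clock() + timeout_s
    for zone in itertools.cycle(list(zones)):
        if attempt(zone):
            return zone
        # Stop if the next wait would take us past the deadline.
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)
    return None


def make_attempt(build_command: Callable[[str], List[str]]) -> Callable[[str], bool]:
    """Wrap a caller-supplied command builder (hypothetical) into an attempt.

    build_command(zone) must return the argv list for your own launch
    command; a zero exit code is treated as success.
    """
    def attempt(zone: str) -> bool:
        return subprocess.run(build_command(zone)).returncode == 0
    return attempt
```

For example, `retry_until_deadline(make_attempt(my_command_for_zone), ["us-west-2a", "us-west-2b", "us-west-2c"], timeout_s=6 * 3600)` would keep trying for up to six hours, where `my_command_for_zone` is a function you write that returns your launch command for a given zone. A fixed one-second interval matches the request above, but a longer or growing interval may be kinder to API rate limits.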
OLSecret changed the title from "How to set instance recreation times count?" to "How to set instance recreation times count (exceeded maximum number of attempts error on start up)?" on Feb 5, 2024