Cannot create certain GCP GPU instances #1398

awendel-presien · 2023-07-03T01:50:12Z

Hi everyone,

I'm getting the following error when trying to create A100 or L4 based instances on GCP using cml runner launch (a2-highgpu and g2-standard types respectively):

***"level":"error","message":"terraform error: Error: Failed creating the machine: googleapi: Error 400: Instances with guest accelerators do not support live migration., badRequest"***

I have no problem creating V100 and T4 instances (both n1 types).

I have found this discussion, which suggests the maintenance policy needs to be set to TERMINATE. Am I on the right track, and if yes, is there a way to do that using cml runner launch?

Regards,
Alex.

0x2b3bfa0 · 2023-07-03T03:48:28Z

Hello, @awendel-presien! It looks like the instance maintenance behavior is already being set to TERMINATE when creating GPU instances. 🤔

awendel-presien · 2023-07-03T04:02:36Z

Hi @0x2b3bfa0, thanks for having a look at this! Do you have any other ideas as to why it might return that error for the newer a2-highgpu and g2-standard instance types?

awendel-presien · 2023-07-04T23:58:20Z

@0x2b3bfa0, unfortunately this error persists for us. As a test, I tried creating a g2-standard-4 instance using Terraform and the Iterative Terraform provider (so not using CML), and that worked without issue.

So the problem only occurs when trying to start g2 or a2 instances using cml runner launch.

Any ideas?

awendel-presien · 2023-07-07T01:23:33Z

We managed to get this working by including the GPU type and number in the --cloud-type option, e.g. g2-standard-96+nvidia-l4*8 instead of g2-standard-96.

I think this should at least be documented, because the GPU suffix is technically redundant: g2-standard-96 instances only come with 8x NVIDIA L4 GPUs. It's the same with a2 instances; for example, a2-highgpu-8g only comes with 8x A100 GPUs.

It's also not necessary to specify the number and type of GPUs when using the Terraform Provider Iterative directly - it works just fine when only providing the machine type. And cml runner launch does not require this for AWS instances; for example you can launch a g4dn.metal instance without specifying the type and number of GPUs.
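The workaround described above can be sketched as a small helper that builds the --cloud-type value. The function name gcp_cloud_type is hypothetical, and the machine+gpu*count format is inferred only from the values reported to work in this thread, not from a documented specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CML): builds the
# "<machine-type>+<gpu-type>*<count>" string that worked as a
# --cloud-type value in this thread.
gcp_cloud_type() {
    # $1 = machine type, $2 = GPU type, $3 = GPU count
    printf '%s+%s*%s\n' "$1" "$2" "$3"
}

# Example values taken from this thread.
gcp_cloud_type g2-standard-96 nvidia-l4 8          # prints g2-standard-96+nvidia-l4*8
gcp_cloud_type a2-highgpu-1g nvidia-tesla-a100 1   # prints a2-highgpu-1g+nvidia-tesla-a100*1
```

The result would then be passed as --cloud-type="$(gcp_cloud_type g2-standard-96 nvidia-l4 8)", quoted so that the shell does not try to expand the * character.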

hopeai · 2023-07-18T13:33:56Z

Hi @awendel-presien,

Did you manage to run any a2-highgpu instance using cml runner launch? I am getting the same error when I set --cloud-type=a2-highgpu-1g.

dacbd · 2023-07-18T21:34:15Z

@hopeai, can you try a2-highgpu+nvidia-a100*1 or a2-highgpu+nvidia-tesla-a100*1? We'll see if we can address this in the near future. In the past you had to select GPUs explicitly, and the GCP machine types didn't come with preselected GPU options the way, for example, the AWS instance types do.

hopeai · 2023-07-19T03:35:07Z

Thanks @dacbd, I was able to solve this problem by setting --cloud-type=a2-highgpu-1g+nvidia-tesla-a100*1. BTW, how do you deal with the resource availability problem? Is there a plan to address this in the near future?

error: terraform error: Error: Failed creating the machine: Operation error: compute.OperationErrorErrors{Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS", ErrorDetails:[]*compute.OperationErrorErrorsErrorDetails{(*compute.OperationErrorErrorsErrorDetails)(0xc000336870), (*compute.OperationErrorErrorsErrorDetails)(0xc000336960), (*compute.OperationErrorErrorsErrorDetails)(0xc000336cd0)}, Location:"", Message:"The zone 'projects/MY_PROJECT/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.", ForceSendFields:[]string(nil), NullFields:[]string(nil)}
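A script wrapping the launch may want to tell this capacity error apart from configuration errors, such as the live-migration error earlier in this thread. As a minimal sketch, assuming the error code appears verbatim in the captured output, a hypothetical helper (is_zone_exhausted is not a CML command) could look like this; the sample log string is abbreviated from the message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given log text contains the GCP
# capacity error code seen in this thread.
is_zone_exhausted() {
    printf '%s\n' "$1" | grep -q 'ZONE_RESOURCE_POOL_EXHAUSTED'
}

# Abbreviated sample of the error text, for illustration only.
log='terraform error: Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"'
if is_zone_exhausted "$log"; then
    echo "capacity problem: another zone may work"
fi
```

Matching on the code prefix rather than the full message keeps the check working for both the plain and the _WITH_DETAILS variant of the code.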

dacbd · 2023-07-19T21:42:37Z

My recommendation, if it's something that you encounter often, would be to try some kind of simple bash loop, something like this:

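# Zones to try, in order of preference (bash array elements are space-separated).
zones=("us-central1-a" "us-central1-b" "us-central1-c")
for zone in "${zones[@]}"; do
    cml runner launch ... \
        --cloud-region="$zone" \
        ...
    # Stop at the first zone where the launch succeeds.
    if [ $? -eq 0 ]; then
          echo "Deployed runner in $zone"
          break
    else
          echo "Runner in $zone failed, trying next zone"
    fi
done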

(I haven't explicitly tested the above)

@hopeai we aren't doing much active development on CML for the moment, but if you want to add this feature yourself, I'd be happy to prioritize testing any pull requests you make, and releasing any new additions.

hopeai · 2023-07-20T04:51:54Z

Thanks for the recommendation, @dacbd. At the moment I'm using a similar bash loop, but I'd like to know if this is something that will be addressed in cml runner launch. It could be a --cloud-region-list option, or --cloud-region could accept more than one region to try.
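Until something like that exists, the idea of a region list can be approximated outside CML. The following is a minimal sketch, not a CML feature: launch_in_first_zone is a hypothetical wrapper that takes a comma-separated zone list plus the command to run, and appends --cloud-region=<zone> on each attempt. The fake_launch function is only a stand-in so that the sketch runs without cloud access.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the given command once per zone in a
# comma-separated list, stopping at the first success.
launch_in_first_zone() {
    local zone_list="$1"
    local old_ifs="$IFS"
    local zone
    shift
    IFS=','
    for zone in $zone_list; do   # split on commas (done once, at expansion)
        IFS="$old_ifs"
        if "$@" --cloud-region="$zone"; then
            echo "Runner deployed in $zone"
            return 0
        fi
        echo "Launch in $zone failed, trying next zone" >&2
    done
    IFS="$old_ifs"
    return 1
}

# Stand-in for the real launch command: succeeds only for us-central1-b.
fake_launch() {
    local arg
    for arg in "$@"; do
        [ "$arg" = "--cloud-region=us-central1-b" ] && return 0
    done
    return 1
}

launch_in_first_zone "us-central1-a,us-central1-b,us-central1-c" fake_launch --cloud=gcp
```

In real use, fake_launch would be replaced by cml runner launch followed by the remaining options. The wrapper retries on any failure, so a misconfigured launch would fail once per zone before giving up.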

Arslan-Mehmood1 · 2023-12-29T15:52:50Z

Check the quotas of your GCP account and try to provision resources accordingly via cml runner.

0x2b3bfa0 mentioned this issue Jul 3, 2023

Set maintenance behavior to TERMINATE when using GPU iterative/terraform-provider-iterative#755

Closed

0x2b3bfa0 self-assigned this Jul 3, 2023

0x2b3bfa0 added the bug (Something isn't working) and cloud-gcp (Google Cloud) labels Jul 3, 2023
