
[AN-144] Cost capping GPU support #7672

Draft
wants to merge 10 commits into
base: develop

Conversation

sam-schu

@sam-schu sam-schu commented Dec 12, 2024

Jira Ticket: AN-144

Follows #7583 (removing support for Nvidia Tesla K80 GPUs)

Description

Overview

This PR adds GPU support to cost capping in Cromwell. Specifically, when completed, we should be able to:

  • get information from both PAPI and the Batch API about the GPU(s) being used for a particular VM to run a task,
  • get the SKUs from the GCP cost catalog that correspond to GPUs and associated machine configurations supported by Cromwell and add them to our internal cost catalog,
  • convert VM GPU information to a cost catalog key,
  • use this key to look up the corresponding SKU from the internal cost catalog, and
  • extract the cost information from this SKU and use it to calculate the VM cost.

Note that only the first bullet above involves separate implementations for PAPI and Batch. The implementations for all other bullets should be Google backend-agnostic.

Let's go through what changes I have and have not made yet. More detailed descriptions of what I have and have not tested yet are saved for the following couple of sections.

Please also see the inline GitHub comments I added to this PR to clarify changes and highlight opportunities for improvement.

What is done so far:

Getting the GPU information for a VM from PAPI. This involved updating the logic that converts the Java maps that we get from the Google API into structured objects: the GPU count is stored in JSON as a string but stored in the corresponding object field as a long, and the deserialization code could not handle this case. That change was made in Deserialization.scala and unit tested in DeserializationSpec.scala and StringToNumberDeserializationTestClass.java. The code to actually get the GPU information for the VM is in GetRequestHandler.scala, and was manually tested for simple cases but not unit tested. InstantiatedVmInfo in GcpCostCatalogTypes.scala was also updated to hold the GPU information. The new field gpuInfo is an Option[GpuInfo] because not all VMs use GPUs, unlike with CPUs and RAM. We use our own GpuInfo class so as to be agnostic between the PAPI and Batch backends.
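The deserialization fix described above can be sketched roughly as follows. This is illustrative only, not the actual Deserialization.scala code (which operates on the Java maps returned by the Google API); the helper name is made up:

```scala
// Illustrative sketch: a value that should populate a Long field may arrive
// from the JSON as a String (as the GPU count does), so the conversion needs
// to accept both representations instead of failing on the String case.
def coerceToLong(raw: Any): Either[String, Long] = raw match {
  case l: java.lang.Long    => Right(l.longValue)
  case i: java.lang.Integer => Right(i.longValue)
  case s: String            => s.toLongOption.toRight(s"Cannot convert '$s' to Long")
  case other                => Left(s"Unexpected type for Long field: ${other.getClass.getName}")
}
```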

Updating the cost catalog types to support GPUs. CostCatalogKey, at the top of GcpCostCatalogService.scala, is used to represent the keys of the internal cost catalog map. The MachineType entry of each key was changed to the new ResourceInfo type, which supports having either a MachineType, needed for CPU and RAM keys, or a GpuType, needed for GPU keys. (The cost of a GPU is not affected by the machine type of the machine it is attached to.) Also, the MachineCustomization entry was changed to an Option[MachineCustomization] because GPU keys do not need a machine customization value, as the cost of a GPU is not affected by whether the machine it is attached to is custom or predefined. These updates have many corresponding changes in GcpCostCatalogTypes.scala, which also now includes a Gpu ResourceType.
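A rough sketch of the reshaped key types, with all names illustrative (the real definitions in GcpCostCatalogTypes.scala and GcpCostCatalogService.scala carry more detail and use the project's own types):

```scala
sealed trait ResourceType
case object Cpu extends ResourceType
case object Ram extends ResourceType
case object Gpu extends ResourceType

sealed trait MachineCustomization
case object Custom extends MachineCustomization
case object Predefined extends MachineCustomization

// ResourceInfo replaces the old MachineType-only slot in the key: CPU/RAM keys
// carry a machine type, GPU keys carry a GPU type (GPU cost does not depend on
// the machine type of the machine the GPU is attached to).
sealed trait ResourceInfo
final case class MachineType(value: String) extends ResourceInfo
final case class GpuType(value: String) extends ResourceInfo

final case class CostCatalogKey(
  resourceInfo: ResourceInfo,
  resourceType: ResourceType,
  customization: Option[MachineCustomization], // None for GPU keys
  region: String
)
```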

Getting the relevant GPU SKUs from the Google cost catalog and adding them to our internal cost catalog. This largely consists of the changes to expectedSku and the first apply method of CostCatalogKey in GcpCostCatalogService.scala. Manual testing of other changes that rely on this code shows that it adds at least some of the correct keys/SKUs to the catalog, but I have not yet been able to test this code directly, either manually or with unit tests. We need to make sure that all relevant entries from Google's cost catalog (i.e., all SKUs needed to calculate the cost of GPUs in machine configurations supported by Cromwell) are added to ours, and that no extraneous entries are added.
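At a high level, the population step is a filter-and-key pass over the downloaded SKUs. A toy sketch, with SKUs reduced to their description strings and the catalog key reduced to that description (the real code builds full CostCatalogKeys from GCP Sku objects, and the pattern below is a simplified stand-in for the real expectedSku regex):

```scala
// Toy model of catalog population: keep only SKUs whose descriptions fully
// match the expected pattern, then key them for later lookup.
val expectedSku = "Nvidia Tesla (V100|P100|P4|T4) GPU .*".r

val downloadedSkus = List(
  "Nvidia Tesla T4 GPU running in Americas",
  "Commitment v1: Nvidia Tesla T4 GPU running in Americas", // commitment pricing: excluded
  "N1 Predefined Instance Core running in Americas"         // CPU SKU: not in this toy pattern
)

val gpuCatalog: Map[String, String] =
  downloadedSkus.filter(d => expectedSku.matches(d)).map(d => d -> d).toMap
```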

Converting VM GPU information to a cost catalog key. This allows us to look up the SKU with the same key in our internal cost catalog to determine the GPU cost for a particular VM. This involved changing the second apply method of CostCatalogKey in GcpCostCatalogService.scala. This has passed basic manual testing but needs more extensive manual and unit testing.

Looking up the GPU SKU from our cost catalog corresponding to a VM. Changes for this start in calculateVmCostPerHour and also include lookUpSku, both in GcpCostCatalogService.scala. calculateVmCostPerHour had to be restructured more than I would have liked because of how GPU costs differ from CPU and RAM costs: primarily, a VM does not always have a GPU, and we cannot attempt to look up the GPU SKU in that case. This has passed basic manual testing but needs more extensive manual and unit testing.
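The core of the restructuring can be sketched with Either standing in for ErrorOr (all names here are illustrative): a VM without GPUs must short-circuit to a zero cost without attempting the lookup, while a VM with GPUs must surface any lookup error.

```scala
final case class GpuInfo(count: Long, gpuType: String)

// Sketch only: Either stands in for ErrorOr. A VM without GPUs contributes
// zero cost and never touches the catalog; a VM with GPUs propagates errors
// (e.g. an unrecognized GPU type) from the lookup.
def gpuCostPerHour(
  gpuInfo: Option[GpuInfo],
  lookUpCost: GpuInfo => Either[String, BigDecimal]
): Either[String, BigDecimal] =
  gpuInfo match {
    case None       => Right(BigDecimal(0))
    case Some(info) => lookUpCost(info)
  }
```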

What is NOT done so far:

Getting the GPU information for a VM from Google Batch. The None I added in BatchRequestExecutor.scala will have to be replaced with the GPU value determined from the Batch API. See GetRequestHandler.scala, the corresponding file for PAPI, because what needs to be done is the same; we just need to get the information from the Batch API instead of PAPI this time.

Extracting the cost information from a GPU SKU that was looked up and using this to calculate the actual GPU cost per hour. This should involve implementing the calculateGpuPricePerHour stub I added to GcpCostCatalogService.scala in a similar way to calculateCpuPricePerHour and calculateRamPricePerHour. I already added code in calculateVmCostPerHour in the same file to call this method and incorporate its result into the total VM cost per hour.


Now, let's take a look at manual testing. I tested my changes as much as I had time to, but I was only able to use a single WDL, and therefore a single machine/GPU configuration, for tests of success cases involving GPU use. This is a modified version of the WDL at centaur/src/main/resources/standardTestCases/gpu_on_papi/gpu_cuda_image.wdl, with the gpuCount changed from 1 to 2. We expect 2 Nvidia Tesla T4 GPUs to be used when running this workflow. For testing with no GPUs, pretty much any basic workflow would work, but I have been using one of my workflows from the Methods Repo.

The only unit tests I have written so far are those mentioned earlier in DeserializationSpec.scala, covering one part of getting the GPU information for a VM from PAPI. Many of the manual tests I did should really be replaced with unit tests that serve the same purpose; I just didn't have time to write those unit tests, so I did manual testing to check for obvious bugs in the code I had written. The second test below is the most important one to keep running manually while working on this PR.

Manual testing completed:

  • GPU info for a VM can be determined from PAPI and added to the InstantiatedVmInfo.

InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false)

  • The GPU part of the "Calculated vmCostPerHour" log from calculateVmCostPerHour in GcpCostCatalogService.scala looks correct, other than the GPU cost itself, when running a workflow with GPUs. (The cost per hour is given as 1 because calculateGpuPricePerHour is currently a stub that always returns 1.) This shows that, in this simple case, the GPU information determined from PAPI was converted into a cost catalog key, and the SKU corresponding to that key was successfully added to our internal cost catalog and was able to be looked up correctly based on the key. ("Nvidia Tesla T4 GPU running in Americas" is the description of the SKU that was looked up, and it represents the correct SKU corresponding to the GPUs used in the workflow.)

GPU 1 for 2 GPUs [Some(Nvidia Tesla T4 GPU running in Americas)])

  • The GPU part of the "Calculated vmCostPerHour" log from calculateVmCostPerHour looks correct when running a workflow with no GPUs.

GPU 0 for 0 GPUs [None]

  • Cost calculations failed when I edited the expectedSku regex in GcpCostCatalogService.scala so as not to add Nvidia Tesla T4 SKUs from the Google cost catalog into our internal cost catalog.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false). Failed to look up Gpu SKU for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false)

  • Cost calculations failed when I edited the code to make the InstantiatedVmInfo normally determined from PAPI include a GPU type that is not in the set list of GPU types we support.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(1,fakegpu)),false). Unrecognized GPU type: fakegpu

  • Cost calculations failed when I edited the code to always call lookUpSku even when no GPUs are being used. This shows that the second apply method of CostCatalogKey in GcpCostCatalogService.scala correctly fails if no GPU is being used. The default error message still needs to be updated.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,None,false). None.get

  • Minimal regression testing by making the minimal necessary changes to GcpCostCatalogServiceSpec.scala and checking that all tests still pass.

Testing that should still be completed:

  • Tests similar to the first 2 above with all supported GPU types.
  • Tests involving workflows that split out into multiple tasks using different GPU types and counts.
  • Tests that check the full price calculation for a workflow, not just the cost per hour.
  • Manually verifying that the GPU SKUs we need for all machine configurations supported by Cromwell are actually added to the internal cost catalog. To my understanding, this includes SKUs for the nvidia-tesla-v100, nvidia-tesla-p100, nvidia-tesla-p4, and nvidia-tesla-t4 GPU types, each for preemptible and non-preemptible machines and all possible regions, but not including any commitment pricing SKUs.
  • Manually verifying that no extraneous SKUs are added to the internal cost catalog.
  • Regression testing. This includes verifying that, for CPU and RAM, filtering and adding SKUs to the internal cost catalog, looking up SKUs, and cost calculations have not been affected by any changes in this PR.
  • Unit testing.
  • Consider whether an integration test should be added for cost capping with GPUs, or whether an existing cost capping integration test should be modified to include GPUs if such a test already exists.
    • Potential sources of variability that could make such a test flaky include fluctuations in the amount of time a workflow takes to run, and pricing updates in Google's cost catalog. The former could be controlled by checking only the computed VM cost per hour and not the total price, and the latter by always using the same GCP cost catalog file in the test rather than fetching updated versions from Google.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@sam-schu sam-schu changed the title Cost capping GPU support [AN-144] Cost capping GPU support Dec 12, 2024
@@ -38,28 +39,47 @@ object CostCatalogKey {
final val expectedSku =
(".*?N1 Predefined Instance (Core|Ram) .*|" +
".*?N2 Custom Instance (Core|Ram) .*|" +
".*?N2D AMD Custom Instance (Core|Ram) .*").r
".*?N2D AMD Custom Instance (Core|Ram) .*|" +
"Nvidia Tesla V100 GPU .*|" +
@sam-schu sam-schu Dec 20, 2024

Currently, since we use the findFirstIn function below, omitting the .*? does not stop us from matching SKU descriptions that have text before "Nvidia Tesla." Ideally, provided this does not affect CPU and RAM SKUs, we should change that function to require the full description to match the regex, since the GPU SKUs with extra text at the start are commitment-pricing SKUs we do not need.

We should also verify my conclusion that all the GPU SKUs we need look like "Nvidia Tesla T4 GPU running in Paris" or "Nvidia Tesla P4 GPU attached to Spot Preemptible VMs running in Milan."
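The anchoring concern can be demonstrated directly; the pattern below is a one-line stand-in for the T4 arm of expectedSku:

```scala
// findFirstIn matches the pattern anywhere inside the description, so
// commitment-pricing SKUs with a prefix still slip through; Regex.matches
// requires the whole description to match and rejects them.
val t4Sku = "Nvidia Tesla T4 GPU .*".r

val plain      = "Nvidia Tesla T4 GPU running in Americas"
val commitment = "Commitment v1: Nvidia Tesla T4 GPU running in Americas"

val slipsThrough = t4Sku.findFirstIn(commitment).isDefined // true
val fullMatch    = t4Sku.matches(commitment)               // false
val stillMatches = t4Sku.matches(plain)                    // true
```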

mType,
if (resourceType == Gpu)
for {
gpuInfo <- ErrorOr(instantiatedVmInfo.gpuInfo.get) // TODO: improve error message (default: "None.get")

Currently, if this method is called with instantiated VM info that does not include GPUs, this line will fail with a very cryptic error message. It should fail — we can't get a GPU cost catalog key for a VM that doesn't use a GPU — but we should improve the error message.
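One hedged sketch of the fix (names illustrative; the real code uses ErrorOr/ValidatedNel rather than Either): convert the Option explicitly and attach a descriptive message instead of letting .get throw.

```scala
final case class GpuInfo(count: Long, gpuType: String)

// Instead of ErrorOr(instantiatedVmInfo.gpuInfo.get), which fails with the
// cryptic "None.get", convert the Option with an explicit error message.
def requireGpuInfo(vmDescription: String, gpuInfo: Option[GpuInfo]): Either[String, GpuInfo] =
  gpuInfo.toRight(s"Cannot build a GPU cost catalog key for $vmDescription because the VM has no GPUs")
```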

@@ -116,6 +136,9 @@ object GcpCostCatalogService {
s"Expected usage units of RAM to be 'GiBy.h'. Got ${usageUnit}".invalidNel
}
}

// TODO: implement this
def calculateGpuPricePerHour(gpuSku: Sku, gpuCount: Long): ErrorOr[BigDecimal] = BigDecimal(1).validNel
@sam-schu sam-schu Dec 20, 2024

I did not have time to dig into how calculateCpuPricePerHour and calculateRamPricePerHour work, but I expect that this method's implementation should be similar. At a high level, we need to get the pricing info from the SKU, convert it to cost per GPU per hour, and then multiply by the GPU count to get the total GPU price per hour.
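Speculatively, the stub might end up shaped something like this, with Either in place of ErrorOr. The "h" usage unit and the units/nanos price representation are assumptions to verify against the actual CPU/RAM helpers:

```scala
// Assumed sketch: GCP SKU pricing exposes a unit price as whole currency
// units plus nanos (1e-9 units). If GPU SKUs are priced per GPU-hour, the
// total GPU price per hour is that unit price times the GPU count.
def calculateGpuPricePerHour(
  units: Long,
  nanos: Int,
  usageUnit: String,
  gpuCount: Long
): Either[String, BigDecimal] =
  if (usageUnit != "h")
    Left(s"Expected usage units of GPUs to be 'h'. Got $usageUnit")
  else {
    val pricePerGpu = BigDecimal(units) + BigDecimal(nanos) / BigDecimal(1000000000L)
    Right(pricePerGpu * gpuCount)
  }
```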

@@ -212,23 +235,47 @@ class GcpCostCatalogService(serviceConfig: Config, globalConfig: Config, service
}

// TODO consider caching this, answers won't change until we reload the SKUs
def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] =
for {
def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] = {

This method had to be significantly restructured because, unlike with CPUs and RAM, we need to be able to skip the GPU SKU lookup and cost calculation steps for a VM that does not use GPUs. Note that we can't simply attempt the lookup anyway and treat any error as a GPU cost of 0: if we did, we couldn't distinguish having nothing to look up from a genuine error, such as an unrecognized GPU type.

To the best of my knowledge, this necessitated breaking up the large for-yield block into separate pieces. (It might not have been the most readable anyway if I added more steps to the already large for-yield that existed before.)


for {
cpuPricingInfo <- cpuPricingInfoErrorOr
(cpuSku, coreCount, cpuPricePerHour) = cpuPricingInfo

The downside of breaking up this method into smaller for-yields is that, since the log message at the end incorporates information from nearly every line of every for-yield, we need to box up the results of all the for-yields into ErrorOrs of tuples and then unravel all the data here. It would make me happy if, when I think to check back on this PR a month from now, I see that someone with more Scala knowledge than me was able to figure out a way to make this less messy!

_ = logger.info(
s"Calculated vmCostPerHour of ${totalCost} " +
s"(CPU ${cpuPricePerHour} for ${coreCount} cores [${cpuSku.getDescription}], " +
s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}]) " +
s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}], " +
s"GPU ${gpuPricePerHour} for ${gpuCount} GPUs [${gpuSku.map(_.getDescription)}]) " +
@sam-schu sam-schu Dec 20, 2024

Unlike with the CPU and RAM SKUs, I sent back the GPU SKU from the GPU cost for-yield as an Option[Sku] since we don't always have a GPU SKU. I then used the Option in the log message; we should try to make this more readable.

@@ -94,28 +121,30 @@ object MachineCustomization {
- For non-N1 machines, both custom and predefined SKUs are included, custom ones include "Custom" in their description
strings and predefined SKUs are only identifiable by the absence of "Custom."
*/
def fromSku(sku: Sku): Option[MachineCustomization] = {
def fromCpuOrRamSku(sku: Sku): MachineCustomization = {

It seemed unnecessary to have this method return an Option when the returned value is always a Some, especially when that made the return type inconsistent with fromMachineTypeString. This method should not be used for GPU SKUs.

@@ -80,9 +80,9 @@ class GcpCostCatalogServiceSpec


We need to add unit tests to this file that actually use GPUs, not just make the compiler happy by updating to the new types that support GPUs (as I did).

@@ -145,7 +145,7 @@ object BatchRequestExecutor {
// Get instances that can be created with this AllocationPolicy, only instances[0] is supported
val instancePolicy = allocationPolicy.getInstances(0).getPolicy
val machineType = instancePolicy.getMachineType
val preemtible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
val preemptible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
@sam-schu sam-schu Dec 20, 2024

quick wins :)

@@ -155,7 +155,8 @@ object BatchRequestExecutor {
else
location.split("/").last

val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, preemtible))
// TODO: include GPU info
val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, None, preemptible))

Placeholder until we actually determine the GPU info for VMs from Batch.

// - Log appears repeatedly while task is running
// - Improve formatting of accelerator info
// - Include workflow/task ID?
logger.warn(
@sam-schu sam-schu Dec 20, 2024

There are many reasons why this log message is bad, all stemming from the root cause that it is a placeholder and I was not sure of the best way to handle this case and provide a useful warning. Please consider the notes in the code comment when improving it, but don't limit yourself to them. (Note that, as far as Janet and I know, users should only be able to specify one GPU type per task in their WDL, so this case should never come up unless Google does something very unexpected.)
