
[AN-144] Cost capping GPU support #7672

Draft
wants to merge 10 commits into
base: develop

Conversation

sam-schu

@sam-schu sam-schu commented Dec 12, 2024

Jira Ticket: AN-144

Follows #7583 (removing support for Nvidia Tesla K80 GPUs)

Description

Overview

This PR adds GPU support to cost capping in Cromwell. Specifically, when completed, we should be able to:

  • get information from both PAPI and the Batch API about the GPU(s) being used for a particular VM to run a task,
  • get the SKUs from the GCP cost catalog that correspond to GPUs and associated machine configurations supported by Cromwell and add them to our internal cost catalog,
  • convert VM GPU information to a cost catalog key,
  • use this key to look up the corresponding SKU from the internal cost catalog, and
  • extract the cost information from this SKU and use it to calculate the VM cost.

Note that only the first bullet above involves separate implementations for PAPI and Batch. The implementations for all other bullets should be Google backend-agnostic.

Let's go through what changes I have and have not made yet. More detailed descriptions of what I have and have not tested yet are saved for the following couple of sections.

Please also see the inline GitHub comments I added to this PR to clarify changes and highlight opportunities for improvement.

What is done so far:

Getting the GPU information for a VM from PAPI. This involved updating the logic that converts the Java maps that we get from the Google API into structured objects: the GPU count is stored in JSON as a string but stored in the corresponding object field as a long, and the deserialization code could not handle this case. That change was made in Deserialization.scala and unit tested in DeserializationSpec.scala and StringToNumberDeserializationTestClass.java. The code to actually get the GPU information for the VM is in GetRequestHandler.scala, and was manually tested for simple cases but not unit tested. InstantiatedVmInfo in GcpCostCatalogTypes.scala was also updated to hold the GPU information. The new field gpuInfo is an Option[GpuInfo] because not all VMs use GPUs, unlike with CPUs and RAM. We use our own GpuInfo class so as to be agnostic between the PAPI and Batch backends.
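The deserialization fix described above can be sketched roughly as follows. This is illustrative only, not the actual Deserialization.scala code (which operates on the Java maps returned by the Google API); the helper name is made up:

```scala
// Illustrative sketch: a value that should populate a Long field may arrive
// from the JSON as a String (as the GPU count does), so the conversion needs
// to accept both representations instead of failing on the String case.
def coerceToLong(raw: Any): Either[String, Long] = raw match {
  case l: java.lang.Long    => Right(l.longValue)
  case i: java.lang.Integer => Right(i.longValue)
  case s: String            => s.toLongOption.toRight(s"Cannot convert '$s' to Long")
  case other                => Left(s"Unexpected type for Long field: ${other.getClass.getName}")
}
```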

Updating the cost catalog types to support GPUs. CostCatalogKey, at the top of GcpCostCatalogService.scala, is used to represent the keys of the internal cost catalog map. The MachineType entry of each key was changed to the new ResourceInfo type, which supports having either a MachineType, needed for CPU and RAM keys, or a GpuType, needed for GPU keys. (The cost of a GPU is not affected by the machine type of the machine it is attached to.) Also, the MachineCustomization entry was changed to an Option[MachineCustomization] because GPU keys do not need a machine customization value, as the cost of a GPU is not affected by whether the machine it is attached to is custom or predefined. These updates have many corresponding changes in GcpCostCatalogTypes.scala, which also now includes a Gpu ResourceType.
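A rough sketch of the reshaped key types, with all names illustrative (the real definitions in GcpCostCatalogTypes.scala and GcpCostCatalogService.scala carry more detail and use the project's own types):

```scala
sealed trait ResourceType
case object Cpu extends ResourceType
case object Ram extends ResourceType
case object Gpu extends ResourceType

sealed trait MachineCustomization
case object Custom extends MachineCustomization
case object Predefined extends MachineCustomization

// ResourceInfo replaces the old MachineType-only slot in the key: CPU/RAM keys
// carry a machine type, GPU keys carry a GPU type (GPU cost does not depend on
// the machine type of the machine the GPU is attached to).
sealed trait ResourceInfo
final case class MachineType(value: String) extends ResourceInfo
final case class GpuType(value: String) extends ResourceInfo

final case class CostCatalogKey(
  resourceInfo: ResourceInfo,
  resourceType: ResourceType,
  customization: Option[MachineCustomization], // None for GPU keys
  region: String
)
```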

Getting the relevant GPU SKUs from the Google cost catalog and adding them to our internal cost catalog. This largely consists of the changes to expectedSku and the first apply method of CostCatalogKey in GcpCostCatalogService.scala. Manual testing of other changes that rely on this code shows that it adds at least some of the correct keys/SKUs to the catalog, but I have not yet been able to test this code directly, either manually or with unit tests. We need to make sure that all relevant entries from Google's cost catalog (i.e., all SKUs needed to calculate the cost of GPUs in machine configurations supported by Cromwell) are added to ours, and that no extraneous entries are added.
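At a high level, the population step is a filter-and-key pass over the downloaded SKUs. A toy sketch, with SKUs reduced to their description strings and the catalog key reduced to that description (the real code builds full CostCatalogKeys from GCP Sku objects, and the pattern below is a simplified stand-in for the real expectedSku regex):

```scala
// Toy model of catalog population: keep only SKUs whose descriptions fully
// match the expected pattern, then key them for later lookup.
val expectedSku = "Nvidia Tesla (V100|P100|P4|T4) GPU .*".r

val downloadedSkus = List(
  "Nvidia Tesla T4 GPU running in Americas",
  "Commitment v1: Nvidia Tesla T4 GPU running in Americas", // commitment pricing: excluded
  "N1 Predefined Instance Core running in Americas"         // CPU SKU: not in this toy pattern
)

val gpuCatalog: Map[String, String] =
  downloadedSkus.filter(d => expectedSku.matches(d)).map(d => d -> d).toMap
```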

Converting VM GPU information to a cost catalog key. This allows us to look up the SKU with the same key in our internal cost catalog to determine the GPU cost for a particular VM. This involved changing the second apply method of CostCatalogKey in GcpCostCatalogService.scala. This has passed basic manual testing but needs more extensive manual and unit testing.

Looking up the GPU SKU from our cost catalog corresponding to a VM. Changes for this start in calculateVmCostPerHour and also include lookUpSku, both in GcpCostCatalogService.scala. calculateVmCostPerHour had to be restructured more than I would have liked because of how GPU costs differ from CPU and RAM costs: primarily, a VM does not always have a GPU, and we cannot attempt to look up the GPU SKU in that case. This has passed basic manual testing but needs more extensive manual and unit testing.
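The core of the restructuring can be sketched with Either standing in for ErrorOr (all names here are illustrative): a VM without GPUs must short-circuit to a zero cost without attempting the lookup, while a VM with GPUs must surface any lookup error.

```scala
final case class GpuInfo(count: Long, gpuType: String)

// Sketch only: Either stands in for ErrorOr. A VM without GPUs contributes
// zero cost and never touches the catalog; a VM with GPUs propagates errors
// (e.g. an unrecognized GPU type) from the lookup.
def gpuCostPerHour(
  gpuInfo: Option[GpuInfo],
  lookUpCost: GpuInfo => Either[String, BigDecimal]
): Either[String, BigDecimal] =
  gpuInfo match {
    case None       => Right(BigDecimal(0))
    case Some(info) => lookUpCost(info)
  }
```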

What is NOT done so far:

Getting the GPU information for a VM from Google Batch. The None I added in BatchRequestExecutor.scala will have to be replaced with the GPU value determined from the Batch API. See GetRequestHandler.scala, the corresponding file for PAPI, because what needs to be done is the same; we just need to get the information from the Batch API instead of PAPI this time.

Extracting the cost information from a GPU SKU that was looked up and using this to calculate the actual GPU cost per hour. This should involve implementing the calculateGpuPricePerHour stub I added to GcpCostCatalogService.scala in a similar way to calculateCpuPricePerHour and calculateRamPricePerHour. I already added code in calculateVmCostPerHour in the same file to call this method and incorporate its result into the total VM cost per hour.


Now, let's take a look at manual testing. I tested my changes as much as I had time to, but I was only able to use a single WDL, and therefore a single machine/GPU configuration, for tests of success cases involving GPU use. This is a modified version of the WDL at centaur/src/main/resources/standardTestCases/gpu_on_papi/gpu_cuda_image.wdl, with the gpuCount changed from 1 to 2. We expect 2 Nvidia Tesla T4 GPUs to be used when running this workflow. For testing with no GPUs, pretty much any basic workflow would work, but I have been using one of my workflows from the Methods Repo.

The only unit tests I have written so far are those mentioned earlier in DeserializationSpec.scala, covering one part of getting the GPU information for a VM from PAPI. Many of the manual tests I did should really be replaced with unit tests that serve the same purpose; I just didn't have time to write those unit tests, so I did manual testing to check for obvious bugs in the code I had written. The second test below is the most important one to keep running manually while working on this PR.

Manual testing completed:

  • GPU info for a VM can be determined from PAPI and added to the InstantiatedVmInfo.

InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false)

  • The GPU part of the "Calculated vmCostPerHour" log from calculateVmCostPerHour in GcpCostCatalogService.scala looks correct, other than the GPU cost itself, when running a workflow with GPUs. (The cost per hour is given as 1 because calculateGpuPricePerHour is currently a stub that always returns 1.) This shows that, in this simple case, the GPU information determined from PAPI was converted into a cost catalog key, and the SKU corresponding to that key was successfully added to our internal cost catalog and was able to be looked up correctly based on the key. ("Nvidia Tesla T4 GPU running in Americas" is the description of the SKU that was looked up, and it represents the correct SKU corresponding to the GPUs used in the workflow.)

GPU 1 for 2 GPUs [Some(Nvidia Tesla T4 GPU running in Americas)])

  • The GPU part of the "Calculated vmCostPerHour" log from calculateVmCostPerHour looks correct when running a workflow with no GPUs.

GPU 0 for 0 GPUs [None]

  • Cost calculations failed when I edited the expectedSku regex in GcpCostCatalogService.scala so as not to add Nvidia Tesla T4 SKUs from the Google cost catalog into our internal cost catalog.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false). Failed to look up Gpu SKU for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(2,nvidia-tesla-t4)),false)

  • Cost calculations failed when I edited the code to make the InstantiatedVmInfo normally determined from PAPI include a GPU type that is not in the set list of GPU types we support.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,Some(GpuInfo(1,fakegpu)),false). Unrecognized GPU type: fakegpu

  • Cost calculations failed when I edited the code to always call lookUpSku even when no GPUs are being used. This shows that the second apply method of CostCatalogKey in GcpCostCatalogService.scala correctly fails if no GPU is being used. The default error message still needs to be updated.

Failed to calculate VM cost per hour for InstantiatedVmInfo(us-central1,custom-1-2048,None,false). None.get

  • Minimal regression testing by making the minimal necessary changes to GcpCostCatalogServiceSpec.scala and checking that all tests still pass.

Testing that should still be completed:

  • Tests similar to the first 2 above with all supported GPU types.
  • Tests involving workflows that split out into multiple tasks using different GPU types and counts.
  • Tests that check the full price calculation for a workflow, not just the cost per hour.
  • Manually verifying that the GPU SKUs we need for all machine configurations supported by Cromwell are actually added to the internal cost catalog. To my understanding, this includes SKUs for the nvidia-tesla-v100, nvidia-tesla-p100, nvidia-tesla-p4, and nvidia-tesla-t4 GPU types, each for preemptible and non-preemptible machines and all possible regions, but not including any commitment pricing SKUs.
  • Manually verifying that no extraneous SKUs are added to the internal cost catalog.
  • Regression testing. This includes verifying that, for CPU and RAM, filtering and adding SKUs to the internal cost catalog, looking up SKUs, and cost calculations have not been affected by any changes in this PR.
  • Unit testing.
  • Consider whether an integration test should be added for cost capping with GPUs, or whether an existing cost capping integration test should be modified to include GPUs if such a test already exists.
    • Potential sources of variability that could make such a test flaky include fluctuations in the amount of time a workflow takes to run, and pricing updates in Google's cost catalog. The former could be controlled by checking only the computed VM cost per hour and not the total price, and the latter by always using the same GCP cost catalog file in the test rather than fetching updated versions from Google.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@sam-schu sam-schu changed the title Cost capping GPU support [AN-144] Cost capping GPU support Dec 12, 2024
@@ -38,28 +39,47 @@ object CostCatalogKey {
final val expectedSku =
(".*?N1 Predefined Instance (Core|Ram) .*|" +
".*?N2 Custom Instance (Core|Ram) .*|" +
".*?N2D AMD Custom Instance (Core|Ram) .*").r
".*?N2D AMD Custom Instance (Core|Ram) .*|" +
"Nvidia Tesla V100 GPU .*|" +
@sam-schu sam-schu Dec 20, 2024

Currently, since we use the findFirstIn function below, omitting the .*? does not stop us from matching SKU descriptions that have text before "Nvidia Tesla." Ideally, provided this does not affect CPU and RAM SKUs, we should change that function to require the full description to match the regex, since the GPU SKUs with extra text at the start are commitment-pricing SKUs we do not need.

We should also verify my conclusion that all the GPU SKUs we need look like "Nvidia Tesla T4 GPU running in Paris" or "Nvidia Tesla P4 GPU attached to Spot Preemptible VMs running in Milan."
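The anchoring concern can be demonstrated directly; the pattern below is a one-line stand-in for the T4 arm of expectedSku:

```scala
// findFirstIn matches the pattern anywhere inside the description, so
// commitment-pricing SKUs with a prefix still slip through; Regex.matches
// requires the whole description to match and rejects them.
val t4Sku = "Nvidia Tesla T4 GPU .*".r

val plain      = "Nvidia Tesla T4 GPU running in Americas"
val commitment = "Commitment v1: Nvidia Tesla T4 GPU running in Americas"

val slipsThrough = t4Sku.findFirstIn(commitment).isDefined // true
val fullMatch    = t4Sku.matches(commitment)               // false
val stillMatches = t4Sku.matches(plain)                    // true
```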

mType,
if (resourceType == Gpu)
for {
gpuInfo <- ErrorOr(instantiatedVmInfo.gpuInfo.get) // TODO: improve error message (default: "None.get")

Currently, if this method is called with instantiated VM info that does not include GPUs, this line will fail with a very cryptic error message. It should fail — we can't get a GPU cost catalog key for a VM that doesn't use a GPU — but we should improve the error message.
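One hedged sketch of the fix (names illustrative; the real code uses ErrorOr/ValidatedNel rather than Either): convert the Option explicitly and attach a descriptive message instead of letting .get throw.

```scala
final case class GpuInfo(count: Long, gpuType: String)

// Instead of ErrorOr(instantiatedVmInfo.gpuInfo.get), which fails with the
// cryptic "None.get", convert the Option with an explicit error message.
def requireGpuInfo(vmDescription: String, gpuInfo: Option[GpuInfo]): Either[String, GpuInfo] =
  gpuInfo.toRight(s"Cannot build a GPU cost catalog key for $vmDescription because the VM has no GPUs")
```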

@@ -116,6 +136,9 @@ object GcpCostCatalogService {
s"Expected usage units of RAM to be 'GiBy.h'. Got ${usageUnit}".invalidNel
}
}

// TODO: implement this
def calculateGpuPricePerHour(gpuSku: Sku, gpuCount: Long): ErrorOr[BigDecimal] = BigDecimal(1).validNel
@sam-schu sam-schu Dec 20, 2024

I did not have time to dig into how calculateCpuPricePerHour and calculateRamPricePerHour work, but I expect that this method's implementation should be similar. At a high level, we need to get the pricing info from the SKU, convert it to cost per GPU per hour, and then multiply by the GPU count to get the total GPU price per hour.
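Speculatively, the stub might end up shaped something like this, with Either in place of ErrorOr. The "h" usage unit and the units/nanos price representation are assumptions to verify against the actual CPU/RAM helpers:

```scala
// Assumed sketch: GCP SKU pricing exposes a unit price as whole currency
// units plus nanos (1e-9 units). If GPU SKUs are priced per GPU-hour, the
// total GPU price per hour is that unit price times the GPU count.
def calculateGpuPricePerHour(
  units: Long,
  nanos: Int,
  usageUnit: String,
  gpuCount: Long
): Either[String, BigDecimal] =
  if (usageUnit != "h")
    Left(s"Expected usage units of GPUs to be 'h'. Got $usageUnit")
  else {
    val pricePerGpu = BigDecimal(units) + BigDecimal(nanos) / BigDecimal(1000000000L)
    Right(pricePerGpu * gpuCount)
  }
```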

@@ -212,23 +235,47 @@ class GcpCostCatalogService(serviceConfig: Config, globalConfig: Config, service
}

// TODO consider caching this, answers won't change until we reload the SKUs
def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] =
for {
def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] = {

This method had to be significantly restructured because, unlike with CPUs and RAM, we need to be able to skip the GPU SKU lookup and cost calculation steps for a VM that does not use GPUs. Note that we can't simply attempt the lookup anyway and treat any error as a GPU cost of 0: if we did, we couldn't distinguish having nothing to look up from a genuine error, such as an unrecognized GPU type.

To the best of my knowledge, this necessitated breaking up the large for-yield block into separate pieces. (It might not have been the most readable anyway if I added more steps to the already large for-yield that existed before.)


for {
cpuPricingInfo <- cpuPricingInfoErrorOr
(cpuSku, coreCount, cpuPricePerHour) = cpuPricingInfo

The downside of breaking up this method into smaller for-yields is that, since the log message at the end incorporates information from nearly every line of every for-yield, we need to box up the results of all the for-yields into ErrorOrs of tuples and then unravel all the data here. It would make me happy if, when I think to check back on this PR a month from now, I see that someone with more Scala knowledge than me was able to figure out a way to make this less messy!

_ = logger.info(
s"Calculated vmCostPerHour of ${totalCost} " +
s"(CPU ${cpuPricePerHour} for ${coreCount} cores [${cpuSku.getDescription}], " +
s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}]) " +
s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}], " +
s"GPU ${gpuPricePerHour} for ${gpuCount} GPUs [${gpuSku.map(_.getDescription)}]) " +
@sam-schu sam-schu Dec 20, 2024

Unlike with the CPU and RAM SKUs, I sent back the GPU SKU from the GPU cost for-yield as an Option[Sku] since we don't always have a GPU SKU. I then used the Option in the log message; we should try to make this more readable.

@@ -94,28 +121,30 @@ object MachineCustomization {
- For non-N1 machines, both custom and predefined SKUs are included, custom ones include "Custom" in their description
strings and predefined SKUs are only identifiable by the absence of "Custom."
*/
def fromSku(sku: Sku): Option[MachineCustomization] = {
def fromCpuOrRamSku(sku: Sku): MachineCustomization = {

It seemed unnecessary to have this method return an Option when the returned value is always a Some, especially when that made the return type inconsistent with fromMachineTypeString. This method should not be used for GPU SKUs.

@@ -80,9 +80,9 @@ class GcpCostCatalogServiceSpec


We need to add unit tests to this file that actually use GPUs, not just make the compiler happy by updating to the new types that support GPUs (as I did).

@@ -145,7 +145,7 @@ object BatchRequestExecutor {
// Get instances that can be created with this AllocationPolicy, only instances[0] is supported
val instancePolicy = allocationPolicy.getInstances(0).getPolicy
val machineType = instancePolicy.getMachineType
val preemtible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
val preemptible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
@sam-schu sam-schu Dec 20, 2024

quick wins :)

@@ -155,7 +155,8 @@ object BatchRequestExecutor {
else
location.split("/").last

val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, preemtible))
// TODO: include GPU info
val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, None, preemptible))

Placeholder until we actually determine the GPU info for VMs from Batch.

// - Log appears repeatedly while task is running
// - Improve formatting of accelerator info
// - Include workflow/task ID?
logger.warn(
@sam-schu sam-schu Dec 20, 2024

There are many reasons why this log message is bad, all stemming from the root cause that it is a placeholder and I was not sure of the best way to handle this case and provide a useful warning. Please consider the notes in the code comment when improving it, but don't limit yourself to them. (Note that, as far as Janet and I know, users should only be able to specify one GPU type per task in their WDL, so this case should never come up unless Google does something very unexpected.)
