[AN-144] Cost capping GPU support #7672
base: develop
Conversation
@@ -38,28 +39,47 @@ object CostCatalogKey {
  final val expectedSku =
    (".*?N1 Predefined Instance (Core|Ram) .*|" +
      ".*?N2 Custom Instance (Core|Ram) .*|" +
      ".*?N2D AMD Custom Instance (Core|Ram) .*").r
      ".*?N2D AMD Custom Instance (Core|Ram) .*|" +
      "Nvidia Tesla V100 GPU .*|" +
Currently, since we use the `findFirstIn` function below, omitting the `.*?` does not stop us from matching SKU descriptions that have text before "Nvidia Tesla." Really, as long as this does not affect CPU and RAM SKUs, we should change that function to require the full description to match the regex, because those GPU SKUs with extra text at the start are for commitment pricing and are not needed.

We should also verify my conclusion that all the GPU SKUs we need look like "Nvidia Tesla T4 GPU running in Paris" or "Nvidia Tesla P4 GPU attached to Spot Preemptible VMs running in Milan."
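A minimal sketch of the difference, assuming we have the SKU description as a plain string (the regex and `description` value below are made up for the example):

```scala
// Sketch only: substring match vs. requiring the full description to match.
val expectedSku = ("Nvidia Tesla V100 GPU .*|" +
  "Nvidia Tesla T4 GPU .*").r

val description = "Commitment v1: Nvidia Tesla T4 GPU running in Americas"

// findFirstIn matches a substring, so this commitment-pricing SKU slips through.
val substringHit = expectedSku.findFirstIn(description).isDefined // true

// Requiring the whole description to match rejects it.
// (Regex.matches needs Scala 2.13; expectedSku.pattern.matcher(description).matches() is the older equivalent.)
val fullHit = expectedSku.matches(description) // false
```

Since the CPU/RAM alternatives keep their leading `.*?`, requiring a full match shouldn't change their behavior, but that's worth confirming.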
            mType,
        if (resourceType == Gpu)
          for {
            gpuInfo <- ErrorOr(instantiatedVmInfo.gpuInfo.get) // TODO: improve error message (default: "None.get")
Currently, if this method is called with instantiated VM info that does not include GPUs, this line will fail with a very cryptic error message. It should fail — we can't get a GPU cost catalog key for a VM that doesn't use a GPU — but we should improve the error message.
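One option, sketched with cats' `toValidNel` (assuming the `ErrorOr` alias and the identifiers already in scope in this method; the message wording is just a placeholder):

```scala
import cats.syntax.option._
import common.validation.ErrorOr.ErrorOr // Cromwell's alias for ValidatedNel[String, A]

// Sketch: fail with a descriptive message instead of the default "None.get".
val gpuInfoErrorOr: ErrorOr[GpuInfo] =
  instantiatedVmInfo.gpuInfo.toValidNel(
    s"Could not create a GPU cost catalog key because no GPU info was found for this VM " +
      s"(machine type ${instantiatedVmInfo.machineType}, region ${instantiatedVmInfo.region})"
  )
```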
@@ -116,6 +136,9 @@ object GcpCostCatalogService {
        s"Expected usage units of RAM to be 'GiBy.h'. Got ${usageUnit}".invalidNel
      }
  }

  // TODO: implement this
  def calculateGpuPricePerHour(gpuSku: Sku, gpuCount: Long): ErrorOr[BigDecimal] = BigDecimal(1).validNel
I did not have time to dig into how `calculateCpuPricePerHour` and `calculateRamPricePerHour` work, but I expect that this method's implementation should be similar. At a high level, we need to get the pricing info from the SKU, convert it to cost per GPU per hour, and then multiply by the GPU count to get the total GPU price per hour.
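For what it's worth, here is roughly the shape I'd expect, by analogy with the description above. The usage-unit check ("h"), the choice of pricing info, and the tiered-rate access are assumptions that should be checked against the CPU/RAM helpers rather than taken as the real implementation:

```scala
import scala.jdk.CollectionConverters._
import cats.syntax.validated._

// Sketch only: pricing info -> price per GPU-hour -> multiply by GPU count.
def calculateGpuPricePerHour(gpuSku: Sku, gpuCount: Long): ErrorOr[BigDecimal] = {
  val pricingInfo = gpuSku.getPricingInfoList.asScala.last // assumption: last entry is the most recent
  val pricingExpression = pricingInfo.getPricingExpression
  val usageUnit = pricingExpression.getUsageUnit
  if (usageUnit == "h") {
    val unitPrice = pricingExpression.getTieredRates(0).getUnitPrice // a com.google.type.Money
    // Nanos are billionths of the currency unit, per the Money proto.
    val pricePerGpuHour = BigDecimal(unitPrice.getUnits) + (BigDecimal(unitPrice.getNanos) / BigDecimal(1000000000))
    (pricePerGpuHour * BigDecimal(gpuCount)).validNel
  } else {
    s"Expected usage units of GPUs to be 'h'. Got ${usageUnit}".invalidNel
  }
}
```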
@@ -212,23 +235,47 @@ class GcpCostCatalogService(serviceConfig: Config, globalConfig: Config, service
  }

  // TODO consider caching this, answers won't change until we reload the SKUs
  def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] =
    for {
  def calculateVmCostPerHour(instantiatedVmInfo: InstantiatedVmInfo): ErrorOr[BigDecimal] = {
This method had to be significantly restructured because, unlike with CPUs and RAM, we need to be able to skip the GPU SKU lookup and cost calculation steps for a VM that does not use GPUs. Note that we can't just attempt the lookup anyway and report the GPU cost as 0 on error: we need to distinguish between having nothing to look up and hitting a genuine error, for example because the GPU type is not recognized.

To the best of my knowledge, this necessitated breaking up the large for-yield block into separate pieces. (It might not have been the most readable anyway if I had added more steps to the already large for-yield that existed before.)
    for {
      cpuPricingInfo <- cpuPricingInfoErrorOr
      (cpuSku, coreCount, cpuPricePerHour) = cpuPricingInfo
The downside of breaking up this method into smaller for-yields is that, since the log message at the end incorporates information from nearly every line of every for-yield, we need to box up the results of all the for-yields into `ErrorOr`s of tuples and then unravel all the data here. It would make me happy if, when I check back on this PR a month from now, I see that someone with more Scala knowledge than me has figured out a way to make this less messy!
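One possible direction for whoever picks this up: carry each resource's results in a small case class and combine the three `ErrorOr`s with `mapN`, so the final step doesn't destructure tuples. Everything named below is hypothetical:

```scala
import cats.syntax.apply._ // for mapN

// Hypothetical sketch: one value per resource instead of tuples.
final case class ResourceCost(skuDescription: String, quantity: BigDecimal, pricePerHour: BigDecimal)

// cpuCostErrorOr, ramCostErrorOr: ErrorOr[ResourceCost]
// gpuCostErrorOr: ErrorOr[Option[ResourceCost]] (None when the VM has no GPUs)
val vmCostPerHour: ErrorOr[BigDecimal] =
  (cpuCostErrorOr, ramCostErrorOr, gpuCostErrorOr).mapN { (cpu, ram, gpuOpt) =>
    val gpuPrice = gpuOpt.map(_.pricePerHour).getOrElse(BigDecimal(0))
    val totalCost = cpu.pricePerHour + ram.pricePerHour + gpuPrice
    logger.info(
      s"Calculated vmCostPerHour of ${totalCost} " +
        s"(CPU ${cpu.pricePerHour} for ${cpu.quantity} cores [${cpu.skuDescription}], " +
        s"RAM ${ram.pricePerHour} for ${ram.quantity} Gb [${ram.skuDescription}], " +
        s"GPU ${gpuPrice} for ${gpuOpt.map(_.quantity).getOrElse(BigDecimal(0))} GPUs)"
    )
    totalCost
  }
```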
      _ = logger.info(
        s"Calculated vmCostPerHour of ${totalCost} " +
          s"(CPU ${cpuPricePerHour} for ${coreCount} cores [${cpuSku.getDescription}], " +
          s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}]) " +
          s"RAM ${ramPricePerHour} for ${ramGbCount} Gb [${ramSku.getDescription}], " +
          s"GPU ${gpuPricePerHour} for ${gpuCount} GPUs [${gpuSku.map(_.getDescription)}]) " +
Unlike with the CPU and RAM SKUs, I sent back the GPU SKU from the GPU cost for-yield as an `Option[Sku]`, since we don't always have a GPU SKU. I then used the Option in the log message; we should try to make this more readable.
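For example, something along these lines might read better (a sketch only):

```scala
// Render the optional GPU SKU as a plain string once, then interpolate that,
// instead of printing the Option (which shows up as "Some(...)"/"None" in the log).
val gpuSkuDescription = gpuSku.map(_.getDescription).getOrElse("no GPU SKU")
// ... s"GPU ${gpuPricePerHour} for ${gpuCount} GPUs [${gpuSkuDescription}]) " ...
```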
@@ -94,28 +121,30 @@ object MachineCustomization {
   - For non-N1 machines, both custom and predefined SKUs are included, custom ones include "Custom" in their description
     strings and predefined SKUs are only identifiable by the absence of "Custom."
   */
  def fromSku(sku: Sku): Option[MachineCustomization] = {
  def fromCpuOrRamSku(sku: Sku): MachineCustomization = {
It seemed unnecessary to have this method return an `Option` when the returned value is always a `Some`, especially when that made the return type inconsistent with `fromMachineTypeString`. This method should not be used for GPU SKUs.
@@ -80,9 +80,9 @@ class GcpCostCatalogServiceSpec
We need to add unit tests to this file that actually use GPUs, not just make the compiler happy by updating to the new types that support GPUs (as I did).
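Something roughly like the following is what I have in mind; the fixture values, the `GpuInfo` constructor shape, and the service/helper names are guesses to adapt to the real spec setup:

```scala
// Hypothetical ScalaTest sketch of a GPU test case.
it should "calculate a cost per hour for a VM that uses GPUs" in {
  val vmInfo = InstantiatedVmInfo(
    "us-central1",                         // region
    "n1-standard-4",                       // machine type
    Option(GpuInfo(2, "nvidia-tesla-t4")), // gpuInfo (constructor shape assumed)
    false                                  // preemptible
  )
  val costPerHour = testGcpCostCatalogService.calculateVmCostPerHour(vmInfo)
  costPerHour.isValid shouldBe true
  // Once calculateGpuPricePerHour is implemented, assert on the expected GPU contribution here.
}
```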
@@ -145,7 +145,7 @@ object BatchRequestExecutor {
      // Get instances that can be created with this AllocationPolicy, only instances[0] is supported
      val instancePolicy = allocationPolicy.getInstances(0).getPolicy
      val machineType = instancePolicy.getMachineType
      val preemtible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
      val preemptible = instancePolicy.getProvisioningModelValue == ProvisioningModel.PREEMPTIBLE.getNumber
quick wins :)
@@ -155,7 +155,8 @@ object BatchRequestExecutor {
        else
          location.split("/").last

      val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, preemtible))
      // TODO: include GPU info
      val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, None, preemptible))
Placeholder until we actually determine the GPU info for VMs from Batch.
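For the follow-up, the Batch `InstancePolicy` exposes the requested accelerators, which looks like it could populate this. A rough, untested sketch (the `GpuInfo` constructor shape is assumed, and handling of multiple accelerator entries is left open):

```scala
import scala.jdk.CollectionConverters._

// Sketch: read accelerator type/count from the Batch InstancePolicy, if any.
val gpuInfo = instancePolicy.getAcceleratorsList.asScala.headOption.map { accelerator =>
  GpuInfo(accelerator.getCount, accelerator.getType) // constructor shape assumed
}

val instantiatedVmInfo = Some(InstantiatedVmInfo(region, machineType, gpuInfo, preemptible))
```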
      // - Log appears repeatedly while task is running
      // - Improve formatting of accelerator info
      // - Include workflow/task ID?
      logger.warn(
There are many reasons why this log message is bad, all stemming from the root cause that it is a placeholder: I was not sure of the best way to handle this case and provide a useful warning message. Please consider, but do not limit yourself to, the notes in the code comment when improving it. (Note that users should only be able to specify one GPU type per task in their WDL, as far as Janet and I know, so this case should never come up unless Google does something very unexpected.)
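As one possible shape for the message (not a recommendation of the final wording; `jobName` and `accelerators` are placeholders for whatever identifiers are actually in scope here):

```scala
// Hypothetical sketch: include the job identifier and a compact rendering of the accelerator list.
logger.warn(
  s"Expected at most one GPU type for Batch job ${jobName}, but found: " +
    s"${accelerators.map(a => s"${a.getCount} x ${a.getType}").mkString(", ")}. " +
    "This is unexpected and GPU cost reporting for this job may be unreliable."
)
```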
Jira Ticket: AN-144
Follows #7583 (removing support for Nvidia Tesla K80 GPUs)
Description
Overview
This PR adds GPU support to cost capping in Cromwell. Specifically, when completed, we should be able to:
- Determine the GPU information (type and count) for a VM from PAPI or from Google Batch.
- Convert that GPU information into a cost catalog key and look up the corresponding GPU SKU in our internal cost catalog.
- Calculate the GPU cost per hour from that SKU and include it in the total VM cost per hour.

Note that only the first bullet above involves separate implementations for PAPI and Batch. The implementations for all other bullets should be Google backend-agnostic.
Let's go through what changes I have and have not made yet. More detailed descriptions of what I have and have not tested yet are saved for the following couple of sections.
Please also see the inline GitHub comments I added to this PR to clarify changes and highlight opportunities for improvement.
What is done so far:
- Getting the GPU information for a VM from PAPI. This involved updating the logic that converts the Java maps that we get from the Google API into structured objects: the GPU count is stored in JSON as a string but stored in the corresponding object field as a long, and the deserialization code could not handle this case. That change was made in `Deserialization.scala` and unit tested in `DeserializationSpec.scala` and `StringToNumberDeserializationTestClass.java`. The code to actually get the GPU information for the VM is in `GetRequestHandler.scala`, and was manually tested for simple cases but not unit tested.
- `InstantiatedVmInfo` in `GcpCostCatalogTypes.scala` was also updated to hold the GPU information. The new field `gpuInfo` is an `Option[GpuInfo]` because not all VMs use GPUs, unlike with CPUs and RAM. We use our own `GpuInfo` class so as to be agnostic between the PAPI and Batch backends.
- Updating the cost catalog types to support GPUs. `CostCatalogKey`, at the top of `GcpCostCatalogService.scala`, is used to represent the keys of the internal cost catalog map. The `MachineType` entry of each key was changed to the new `ResourceInfo` type, which supports having either a `MachineType`, needed for CPU and RAM keys, or a `GpuType`, needed for GPU keys. (The cost of a GPU is not affected by the machine type of the machine it is attached to.) Also, the `MachineCustomization` entry was changed to an `Option[MachineCustomization]` because GPU keys do not need a machine customization value, as the cost of a GPU is not affected by whether the machine it is attached to is custom or predefined. These updates have many corresponding changes in `GcpCostCatalogTypes.scala`, which also now includes a `Gpu` `ResourceType`. (A rough sketch of the updated types is included at the end of this description.)
- Getting the relevant GPU SKUs from the Google cost catalog and adding them to our internal cost catalog. This largely consists of the changes to `expectedSku` and the first `apply` method of `CostCatalogKey` in `GcpCostCatalogService.scala`. Manual testing of other changes that rely on this code shows that it at least adds some of the correct keys/SKUs to the catalog, but I was not able to directly manually or unit test this code at all yet. We need to make sure that all relevant entries from Google's cost catalog (i.e., all SKUs that we need to calculate the cost of GPUs in machine configurations supported by Cromwell) are added to ours, and that no extraneous entries are added.
- Converting VM GPU information to a cost catalog key. This allows us to look up the SKU with the same key in our internal cost catalog to determine the GPU cost for a particular VM. This involved changing the second `apply` method of `CostCatalogKey` in `GcpCostCatalogService.scala`. This has passed basic manual testing but needs more extensive manual and unit testing.
- Looking up the GPU SKU from our cost catalog corresponding to a VM. Changes for this start in `calculateVmCostPerHour` and also include `lookUpSku`, both of which are in `GcpCostCatalogService.scala`. `calculateVmCostPerHour` had to be restructured more than I would wish because of the difference in how GPU costs work compared to CPU and RAM costs — primarily, that a VM does not always have a GPU, and we cannot attempt to look up the GPU SKU in this case. This has passed basic manual testing but needs more extensive manual and unit testing.

What is NOT done so far:
- Getting the GPU information for a VM from Google Batch. The `None` I added in `BatchRequestExecutor.scala` will have to be replaced with the GPU value determined from the Batch API. See `GetRequestHandler.scala`, the corresponding file for PAPI, because what needs to be done is the same; we just need to get the information from the Batch API instead of PAPI this time.
- Extracting the cost information from a GPU SKU that was looked up and using this to calculate the actual GPU cost per hour. This should involve implementing the `calculateGpuPricePerHour` stub I added to `GcpCostCatalogService.scala` in a similar way to `calculateCpuPricePerHour` and `calculateRamPricePerHour`. I already added code in `calculateVmCostPerHour` in the same file to call this method and incorporate its result into the total VM cost per hour.

Now, let's take a look at manual testing. I tested my changes as much as I had time to, but I was only able to use a single WDL, and therefore a single machine/GPU configuration, for tests of success cases involving GPU use. This is a modified version of the WDL at `centaur/src/main/resources/standardTestCases/gpu_on_papi/gpu_cuda_image.wdl`, with the `gpuCount` changed from 1 to 2. We expect 2 Nvidia Tesla T4 GPUs to be used when running this workflow. For testing with no GPUs, pretty much any basic workflow would work, but I have been using one of my workflows from the Methods Repo.

The only unit tests I have written were those mentioned earlier in `DeserializationSpec.scala` for one part of getting the GPU information for a VM from PAPI. Many of the manual tests that I did should really be replaced with unit tests that serve the same purpose; I just didn't have time to write those unit tests, so I did manual testing to check for any obvious bugs in the code I had written. The second test below is the most important to continue running as a manual test while working on this PR.

Manual testing completed:
- Confirmed that the GPU information determined from PAPI is correctly captured in the `InstantiatedVmInfo`.
- The output of `calculateVmCostPerHour` in `GcpCostCatalogService.scala` looks correct, other than the GPU cost itself, when running a workflow with GPUs. (The cost per hour is given as 1 because `calculateGpuPricePerHour` is currently a stub that always returns 1.) This shows that, in this simple case, the GPU information determined from PAPI was converted into a cost catalog key, and the SKU corresponding to that key was successfully added to our internal cost catalog and was able to be looked up correctly based on the key. ("Nvidia Tesla T4 GPU running in Americas" is the description of the SKU that was looked up, and it represents the correct SKU corresponding to the GPUs used in the workflow.)
- The output of `calculateVmCostPerHour` looks correct when running a workflow with no GPUs.
- Temporarily modified the `expectedSku` regex in `GcpCostCatalogService.scala` so as not to add Nvidia Tesla T4 SKUs from the Google cost catalog into our internal cost catalog, to exercise the failure case where a needed GPU SKU cannot be found.
- Made the `InstantiatedVmInfo` normally determined from PAPI include a GPU type that is not in the set list of GPU types we support.
- Forced a call to `lookUpSku` even when no GPUs are being used. This shows that the second `apply` method of `CostCatalogKey` in `GcpCostCatalogService.scala` correctly fails if no GPU is being used. The default error message still needs to be updated.
- Updating `GcpCostCatalogServiceSpec.scala` to the new types and checking that all tests still pass.

Testing that should still be completed:
- Verifying that our internal cost catalog contains the expected SKUs for the `nvidia-tesla-v100`, `nvidia-tesla-p100`, `nvidia-tesla-p4`, and `nvidia-tesla-t4` GPU types, each for preemptible and non-preemptible machines and all possible regions, but not including any commitment pricing SKUs.
- `nvidia-tesla-k80` was removed from Cromwell as a supported GPU type in the precursor PR to this one and should be removed from the docs page as well.
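For reference, a rough sketch of the updated type shapes described above. Field names and exact encodings here are assumptions based on this description, not the actual code:

```scala
// Hypothetical shapes only, summarizing the description above.
sealed trait ResourceType
case object Cpu extends ResourceType
case object Ram extends ResourceType
case object Gpu extends ResourceType

sealed trait UsageType            // existing: on-demand vs. preemptible
sealed trait MachineCustomization // existing: custom vs. predefined

// Either a machine type (CPU/RAM keys) or a GPU type (GPU keys); GPU cost does not depend
// on the machine type the GPU is attached to.
sealed trait ResourceInfo
final case class MachineType(value: String) extends ResourceInfo // e.g. "n1", "n2", "n2d"
final case class GpuType(value: String) extends ResourceInfo     // e.g. "nvidia-tesla-t4"

// Backend-agnostic GPU description shared by the PAPI and Batch code paths.
final case class GpuInfo(gpuCount: Long, gpuType: String)

// gpuInfo is optional because, unlike CPU and RAM, not every VM has GPUs.
final case class InstantiatedVmInfo(region: String,
                                    machineType: String,
                                    gpuInfo: Option[GpuInfo],
                                    preemptible: Boolean)

// machineCustomization is None for GPU keys, since GPU cost does not depend on whether the
// machine is custom or predefined.
final case class CostCatalogKey(resourceInfo: ResourceInfo,
                                usageType: UsageType,
                                machineCustomization: Option[MachineCustomization],
                                resourceType: ResourceType)
```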