Properly calculate a cluster's free memory #373

Draft: cunnie wants to merge 2 commits into master from cluster-free-memory-#183481442

@cunnie cunnie commented Oct 20, 2023

The vSphere CPI calculates a cluster's free memory from "effective_memory" [0], which is defined as follows [1]:

Effective memory resources (in MB) available to run virtual
machines. This is the aggregated effective resource level from all
running hosts. Hosts that are in maintenance mode or are unresponsive
are not counted. Resources used by the VMware Service Console are not
included in the aggregate. This value represents the amount of resources
available for the root resource pool for running virtual machines.

This sounds like "available free memory", but it isn't: it's the total effective memory that a cluster could give to VMs. That makes it the wrong metric for "free memory", because it does not vary with the cluster's memory consumption.

This commit changes the calculation to subtract the memory demand ("Sum of memory demand of all the powered-on VMs in the cluster" [2]) from the effective memory.

The new metric appears to more closely reflect vCenter's reporting of memory. For example, on https://vcenter-80.nono.io:

   490 GiB   effective memory, old, wrong calculation
  - 86 GiB   demand memory
   -------
   404 GiB   new, better calculation

   420 GiB   vCenter's reporting of "Available Reservation"
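
As a rough illustration of the subtraction above, here is a minimal sketch. It is not the code in this PR: `UsageSummary`, `ClusterSummary`, and `cluster_free_memory_mb` are hypothetical stand-ins, and the sketch assumes both values are reported in the same unit (MB, treated as MiB like the figures above).

```ruby
# Hypothetical stand-ins for the vSphere summary objects. Only the two
# fields used here are modelled; both are assumed to be reported in MB.
UsageSummary = Struct.new(:mem_demand_mb)
ClusterSummary = Struct.new(:effective_memory, :usage_summary)

# Free memory = what the cluster could give to VMs, minus what the
# powered-on VMs currently demand.
def cluster_free_memory_mb(summary)
  summary.effective_memory - summary.usage_summary.mem_demand_mb
end

# Figures from the example above, converted from GiB.
summary = ClusterSummary.new(490 * 1024, UsageSummary.new(86 * 1024))
free_mb = cluster_free_memory_mb(summary)
puts "#{free_mb / 1024} GiB free" # prints "404 GiB free"
```

Unlike effective memory alone, this result shrinks as the demand of powered-on VMs grows.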

Note that although this new calculation of memory is more "truthful", measuring memory is nuanced: vSphere has three different metrics for memory use (memDemandMB, memEntitledMB, memReservationMB). Furthermore, the "free" memory shown on vCenter's cluster summary page (e.g. 184 GiB) doesn't dovetail with any of the available metrics, nor any combination thereof.

Drive-by:

  • I changed VimSdk::Vim::ComputeResource::Summary → VimSdk::Vim::ClusterComputeResource::Summary in our tests because that was the object reflected in the vSphere MOB browser (we were instance-doubling the wrong object).

[0] https://github.com/cloudfoundry/bosh-vsphere-cpi-release/blob/3fc6d72d7c69b40416e78663238a15d022a98415/src/vsphere_cpi/lib/cloud/vsphere/resources/cluster.rb#L209-L216

    def fetch_cluster_utilization()
      logger.debug("Fetching Memory utilization for Cluster #{self.mob.name}")
      properties = @client.cloud_searcher.get_properties(mob, Vim::ClusterComputeResource, 'summary')
      raise "Failed to get utilization for cluster'#{self.mob.name}'" if properties.nil?
      compute_resource_summary = properties["summary"]
      return compute_resource_summary.effective_memory
    end

[1] https://vdc-download.vmware.com/vmwb-repository/dcr-public/3d076a12-29a2-4d17-9269-cb8150b5a37f/8b5969e2-1a66-4425-af17-feff6d6f705d/SDK/sms-sdk/docs/ReferenceGuide/vim.ComputeResource.Summary.html#effectiveMemory

[2] https://vdc-download.vmware.com/vmwb-repository/dcr-public/90ec343b-df7c-493e-9979-36ea55765102/8753fd1e-fcab-4bd4-9cde-a364851f31a6/vim.cluster.UsageSummary.html

Suggested release notes:

[Bug Fix] The amount of RAM available in a vSphere cluster is now calculated more conservatively. This should have no effect other than in memory-starved clusters, in which case errors will be raised earlier in the deployment phase.
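
To make "errors will be raised earlier" concrete, here is a purely illustrative sketch of a pre-flight check. The error class, the `ensure_memory_available!` helper, and the 450 GiB request are hypothetical and are not taken from the CPI's placement code. The two estimates reuse the 490 GiB and 404 GiB figures from the example above.

```ruby
# Hypothetical error class and guard; neither name comes from the CPI.
class InsufficientMemoryError < StandardError; end

def ensure_memory_available!(requested_mib, free_mib)
  return if requested_mib <= free_mib

  raise InsufficientMemoryError,
        "requested #{requested_mib} MiB, but only #{free_mib} MiB is free"
end

old_estimate_mib = 490 * 1024 # effective memory only
new_estimate_mib = 404 * 1024 # effective memory minus demand
requested_mib = 450 * 1024    # made-up request for illustration

# The old estimate lets the request through.
ensure_memory_available!(requested_mib, old_estimate_mib)

# The more conservative estimate rejects it up front.
begin
  ensure_memory_available!(requested_mib, new_estimate_mib)
rescue InsufficientMemoryError => e
  puts "rejected early: #{e.message}"
end
```

With the more conservative figure, a request that would previously have been attempted is refused before any VM is created.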

Our `fetch_resource_pool_utilization` method has been broken:
`quick_stats.host_memory_usage` was treated as if it were in KiB when it
is actually in bytes.

This didn't have much fallout; the miscalculation would've only
affected deployments to memory-starved resource pools, and the only
manifestation would've been to error later in the deployment phase
rather than earlier.

To avoid this type of error, we liberally rename the variables and
methods that return memory-related information so that their names
include the units (typically MiB).
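
A small sketch of why unit-suffixed names help. The helpers and constants below are hypothetical and are not the names used in the CPI. The same raw number yields results 1024 times apart depending on whether it is read as bytes or as KiB.

```ruby
# Hypothetical helpers (not from the CPI): every name carries its unit.
BYTES_PER_KIB = 1024
KIB_PER_MIB = 1024
BYTES_PER_MIB = BYTES_PER_KIB * KIB_PER_MIB

def bytes_to_mib(bytes)
  bytes / BYTES_PER_MIB
end

def kib_to_mib(kib)
  kib / KIB_PER_MIB
end

# 8 GiB expressed in bytes.
raw_value = 8 * 1024 * 1024 * 1024

usage_mib = bytes_to_mib(raw_value)        # read correctly as bytes
misread_usage_mib = kib_to_mib(raw_value)  # same number misread as KiB

puts "as bytes: #{usage_mib} MiB, misread as KiB: #{misread_usage_mib} MiB"
```

When the unit is part of the variable or method name, passing a value in bytes to something that expects KiB is visible at the call site.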

The vSphere CPI calculates a cluster's free memory from
"effective_memory" [0], which is defined as follows [1]:

  Effective memory resources (in MB) available to run virtual
  machines. This is the aggregated effective resource level from all
  running hosts. Hosts that are in maintenance mode or are unresponsive
  are not counted. Resources used by the VMware Service Console are not
  included in the aggregate. This value represents the amount of resources
  available for the root resource pool for running virtual machines.

This sounds like "available free memory", but it isn't: it's the total
effective memory that a cluster could give to VMs. That makes it the
wrong metric for "free memory", because it does not vary with the
cluster's memory consumption.

This commit changes the calculation to subtract from the effective
memory the memory demand, "Sum of memory demand of all the powered-on
VMs in the cluster" [2].

The new metric appears to more closely reflect vCenter's reporting of
memory. For example, on https://vcenter-80.nono.io:

   490 GiB   effective memory, old, wrong calculation
  - 86 GiB   demand memory
   -------
   404 GiB   new, better calculation

   420 GiB   vCenter's reporting of "Available Reservation"

Note that although this new calculation of memory is more "truthful",
measuring memory is nuanced: vSphere has three different metrics for
memory use (memDemandMB, memEntitledMB, memReservationMB). Furthermore,
the "free" memory shown on vCenter's cluster summary page (e.g. 184 GiB)
doesn't dovetail with _any_ of the available metrics, nor any
combination thereof.

Drive-by:

- I changed `VimSdk::Vim::ComputeResource::Summary` →
  `VimSdk::Vim::ClusterComputeResource::Summary` in our tests because
  that was the object reflected in the vSphere MOB browser (we were
  instance-doubling the wrong object).

[0] https://github.com/cloudfoundry/bosh-vsphere-cpi-release/blob/3fc6d72d7c69b40416e78663238a15d022a98415/src/vsphere_cpi/lib/cloud/vsphere/resources/cluster.rb#L209-L216

[1] https://vdc-download.vmware.com/vmwb-repository/dcr-public/3d076a12-29a2-4d17-9269-cb8150b5a37f/8b5969e2-1a66-4425-af17-feff6d6f705d/SDK/sms-sdk/docs/ReferenceGuide/vim.ComputeResource.Summary.html#effectiveMemory

[2] https://vdc-download.vmware.com/vmwb-repository/dcr-public/90ec343b-df7c-493e-9979-36ea55765102/8753fd1e-fcab-4bd4-9cde-a364851f31a6/vim.cluster.UsageSummary.html

[#183481442]
@cunnie cunnie force-pushed the cluster-free-memory-#183481442 branch from 600142d to b5d77ca Compare October 27, 2023 22:43