Get vSphere GPU performance counters #807

Open: wants to merge 2 commits into base: master

gabrielgbs97 commented Apr 8, 2025

Get GPU performance counters

Thank you for your interest in contributing to Checkmk!
Consider looking into Readme regarding process details.

General information

The vSphere special agent could retrieve GPU-related data. Currently there is no way to get GPU monitoring data from an ESXi virtualization host. The retrieved data should later be processed by a variant of cmk/plugins/vsphere/agent_based/esx_vsphere_counters.py

Proposed changes

For now, I would include the GPU-related data in the vSphere special agent output; a built-in or Exchange plugin could then process the related metrics.

  • What is the expected behavior?
    The vSphere special agent should be able to retrieve GPU performance data.
  • What is the observed behavior?
  • If it's not obvious from the above: In what way does your patch change the current behavior?
  • Consider writing a unit test that would have failed without your fix.
  • Is this a new problem? What made you submit this PR (new firmware, new device, changed device behavior)?

Commits:
  • Get GPU performance counters
  • Use only base key id
@gabrielgbs97 (Author)

GPU counters appended to the special agent output (vCenter):

gpu.mem.reserved|000:003:00.0|2219712#2219712|kiloBytes
gpu.mem.total|000:003:00.0|23580672#23580672|kiloBytes
gpu.mem.usage|000:003:00.0|941#941|percent
gpu.mem.used|000:003:00.0|2219712#2219712|kiloBytes
gpu.power.used|000:003:00.0|24#24|watt
gpu.temperature|000:003:00.0|37#37|celsius
gpu.utilization|000:003:00.0|0#0|percent
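
For reference, here is a minimal sketch of how these lines could be read. It assumes the same pipe-separated layout as the existing esx_vsphere_counters section (counter, instance, samples joined by "#", unit) and that percent counters are reported in hundredths of a percent. The sample supports that reading: gpu.mem.usage 941 matches gpu.mem.used / gpu.mem.total, roughly 9.41 %. The helper names parse_counters and latest are hypothetical and not part of the patch.

```python
# Sample copied from the special agent output above.
SAMPLE = """\
gpu.mem.reserved|000:003:00.0|2219712#2219712|kiloBytes
gpu.mem.total|000:003:00.0|23580672#23580672|kiloBytes
gpu.mem.usage|000:003:00.0|941#941|percent
gpu.mem.used|000:003:00.0|2219712#2219712|kiloBytes
gpu.power.used|000:003:00.0|24#24|watt
gpu.temperature|000:003:00.0|37#37|celsius
gpu.utilization|000:003:00.0|0#0|percent
"""


def parse_counters(text):
    """Build {counter: {instance: [(samples, unit), ...]}} from raw lines."""
    parsed = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        counter, instance, multivalues, unit = line.split("|")
        entry = (multivalues.split("#"), unit)
        parsed.setdefault(counter, {}).setdefault(instance, []).append(entry)
    return parsed


def latest(parsed, counter, instance):
    """Return the most recent sample as a float.

    Assumption: 'percent' values are hundredths of a percent.
    """
    samples, unit = parsed[counter][instance][0]
    value = float(samples[-1])
    return value / 100.0 if unit == "percent" else value


parsed = parse_counters(SAMPLE)
gpu = "000:003:00.0"
mem_usage = latest(parsed, "gpu.mem.usage", gpu)
mem_ratio = 100.0 * latest(parsed, "gpu.mem.used", gpu) / latest(parsed, "gpu.mem.total", gpu)
print(f"reported: {mem_usage:.2f} %, computed: {mem_ratio:.2f} %")
```

The two numbers agree to two decimals, which is why the check plugin further down divides the raw percent value by 100.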

gabrielgbs97 commented Apr 9, 2025

I tried to build a minimal working service, GPU utilization, on top of these counters.
On my installation (Checkmk 2.3.0p28), agent_vsphere was still based on the v1 API, so I wrote an isolated plugin using the modern agent-based v2 API.

Here is a standalone GPU utilization example (a local agent-based v2 plugin):

/omd/sites/main_site/local/lib/python3/cmk_addons/plugins/agent_vsphere_plus/agent_based/esx_vsphere_counters_gpu.py, derived from esx_vsphere_counters.py

#!/usr/bin/env python3
# Copyright (C) 2019 Checkmk GmbH - License: GNU General Public License v2
# This file is part of Checkmk (https://checkmk.com). It is subject to the terms and
# conditions defined in the file COPYING, which is part of this source code package.

import time
from collections.abc import Mapping, Sequence
from typing import Any

from cmk.utils import debug

from cmk.agent_based.v2 import (
    AgentSection,
    CheckPlugin,
    CheckResult,
    DiscoveryResult,
    get_value_store,
    IgnoreResultsError,
    RuleSetType,
    Service,
    State,  # used for the UNKNOWN result in the check function
    StringTable,
    check_levels,
    render,
    Result,
)
from cmk.plugins.lib import diskstat, esx_vsphere, interfaces
from cmk.plugins.lib.esx_vsphere import Section, SubSectionCounter

# Example output:
# <<<esx_vsphere_counters:sep(124)>>>
# net.broadcastRx|vmnic0|11|number
# net.broadcastRx||11|number
# net.broadcastTx|vmnic0|0|number
# net.broadcastTx||0|number
# net.bytesRx|vmnic0|3820|kiloBytesPerSecond
# net.bytesRx|vmnic1|0|kiloBytesPerSecond
# net.bytesRx|vmnic2|0|kiloBytesPerSecond
# net.bytesRx|vmnic3|0|kiloBytesPerSecond
# net.bytesRx||3820|kiloBytesPerSecond
# net.bytesTx|vmnic0|97|kiloBytesPerSecond
# net.bytesTx|vmnic1|0|kiloBytesPerSecond
# net.bytesTx|vmnic2|0|kiloBytesPerSecond
# net.bytesTx|vmnic3|0|kiloBytesPerSecond
# net.bytesTx||97|kiloBytesPerSecond
# net.droppedRx|vmnic0|0|number
# net.droppedRx|vmnic1|0|number
# net.droppedRx|vmnic2|0|number
# net.droppedRx|vmnic3|0|number
# net.droppedRx||0|number
# net.droppedTx|vmnic0|0|number
# net.droppedTx|vmnic1|0|number
# ...
# datastore.read|4c4ece34-3d60f64f-1584-0022194fe902|0#1#2|kiloBytesPerSecond
# datastore.read|4c4ece5b-f1461510-2932-0022194fe902|0#4#5|kiloBytesPerSecond
# datastore.numberReadAveraged|511e4e86-1c009d48-19d2-bc305bf54b07|0#0#0|number
# datastore.numberWriteAveraged|4c4ece34-3d60f64f-1584-0022194fe902|0#0#1|number
# datastore.totalReadLatency|511e4e86-1c009d48-19d2-bc305bf54b07|0#5#5|millisecond
# datastore.totalWriteLatency|4c4ece34-3d60f64f-1584-0022194fe902|0#2#7|millisecond
# ...
# sys.uptime||630664|second


def parse_esx_vsphere_counters(string_table: StringTable) -> esx_vsphere.SectionCounter:
    """
    >>> from pprint import pprint
    >>> pprint(parse_esx_vsphere_counters([
    ... ['disk.numberReadAveraged', 'naa.5000cca05688e814', '0#0', 'number'],
    ... ['disk.write',
    ...  'naa.6000eb39f31c58130000000000000015',
    ...  '0#0',
    ...  'kiloBytesPerSecond'],
    ... ['net.bytesRx', 'vmnic0', '1#1', 'kiloBytesPerSecond'],
    ... ['net.droppedRx', 'vmnic1', '0#0', 'number'],
    ... ['net.errorsRx', '', '0#0', 'number'],
    ... ]))
    {'disk.numberReadAveraged': {'naa.5000cca05688e814': [(['0', '0'], 'number')]},
     'disk.write': {'naa.6000eb39f31c58130000000000000015': [(['0', '0'],
                                                              'kiloBytesPerSecond')]},
     'net.bytesRx': {'vmnic0': [(['1', '1'], 'kiloBytesPerSecond')]},
     'net.droppedRx': {'vmnic1': [(['0', '0'], 'number')]},
     'net.errorsRx': {'': [(['0', '0'], 'number')]}}
    """

    parsed: dict[str, dict[str, list[tuple[esx_vsphere.CounterValues, str]]]] = {}
    # The data reported by the ESX system is split into multiple real time samples with
    # a fixed duration of 20 seconds. A check interval of one minute reports 3 samples
    # The esx_vsphere_counters checks need to figure out by themselves how to handle this data
    for counter, instance, multivalues, unit in string_table:
        values = multivalues.split("#")
        parsed.setdefault(counter, {})
        parsed[counter].setdefault(instance, [])
        parsed[counter][instance].append((values, unit))
    return parsed

# .--GPU--------------------.
# |                         |
# |     ____ ____  _   _    |
# |    / ___|  _ \| | | |   |
# |   | |  _| |_) | | | |   |
# |   | |_| |  __/| |_| |   |
# |    \____|_|    \___/    |
# |                         |
# '-------------------------'

# Sample
# gpu.mem.reserved|000:003:00.0|318976|kiloBytes
# gpu.mem.total|000:003:00.0|23580672|kiloBytes
# gpu.mem.usage|000:003:00.0|135|percent
# gpu.mem.used|000:003:00.0|318976|kiloBytes
# gpu.power.used|000:003:00.0|21|watt
# gpu.temperature|000:003:00.0|35|celsius
# gpu.utilization|000:003:00.0|0|percent

def discover_esx_vsphere_counters_gpu_util(
    section: esx_vsphere.SectionCounter,
) -> DiscoveryResult:
    if debug.enabled():
        print("[plugin esx_vsphere_counters_gpu] ESX GPU service discovery called")
    # One service per GPU instance that reports the gpu.utilization counter
    for gpu_id in section.get("gpu.utilization", {}):
        if debug.enabled():
            print("Found gpu.utilization gpu_id=", gpu_id)
        yield Service(item=gpu_id)

def check_esx_vsphere_counters_gpu_util(
    item: str,
    params: Mapping[str, Any],
    section: esx_vsphere.SectionCounter,
) -> CheckResult:
    data = section.get("gpu.utilization", {}).get(item)
    multivalues, _unit = data[0] if data else (None, None)
    if multivalues is None:
        yield Result(state=State.UNKNOWN, summary="No GPU utilization values found")
        return
    # Samples arrive as strings, so convert before scaling.
    # The raw value is divided by 100 to obtain percent.
    gpu_utilization = float(multivalues[0]) / 100
    yield from check_levels(
        gpu_utilization,
        render_func=render.percent,
        metric_name="esx_gpu_utilization",
        label="Utilization",
    )

check_plugin_esx_vsphere_gpu_util = CheckPlugin(
    name="esx_vsphere_counters_gpu",
    sections=["esx_vsphere_counters"],
    service_name="GPU Utilization %s",
    discovery_function=discover_esx_vsphere_counters_gpu_util,
    check_function=check_esx_vsphere_counters_gpu_util,
    check_default_parameters={}
)

This works and does not collide with the v1-based agent_vsphere (2.3.0pX). It has not been tested against master.
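
The check above uses only the first sample of each counter line. The parser comment notes that one check interval can carry several 20-second samples, so one possible alternative is to average them. The following is a sketch only, not what the patch does: average_percent is a hypothetical helper, and it assumes the samples are strings in hundredths of a percent.

```python
from collections.abc import Sequence
from typing import Optional


def average_percent(samples: Sequence[str]) -> Optional[float]:
    """Average raw samples and scale them to percent.

    Returns None when no usable sample is present, so the caller can
    decide how to report the missing data.
    """
    values = [float(s) for s in samples if s != ""]
    if not values:
        return None
    return sum(values) / len(values) / 100.0


# Three 20-second samples collected during a one-minute check interval
print(average_percent(["0", "100", "200"]))
```

Whether the average or the latest sample is the better choice depends on how the value should react to short load spikes.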

gabrielgbs97 changed the title from "Get GPU performance counters" to "Get vSphere GPU performance counters" on Apr 9, 2025