Add GPU (nvidia-smi) monitoring to telegraf, grafana #77

Open
christopheredsall opened this issue Jun 18, 2020 · 2 comments
Labels
enhancement (New feature or request), good first issue (Post a comment if you're interested in helping)

Comments

christopheredsall commented Jun 18, 2020

There is an nvidia-smi plugin for telegraf, and there are dashboards available on grafana.com.

What is needed is to uncomment the following lines (or, at a minimum, the [[inputs.nvidia_smi]] line) in /etc/telegraf/telegraf.conf:

# # Pulls statistics from nvidia GPUs attached to the host
# [[inputs.nvidia_smi]]
#   ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
#   # bin_path = "/usr/bin/nvidia-smi"
#
#   ## Optional: timeout for GPU polling
#   # timeout = "5s"

We would probably want to do this on only the nodes that have GPUs.
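
One way to restrict the change to GPU nodes is to check for the nvidia-smi binary before touching the config. The following is only a sketch: the function name `enable_nvidia_smi_input` is made up for illustration, it assumes GNU sed (for `--in-place`), and it assumes that the presence of `nvidia-smi` on `$PATH` is a good enough test for "this node has a GPU".

```shell
# Hypothetical helper (not part of this repository): uncomment the
# [[inputs.nvidia_smi]] header only on hosts where nvidia-smi is installed.
enable_nvidia_smi_input() {
    conf="${1:-/etc/telegraf/telegraf.conf}"

    # Assumption: a node without nvidia-smi on PATH has no GPU to monitor.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; leaving $conf unchanged"
        return 0
    fi

    # Remove the leading "# " from the plugin header line only; the optional
    # bin_path and timeout settings stay commented out (plugin defaults).
    sed --in-place -e '/\[\[inputs\.nvidia_smi]]/ s/^#[[:space:]]*//' "$conf"
}
```

Calling `enable_nvidia_smi_input` with no argument would edit /etc/telegraf/telegraf.conf; telegraf then still needs a reload (for example `systemctl reload telegraf.service`) to pick up the change.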

christopheredsall added the enhancement and good first issue labels Jun 18, 2020
@christopheredsall

e.g. enabling the plugin on a GPU node, reloading telegraf, and then querying InfluxDB from the management node:

[root@vm-gpu2-1-ad1-0001 ~]# sed --in-place  -e '/inputs.nvidia_smi/ s/^#//' /etc/telegraf/telegraf.conf
[root@vm-gpu2-1-ad1-0001 ~]# systemctl reload telegraf.service 
[root@mgmt ~]# influx -database 'telegraf' -execute 'select * from nvidia_smi where time > now()-20s' -format 'json' -pretty
{
    "results": [
        {
            "series": [
                {
                    "name": "nvidia_smi",
                    "columns": [
                        "time",
                        "clocks_current_graphics",
                        "clocks_current_memory",
                        "clocks_current_sm",
                        "clocks_current_video",
                        "compute_mode",
                        "encoder_stats_average_fps",
                        "encoder_stats_average_latency",
                        "encoder_stats_session_count",
                        "host",
                        "index",
                        "memory_free",
                        "memory_total",
                        "memory_used",
                        "name",
                        "pcie_link_gen_current",
                        "pcie_link_width_current",
                        "power_draw",
                        "pstate",
                        "temperature_gpu",
                        "utilization_gpu",
                        "utilization_memory",
                        "uuid"
                    ],
                    "values": [
                        [
                            1592478071000000000,
                            405,
                            715,
                            405,
                            835,
                            "Default",
                            0,
                            0,
                            0,
                            "vm-gpu2-1-ad1-0001",
                            "0",
                            16280,
                            16280,
                            0,
                            "Tesla P100-SXM2-16GB",
                            3,
                            16,
                            28.1,
                            "P0",
                            41,
                            4,
                            0,
                            "GPU-29282894-1d8f-08c2-9c01-d34c831b1e4d"
                        ]
                    ]
                }
            ]
        }
    ]
}

@christopheredsall

Dashboard 12225 seems to work as long as we change the query behind its "hostname" variable to read from "system" instead of "win_system":

--- 12225.json.orig	2020-06-18 18:25:42.000000000 +0100
+++ 12225.json	2020-06-18 17:26:16.000000000 +0100
@@ -2734,7 +2734,7 @@
         "multi": false,
         "name": "hostname",
         "options": [],
-        "query": "SHOW TAG VALUES FROM \"win_system\" WITH KEY = \"host\"",
+        "query": "SHOW TAG VALUES FROM \"system\" WITH KEY = \"host\"",
         "refresh": 1,
         "regex": "",
         "skipUrlSync": false,
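
The same one-line change could be scripted rather than edited by hand. This is a rough sketch only: `patch_hostname_query` is a hypothetical name, GNU sed is assumed, and the dashboard JSON is assumed to contain the query string exactly as it appears in the diff above.

```shell
# Hypothetical helper: apply the win_system -> system change from the diff
# to a downloaded copy of the dashboard JSON.
patch_hostname_query() {
    dashboard="$1"

    # Keep the untouched download alongside, mirroring 12225.json.orig.
    cp -- "$dashboard" "$dashboard.orig"

    # The JSON file holds literal backslash-escaped quotes, so each \" in the
    # file is written as \\" in both the sed pattern and the replacement.
    sed --in-place \
        -e 's/FROM \\"win_system\\" WITH KEY = \\"host\\"/FROM \\"system\\" WITH KEY = \\"host\\"/' \
        "$dashboard"
}
```

For example, `patch_hostname_query 12225.json` would rewrite the file in place and leave the original as 12225.json.orig. Only the hostname variable query is touched; any other panel in the dashboard that refers to Windows-specific measurements would need separate attention.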
