Differing queries #12

Open
jounaidr opened this issue Jun 20, 2024 · 14 comments

@jounaidr

jounaidr commented Jun 20, 2024

Hi,

Thanks for this project, it is exactly what I was looking for for a similar HPC system. I have managed to set up and run the platform with Grafana, but it required syntactic changes to the queries, and looking at the other issues it appears there is some discrepancy between exporter versions. I believe I was using the modified cgroups exporter provided, but to get it working I had to change the queries to match the metric labels shown in Prometheus. I am attempting to consolidate the queries into the config file for our setup so it would be somewhat easier to change them in the future if we add new exporters or if the existing exporter queries change. I'm also thinking the issues could be caused by my Prometheus config, as what was in the docs was not working so I just made a basic one (I am new to Prometheus :p). Please feel free to close this if it's not acceptable!

Here is my Prometheus config:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'node-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9100','localhost:9306','localhost:9821']
@plazonic
Collaborator

Hello,

good to hear that you managed to get it working but it seems like you had some copy/paste problems? I do not see your configuration attached so not sure what you mean or what problem you had.

@jounaidr
Author

Ah yes, sorry, I updated the comment and it should be there now! Thanks for the quick response!

@jounaidr
Author

and this is what I changed the queries to within jobstats:

    def get_job_stats(self):
        # query CPU and Memory utilization data
        self.get_data('total_memory', "max_over_time(cgroup_memory_total_bytes{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('used_memory', "max_over_time(cgroup_memory_rss_bytes{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('total_time', "max_over_time(cgroup_cpu_total_seconds{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('cpus', "max_over_time(cgroup_cpus{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
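
A minimal sketch of keeping these query strings in one place, as mentioned in the first comment, so that only a mapping needs editing when the exporter's metric or label names change. `SELECTOR`, `METRICS` and `build_query` are hypothetical names and not part of jobstats; the sketch assumes the `%s` placeholder is the job ID and `%d` is the range window in seconds.

```python
# Hypothetical helper, not part of jobstats.
# Label selector copied from the queries above; %s is assumed to be the job ID.
SELECTOR = "cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'"

# Result key -> metric name, taken from the four get_data() calls above.
METRICS = {
    "total_memory": "cgroup_memory_total_bytes",
    "used_memory": "cgroup_memory_rss_bytes",
    "total_time": "cgroup_cpu_total_seconds",
    "cpus": "cgroup_cpus",
}


def build_query(name, jobid, seconds):
    """Build the max_over_time() query for one metric, job ID and window."""
    selector = SELECTOR % jobid
    return "max_over_time(%s{%s}[%ds])" % (METRICS[name], selector, seconds)


if __name__ == "__main__":
    for key in METRICS:
        print(key, "->", build_query(key, "42", 3600))
```

With this layout, switching to an exporter that exposes a `jobid` label would only require changing `SELECTOR`.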

@plazonic
Collaborator

You need the cluster label in your scrape_config - it is mentioned in our main README and I also expanded on it in issue #8 - we expect that to be added at collection time.

@jounaidr
Author

Ok thanks, I was having issues with the labels previously but I will try and update the issue, tysm :)

@jounaidr
Author

jounaidr commented Jun 27, 2024

Hi, so I updated the Prometheus config; after some syntax issues it is happy and running:

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'node-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    file_sd_configs:
      - files:
          - "/root/prometheus-2.51.2.linux-amd64/localnodes.json"
    metric_relabel_configs:
      - source_labels:
          - __name__
        regex: "^go_.*"
        action: drop

and the nodes json:

 [
   {
     "labels": {
       "cluster": "localcluster",
       "service": "compute"
     },
     "targets": [
       "localhost:9100",
       "localhost:9821",
       "localhost:9306"
     ]
   }
 ]
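
A quick way to double-check that every target group in a file like this carries the `cluster` label is sketched below. `groups_missing_label` is a hypothetical helper, and the second target group in the example (`localhost:9999`, no labels) is made up purely to show a failing case.

```python
import json


def groups_missing_label(file_sd_text, label="cluster"):
    """Return the indexes of target groups that do not set `label`."""
    groups = json.loads(file_sd_text)
    return [
        index
        for index, group in enumerate(groups)
        if label not in group.get("labels", {})
    ]


# First group mirrors the file above; the second is an invented bad example.
EXAMPLE = """
[
  {
    "labels": {"cluster": "localcluster", "service": "compute"},
    "targets": ["localhost:9100", "localhost:9821", "localhost:9306"]
  },
  {
    "targets": ["localhost:9999"]
  }
]
"""

if __name__ == "__main__":
    print("groups without a cluster label:", groups_missing_label(EXAMPLE))
```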

However, it is still only working with the changed queries :/ Just to confirm, I am using the following cgroups exporter: https://github.com/treydock/cgroup_exporter/releases/tag/v0.9.1

I just saw there's a new release so I'll try that as well.

@plazonic
Collaborator

Hello,

no, it can't be treydock's exporter. It has to be one of the two modified versions on my github page (either master branch for cgroup v1 or cgroupv2 branch if you have something like rhel9 and are running cgroupv2). If it is working you will see jobid tags (if there are active jobs on the node), e.g.:

cgroup_memory_cache_bytes{jobid="57205080",step="",task=""} 1.16293632e+08
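
To check a sample line like the one above for the `jobid` label without reading the output by eye, something along these lines could work. It is a rough sketch with hypothetical helper names, and the regex is deliberately simple: it does not handle escaped quotes inside label values.

```python
import re

# Matches label pairs such as jobid="57205080" inside the {...} part of a line.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sample_labels(line):
    """Return the labels of one exposition-format sample line as a dict."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def has_jobid(line):
    """True if the sample line carries a jobid label."""
    return "jobid" in sample_labels(line)


if __name__ == "__main__":
    line = 'cgroup_memory_cache_bytes{jobid="57205080",step="",task=""} 1.16293632e+08'
    print(sample_labels(line), has_jobid(line))
```

The lines to feed in would be the `cgroup_` lines from the exporter's metrics page.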

@jounaidr
Author

Ahhhh okay yeh makes sense, this is most likely the problem then I believe, I will try tomorrow and confirm, thanks :)

@jounaidr
Author

jounaidr commented Jul 3, 2024

Hi, I've attempted to build the cgroup exporter from your repo a couple of times on the v2 and master branches, but it always seems to give me the same queries as before, without the job id tag. Possibly it is still pulling things from treydock's repo, as it won't let me build straight with make; I have to run go get github.com/treydock/cgroup_exporter, which I believe pulls some things from the original repo. Is there any chance you have the modified cgroup exporter binaries uploaded somewhere so I don't have to build them myself? Thanks!

@plazonic
Collaborator

plazonic commented Jul 3, 2024

Hello,

first of all make sure you are using the correct branch: master is for cgroup v1 and cgroupv2 is, obviously, for v2. The easiest way to tell which one you have is to check the mounts: there will be multiple cgroup mounts for v1 and only one cgroup2 mount for v2. E.g. for one of our systems:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
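
The mount check can also be scripted. Below is a minimal sketch that takes the text printed by `mount` and applies the rule described above; `cgroup_version` is a hypothetical name, and treating a host that has any `cgroup` mounts as v1 even when a `cgroup2` mount is also present is an assumption of this sketch. The v1 sample lines are illustrative only.

```python
import re


def cgroup_version(mount_output):
    """Guess the cgroup version from the text printed by `mount`.

    Returns "v1", "v2", or None when no cgroup mounts are found.
    """
    fs_types = re.findall(r"\btype (cgroup2?)\b", mount_output)
    if "cgroup" in fs_types:
        return "v1"  # assumption: any v1 controller mount means the master branch
    if "cgroup2" in fs_types:
        return "v2"
    return None


V2_SAMPLE = (
    "cgroup2 on /sys/fs/cgroup type cgroup2 "
    "(rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)\n"
)

# Illustrative v1-style lines, not copied from a real system.
V1_SAMPLE = (
    "cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)\n"
    "cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)\n"
)

if __name__ == "__main__":
    print(cgroup_version(V2_SAMPLE), cgroup_version(V1_SAMPLE))
```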

Next, check [issue 8](https://github.com//issues/8#issuecomment-2064599206), where I answered a similar question, showed how to build, and provided details on how to recognize that the build is correct.

@jounaidr
Author

jounaidr commented Jul 8, 2024

Thanks, I believe I have the modified versions building now as the queries have changed! However, I believe they are still not correct: for example, in cgroup_cpu_user_seconds{cluster="localcluster", instance="localhost:9306", job="prometheus", service="compute"} the jobID is omitted. This is the case for both the v1 and v2 branches, and as expected Grafana no longer works when specifying a job id with the new queries. I most likely made a simple mistake somewhere, if you have any ideas :p

@plazonic
Collaborator

Hullo, the issue 8 seemed to imply that you solved this? Or are you still having this problem?

@jounaidr
Author

Hi, this issue is still ongoing and we have still had to use the modified jobstats script within our deployment. Have there been any updates to the modified exporter repo, or has anyone had similar issues since our last discussion? Thanks.

@plazonic
Collaborator

I cleaned up the cgroup_exporter a bit to remove direct references to other cgroup_exporters. What do your cgroup slurm subdirs look like? E.g. do you have something like

/sys/fs/cgroup/memory/slurm/uid_359983/job_60600749

or is it (cgroup v2):

/sys/fs/cgroup/system.slice/slurmstepd.scope/job_100311/step_batch/slurm

What does cgroup_exporter --help output?
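
For comparing layouts, here is a minimal sketch that pulls the uid and job ID out of a cgroup path with regular expressions. It only illustrates how the two example paths above differ; it is not the exporter's actual parsing logic, and `parse_slurm_cgroup` is a hypothetical name.

```python
import re

# A path component of the form job_<digits> or uid_<digits>.
JOB_RE = re.compile(r"/job_(\d+)(?:/|$)")
UID_RE = re.compile(r"/uid_(\d+)(?:/|$)")


def parse_slurm_cgroup(path):
    """Extract the uid and job ID (as strings, or None) from a cgroup path."""
    job = JOB_RE.search(path)
    uid = UID_RE.search(path)
    return {
        "jobid": job.group(1) if job else None,
        "uid": uid.group(1) if uid else None,
    }


if __name__ == "__main__":
    for path in (
        "/sys/fs/cgroup/memory/slurm/uid_359983/job_60600749",
        "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_100311/step_batch/slurm",
    ):
        print(path, "->", parse_slurm_cgroup(path))
```

The v1-style path carries both a uid and a job component, while the v2-style path has only the job component.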
