A prometheus exporter for playbook metrics ? #177

Open
dmsimard opened this issue Sep 24, 2020 · 14 comments

Comments

@dmsimard
Contributor

dmsimard commented Sep 24, 2020

What is the idea ?

I haven't yet looked at what this would look like in practice but we have a lot of metrics in ara, such as:

We could explore how to make these metrics useful for monitoring or fancy graphs.

@IlyesSemlali

Hi !

I think I can take a look at that. I'm not very familiar with ara, but I used it a while ago and I'm also a casual prometheus user. I'll dive into the code this weekend to see if I can be of any help 😉

@dmsimard
Contributor Author

dmsimard commented Oct 8, 2020

@IlyesSemlali nice, thanks for looking at it!
Feel free to reach out on slack or irc and I can help point you in the right direction.

@dmsimard
Contributor Author

dmsimard commented Nov 1, 2020

@IlyesSemlali o/

Wanted to let you know that I've proposed an implementation to query ara in order to return metrics about tasks: https://review.opendev.org/#/c/760736/

For example, metrics from the last 1000 tasks:

> ara task metrics --limit 1000
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+
| action                | duration_total | duration_avg   | occurrences | completed | running | expired | unknown |
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+
| add_host              | 0:00:02.896122 | 0:00:00.160896 |          18 |        18 |       0 |       0 |       0 |
| ara_playbook          | 0:00:07.786123 | 0:00:00.432562 |          18 |        18 |       0 |       0 |       0 |
| ara_record            | 0:00:44.732937 | 0:00:00.451848 |          99 |        99 |       0 |       0 |       0 |
| assemble              | 0:00:03.295141 | 0:00:01.647570 |           2 |         2 |       0 |       0 |       0 |
| assert                | 0:00:18.892062 | 0:00:00.174926 |         108 |       108 |       0 |       0 |       0 |
| command               | 0:00:33.964852 | 0:00:00.640846 |          53 |        44 |       9 |       0 |       0 |
| copy                  | 0:00:16.585412 | 0:00:01.658541 |          10 |        10 |       0 |       0 |       0 |
| debug                 | 0:04:41.404371 | 0:00:01.839244 |         153 |       153 |       0 |       0 |       0 |
| fail                  | 0:00:01.752986 | 0:00:00.159362 |          11 |        11 |       0 |       0 |       0 |
| file                  | 0:00:19.609704 | 0:00:01.032090 |          19 |        19 |       0 |       0 |       0 |
| find                  | 0:00:00.828836 | 0:00:00.414418 |           2 |         2 |       0 |       0 |       0 |
| gather_facts          | 0:00:47.211225 | 0:00:01.026331 |          46 |        46 |       0 |       0 |       0 |
| group_by              | 0:00:01.583436 | 0:00:00.395859 |           4 |         4 |       0 |       0 |       0 |
| include_role          | 0:00:18.616065 | 0:00:00.282062 |          66 |        66 |       0 |       0 |       0 |
| include_tasks         | 0:00:29.881530 | 0:00:00.335748 |          89 |        89 |       0 |       0 |       0 |
| kolla_container_facts | 0:00:00.966302 | 0:00:00.966302 |           1 |         1 |       0 |       0 |       0 |
| kolla_docker          | 0:01:45.791232 | 0:00:03.111507 |          34 |        34 |       0 |       0 |       0 |
| kolla_toolbox         | 0:06:10.222628 | 0:00:04.936302 |          75 |        75 |       0 |       0 |       0 |
| merge_configs         | 0:01:17.487438 | 0:00:02.869905 |          27 |        27 |       0 |       0 |       0 |
| modprobe              | 0:00:01.730669 | 0:00:00.576890 |           3 |         3 |       0 |       0 |       0 |
| ping                  | 0:00:04.945961 | 0:00:00.549551 |           9 |         9 |       0 |       0 |       0 |
| set_fact              | 0:00:38.433904 | 0:00:00.541323 |          71 |        71 |       0 |       0 |       0 |
| setup                 | 0:00:09.907015 | 0:00:01.100779 |           9 |         9 |       0 |       0 |       0 |
| shell                 | 0:00:00.775224 | 0:00:00.387612 |           2 |         2 |       0 |       0 |       0 |
| stat                  | 0:00:03.386744 | 0:00:00.211672 |          16 |        16 |       0 |       0 |       0 |
| sysctl                | 0:00:09.490360 | 0:00:04.745180 |           2 |         2 |       0 |       0 |       0 |
| systemd               | 0:00:00.290391 | 0:00:00.290391 |           1 |         1 |       0 |       0 |       0 |
| template              | 0:01:16.878769 | 0:00:01.507427 |          51 |        51 |       0 |       0 |       0 |
| wait_for              | 0:00:00.704144 | 0:00:00.704144 |           1 |         1 |       0 |       0 |       0 |
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+

The CLI framework lets us return that data in json or csv which I suppose could then be made available by a prometheus exporter:

> ara task metrics --limit 1000 -f csv
"action","duration_total","duration_avg","occurrences","completed","running","expired","unknown"
"add_host","0:00:02.896122","0:00:00.160896",18,18,0,0,0
"ara_playbook","0:00:07.786123","0:00:00.432562",18,18,0,0,0
"ara_record","0:00:44.732937","0:00:00.451848",99,99,0,0,0
"assemble","0:00:03.295141","0:00:01.647570",2,2,0,0,0
"assert","0:00:18.892062","0:00:00.174926",108,108,0,0,0
"command","0:00:33.964852","0:00:00.640846",53,44,9,0,0
"copy","0:00:16.585412","0:00:01.658541",10,10,0,0,0
"debug","0:04:41.404371","0:00:01.839244",153,153,0,0,0
"fail","0:00:01.752986","0:00:00.159362",11,11,0,0,0
"file","0:00:19.609704","0:00:01.032090",19,19,0,0,0
"find","0:00:00.828836","0:00:00.414418",2,2,0,0,0
"gather_facts","0:00:47.211225","0:00:01.026331",46,46,0,0,0
"group_by","0:00:01.583436","0:00:00.395859",4,4,0,0,0
"include_role","0:00:18.616065","0:00:00.282062",66,66,0,0,0
"include_tasks","0:00:29.881530","0:00:00.335748",89,89,0,0,0
"kolla_container_facts","0:00:00.966302","0:00:00.966302",1,1,0,0,0
"kolla_docker","0:01:45.791232","0:00:03.111507",34,34,0,0,0
"kolla_toolbox","0:06:10.222628","0:00:04.936302",75,75,0,0,0
"merge_configs","0:01:17.487438","0:00:02.869905",27,27,0,0,0
"modprobe","0:00:01.730669","0:00:00.576890",3,3,0,0,0
"ping","0:00:04.945961","0:00:00.549551",9,9,0,0,0
"set_fact","0:00:38.433904","0:00:00.541323",71,71,0,0,0
"setup","0:00:09.907015","0:00:01.100779",9,9,0,0,0
"shell","0:00:00.775224","0:00:00.387612",2,2,0,0,0
"stat","0:00:03.386744","0:00:00.211672",16,16,0,0,0
"sysctl","0:00:09.490360","0:00:04.745180",2,2,0,0,0
"systemd","0:00:00.290391","0:00:00.290391",1,1,0,0,0
"template","0:01:16.878769","0:00:01.507427",51,51,0,0,0
"wait_for","0:00:00.704144","0:00:00.704144",1,1,0,0,0

Let me know what you think.
This is for tasks, but we can take a similar approach for more granular host and result metrics too.
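
As a rough illustration of how the CSV output above could feed an exporter, here is a minimal sketch that converts rows into lines of the Prometheus text exposition format. It is not an existing ara feature: the metric names `ara_task_duration_seconds` and `ara_task_occurrences` are hypothetical, and the duration parser assumes values shorter than one day in the `H:MM:SS.ffffff` form shown in the table.

```python
import csv
import io

# Header and two rows copied from the CSV output above.
SAMPLE_CSV = '''"action","duration_total","duration_avg","occurrences","completed","running","expired","unknown"
"command","0:00:33.964852","0:00:00.640846",53,44,9,0,0
"debug","0:04:41.404371","0:00:01.839244",153,153,0,0,0
'''


def duration_to_seconds(value):
    """Convert an 'H:MM:SS.ffffff' string into float seconds (assumes < 1 day)."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def csv_to_exposition(text):
    """Render each CSV row as two samples in the text exposition format."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        label = '{action="%s"}' % row["action"]
        total = duration_to_seconds(row["duration_total"])
        # Hypothetical metric names, chosen only for this sketch.
        lines.append("ara_task_duration_seconds%s %.6f" % (label, total))
        lines.append("ara_task_occurrences%s %d" % (label, int(row["occurrences"])))
    return "\n".join(lines)


print(csv_to_exposition(SAMPLE_CSV))
```

Converting durations to seconds up front keeps the values numeric, which is what Prometheus expects for samples.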

@dmsimard
Contributor Author

dmsimard commented Nov 4, 2020

There are other works in progress for ara playbook metrics and ara host metrics.

I don't have a lot of experience with prometheus but there is a client library that we could use in python: https://pypi.org/project/prometheus-client/

@b-reich

b-reich commented Feb 8, 2022

@IlyesSemlali @dmsimard any further plans here?

@dmsimard
Contributor Author

dmsimard commented Feb 8, 2022

@b-reich I am not using prometheus at this time and so I am not pursuing this right now.

You may want to look at the following CLI commands to find out if there is something that can help:

@TibScript

Hello,

In my work, I use Ansible with ARA and the possibility of having metrics in the Prometheus/Grafana stack interests me a lot.
I don't know if anyone is working on this, but on my end I'm forking the repo to try and add the use of prometheus-client for a metrics page. If I manage to have a metrics page, I will set up a first series of metrics and I will make a Grafana dashboard to go with it.

@dmsimard
Contributor Author

Hi @TibScript, this may surface on my end in the not too distant future. Did you end up with something that works ?

@dmsimard
Contributor Author

dmsimard commented Feb 20, 2023

I am experimenting and learning this as I go -- I don't believe I am using the right approach but I'm sharing this in case anyone has suggestions, comments or would like to improve on it: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0

I've included a sample of the /metrics endpoint in the gist.

Edit: I forgot to mention that playbook and task durations are notably missing from this first iteration. I've tried to implement them using a Summary (instead of a Gauge) but haven't really figured out how they work yet.

Here are two sample screenshots showing that it works with a local prometheus instance:
Screenshot from 2023-02-19 21-19-24
Screenshot from 2023-02-19 21-19-56

@dmsimard
Contributor Author

Still not sure where this is going (maybe I should put this in a branch and a PR), but here are a few updates:

  • Added support for querying results through pagination
  • Query everything at boot using a result limit (i.e., ?limit=1000) and pagination
  • Store the latest object timestamp so that the next scrape only picks up objects created after it, using ?created_after=<timestamp> (thanks to built-in support in the API)
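
The pagination and incremental scrape described in the list above can be sketched roughly as follows. This is not the code from the gist: `fake_query` is a hypothetical in-memory stand-in for the HTTP API, and its `offset` parameter and `next` flag are assumptions made for the sake of the example.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for records stored on the server.
FAKE_TASKS = [
    {"id": i, "created": datetime(2023, 2, 20, 22, i, tzinfo=timezone.utc)}
    for i in range(7)
]


def fake_query(created_after=None, limit=3, offset=0):
    """Hypothetical replacement for a GET with created_after, limit and offset."""
    matches = [
        t for t in FAKE_TASKS
        if created_after is None or t["created"] > created_after
    ]
    page = matches[offset:offset + limit]
    return {"results": page, "next": offset + limit < len(matches)}


def collect(created_after=None, limit=3):
    """Walk every page, then return the records and the newest timestamp seen."""
    records, offset = [], 0
    while True:
        response = fake_query(created_after=created_after, limit=limit, offset=offset)
        records.extend(response["results"])
        if not response["next"]:
            break
        offset += limit
    # Keep the previous bookmark when nothing new was returned.
    latest = max((r["created"] for r in records), default=created_after)
    return records, latest


# First scrape: everything is fetched, three records per page.
records, latest = collect()

# A new record shows up; the next scrape only asks for what came after `latest`.
FAKE_TASKS.append({"id": 7, "created": datetime(2023, 2, 20, 23, 0, tzinfo=timezone.utc)})
new_records, latest = collect(created_after=latest)
print(len(records), len(new_records))
```

Storing only the newest timestamp keeps the state between scrapes small, at the cost of missing objects that are modified (rather than created) after the bookmark.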

I still want to do something about durations and timestamps but haven't gotten to it yet.

In the meantime, here's what it looks like to ingest every playbook, task and host from demo.recordsansible.org:

> ./prometheus_exporter.py
2023-02-20T22:45:23.327849: ara Prometheus exporter listening on http://0.0.0.0:8000/metrics
2023-02-20T22:45:23.328241: collecting playbook metrics  # <-- with limit=1000
2023-02-20T22:45:59.038076: parsing metrics for 3641 playbooks
2023-02-20T22:45:59.128715: finished updating playbook metrics
2023-02-20T22:45:59.133970: collecting task metrics  # <-- with limit=2500
2023-02-20T23:18:35.283720: parsing metrics for 557031 tasks
2023-02-20T23:18:40.065098: finished updating task metrics
2023-02-20T23:18:40.378556: collecting host metrics  # <-- with limit=2500
2023-02-20T23:18:42.111483: parsing metrics for 9983 hosts
2023-02-20T23:18:42.238810: finished updating host metrics
2023-02-20T23:19:12.240766: collecting playbook metrics
2023-02-20T23:19:12.399595: finished updating playbook metrics
2023-02-20T23:19:12.399644: collecting task metrics
2023-02-20T23:19:12.694654: finished updating task metrics
2023-02-20T23:19:12.694715: collecting host metrics
2023-02-20T23:19:12.741372: finished updating host metrics

A random screenshot of the data:
Screenshot from 2023-02-20 23-45-22

It can be slow to boot up the exporter at first because it needs to scrape everything. This is for playbooks, tasks and hosts -- I haven't yet touched results and there are 827 591 of those on the demo server.

This is multiple years of mostly integration test playbooks but the performance will largely depend on the scale and volume of data. We should probably have an argument that controls the amount of time we crawl data for -- like a default of 90 days for example ?
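
One way such an argument could look, as a minimal sketch only: the `--max-days` flag name is hypothetical, and the cutoff is simply turned into an ISO 8601 timestamp that could be passed as the `created_after` filter for the first crawl.

```python
import argparse
from datetime import datetime, timedelta, timezone


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of an exporter entry point")
    # Hypothetical flag name; 90 days is the default floated above.
    parser.add_argument(
        "--max-days",
        type=int,
        default=90,
        help="only crawl objects created within this many days",
    )
    return parser


def created_after_cutoff(max_days, now=None):
    """Return the ISO 8601 lower bound to use for the initial crawl."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=max_days)).isoformat()


# An empty argument list is parsed here so the sketch runs with its defaults.
args = build_parser().parse_args([])
print(created_after_cutoff(args.max_days))
```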

@dmsimard
Contributor Author

For a third iteration, I decided to move the standalone script into the ara CLI so it's possible to start the exporter by running ara prometheus.

I've opened up a branch and a PR so I will work there instead of the gist.

@netopsengineer

Hey @dmsimard, my familiarity with Prometheus comes from Nautobot, which is also Django based, and they leverage this library: django-prometheus

I believe it has a lot of built-in django metrics that you can enable by default, from API performance to model performance, db queries and the like. What seems to be the common thread is that they are able to quickly add new things that we might want to look at about the data Nautobot holds as well.

For Ansible, I could see things like job/task performance, a host failure hot list (which hosts are failing most often) and the number of times a playbook was run. Technically, I think you already have a lot of great metrics that you show in various ways, but exporting to prometheus lets folks use something like Grafana to graph the things that are important to them, with minimal effort and without having to write middleware. So in my opinion, just starting with some of the metrics about plays, hosts, tasks, etc. that you have now would get the ball rolling!

dmsimard pinned this issue Feb 24, 2023
@dmsimard
Contributor Author

> Hey @dmsimard, my familiarity with Prometheus comes from Nautobot, which is also Django based, and they leverage this library: django-prometheus
>
> I believe it has a lot of built-in django metrics that you can enable by default, from API performance to model performance, db queries and the like. What seems to be the common thread is that they are able to quickly add new things that we might want to look at about the data Nautobot holds as well.

👋 @netopsengineer

I have not considered the django side of the metrics yet but it's true that it can be useful and it's good to know, thanks !

If anyone wants to tackle this, they can go ahead as I continue iterating on the playbook metrics.

> For Ansible, I could see things like job/task performance, a host failure hot list (which hosts are failing most often) and the number of times a playbook was run. Technically, I think you already have a lot of great metrics that you show in various ways, but exporting to prometheus lets folks use something like Grafana to graph the things that are important to them, with minimal effort and without having to write middleware. So in my opinion, just starting with some of the metrics about plays, hosts, tasks, etc. that you have now would get the ball rolling!

Yes, we can consider that one of the objectives is pretty graphs about playbook metrics in grafana :p

While I have been a user (and operator) of both prometheus and grafana, I have been mostly privileged by the fact that so many exporters and graphs had already been written so until now I have not needed to truly learn how it all works underneath. There be dragons.

The challenge is parsing data into the right formats (and field types) as a proper time series with the timestamps provided by ara -- not the time the metric sample is taken. Then, we probably need to do some math and find out the right arcane grafana or promql query to produce the pretty graphs.
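
For context on supplied timestamps: the Prometheus text exposition format allows an optional timestamp, in milliseconds since the epoch, after the sample value. The sketch below formats one such line from a made-up playbook record; the record fields and the metric name `ara_playbook_duration_seconds` are hypothetical, and whether Prometheus accepts samples with older timestamps depends on how it is configured.

```python
from datetime import datetime, timezone


def sample_line(name, labels, value, when):
    """Format one sample with an explicit timestamp in milliseconds since the epoch."""
    label_text = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
    timestamp_ms = int(when.timestamp() * 1000)
    return "%s{%s} %s %d" % (name, label_text, value, timestamp_ms)


# Hypothetical record; the field names are assumptions made for this sketch.
playbook = {
    "status": "completed",
    "duration": 12.5,
    "ended": "2023-02-20T22:45:59+00:00",
}

ended = datetime.fromisoformat(playbook["ended"])
line = sample_line(
    "ara_playbook_duration_seconds",
    {"status": playbook["status"]},
    playbook["duration"],
    ended,
)
print(line)
```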

If anyone wants to help or point me in the right direction, head to the PR 🙏

@dmsimard
Contributor Author

dmsimard commented Feb 24, 2023

I have come across this insightful mailing list thread about ingesting metrics with supplied timestamps: https://groups.google.com/g/prometheus-users/c/YqFc1MZLCsM

There is a suggestion to try Histograms instead of Gauges so I will look into that next.
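
To make the difference with a Gauge concrete, here is a small standard-library sketch of what a histogram tracks: cumulative counts per upper bound, plus a running sum and count. The bucket bounds and the sample durations are made up for illustration, and this is not how the exporter in the PR is implemented.

```python
from bisect import bisect_left

# Hypothetical upper bounds, in seconds.
BUCKETS = [0.5, 1.0, 2.5, 5.0]


def histogram(durations, buckets=BUCKETS):
    """Return cumulative bucket counts plus the sum and count of observations."""
    per_bucket = [0] * (len(buckets) + 1)  # the last slot is the +Inf bucket
    for duration in durations:
        # bisect_left keeps a value equal to a bound inside that bound's bucket.
        per_bucket[bisect_left(buckets, duration)] += 1

    cumulative, running = {}, 0
    for bound, observed in zip(buckets + [float("inf")], per_bucket):
        running += observed
        cumulative[bound] = running
    return {"buckets": cumulative, "sum": sum(durations), "count": len(durations)}


# Made-up task durations in seconds.
result = histogram([0.16, 0.64, 1.84, 3.11, 4.94])
for bound, total in result["buckets"].items():
    print("le=%s %d" % (bound, total))
```

Because the buckets are cumulative, each one counts every observation at or below its bound, which is what lets queries estimate percentiles from them.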

Edit: some additional reading I've come across about setting timestamps on metrics:
