Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add initial support for prometheus metrics. #1465

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

ykuksenko
Copy link

@ykuksenko ykuksenko commented Mar 29, 2023

Could someone look over this and let me know if this okay for adding initial support for Prometheus metrics?

  1. At the moment this includes a minimal amount of metrics as examples. I may tweak the metric names a bit after I review the Prometheus metrics best practices document again.
  2. The endpoint /metrics is used as that is traditional for Prometheus environments. I can change it back to /core/metrics if required.
  3. Do I need to add any tests for this?
  4. I am not sure if @csrf_exempt is actually needed or wanted here; the monitoring endpoint made it seem so, to me.
  5. Is the for loop at the end okay or should I do this differently? (ie direct calls each time, or should it be a comprehension?)
  6. Is there are reason not to expose some of the version info? The reason it is there now is to track updates in my metrics environment. I think this may be useful if I find clients dropping off after an upgrade for example.
  7. Is there any other feedback?

Use instructions:

  1. Setup MON_TOKEN as in Tips and Tricks.
  2. Test with curl command curl -s -H "Authorization: Bearer $MON_TOKEN" https://api.trmm.example.com/metrics
  3. Setup Prometheus job with:
- job_name: trmm
  scrape_interval: 60s
  #metrics_path: /core/status/
  scheme: https
  bearer_token: $MON_TOKEN
  static_configs:
  - targets:
    - api.trmm.example.com

edit: updated instructions to share json status endpoint. instructions also in commit message now.

@CLAassistant
Copy link

CLAassistant commented Mar 29, 2023

CLA assistant check
All committers have signed the CLA.

@wh1te909
Copy link
Member

just add this to the existing monitoring endpoint instead of duplicating all the code. just add whatever fields you want to the response. doesn't make sense to duplicate 90% of the code lol is exactly the same

@codecov
Copy link

codecov bot commented Mar 30, 2023

Codecov Report

Merging #1465 (15c155b) into develop (b5eed69) will increase coverage by 0.52%.
The diff coverage is 98.41%.

@@             Coverage Diff             @@
##           develop    #1465      +/-   ##
===========================================
+ Coverage    81.13%   81.65%   +0.52%     
===========================================
  Files          116      116              
  Lines         7791     7829      +38     
===========================================
+ Hits          6321     6393      +72     
+ Misses        1470     1436      -34     
Impacted Files Coverage Δ
api/tacticalrmm/core/views.py 79.79% <97.56%> (+7.80%) ⬆️
api/tacticalrmm/core/decorators.py 100.00% <100.00%> (+63.63%) ⬆️
api/tacticalrmm/tacticalrmm/settings.py 92.85% <100.00%> (ø)

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ykuksenko ykuksenko force-pushed the prom_metrics branch 2 times, most recently from 171886e to af42a8f Compare March 30, 2023 15:52
@wh1te909
Copy link
Member

thank you I am busy atm but will answer all your questions hopefully over weekend but looking good so far nice!

@ykuksenko
Copy link
Author

Thanks for the update!

No hurry here. One more question, in case I do not figure it out before you have time: Is there a redacted example tacticalrmm.settings file available somewhere? I not sure how to setup a traditional install vm to run tests locally yet. It fails early creating the db.

@wh1te909
Copy link
Member

use this ansible role to setup a debian 11 vm for trmm development: https://github.com/amidaware/tacticalrmm/tree/develop/ansible

here's example of local_settings.py from the above role

SECRET_KEY = "changeme"
DEBUG = True
ALLOWED_HOSTS = ["api.example.com"]
ADMIN_URL = "admin/"

CORS_ORIGIN_ALLOW_ALL = True

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "tacticalrmm",
        "USER": "tactical",
        "PASSWORD": "changeme",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

REDIS_HOST = "localhost"
CERT_FILE = "/etc/ssl/certs/tacticalpub.pem"
KEY_FILE = "/etc/ssl/certs/tacticalpriv.pem"
MESH_USERNAME = "tactical"
MESH_SITE = "https://mesh.example.com"
MESH_TOKEN_KEY = "changeme"

@ykuksenko
Copy link
Author

Just seeing your message now. I will check out the role.

I initially misunderstood that tacticalrmm.settings was actually referring to a python module rather than a file with that name.

I figured out why my tests were failing on my existing VM. The createdb permission was missing in pgsql.

For others that run across this my tests failing to run was resolved with:

#login to postgres as superuser.
sudo -u postgress psql
#list postgres users.
\du
#find your $USER and add the createdb permission.
ALTER USER $USER CREATEDB;

@ykuksenko ykuksenko force-pushed the prom_metrics branch 3 times, most recently from f969e54 to 3a8decc Compare March 30, 2023 22:03
@ykuksenko
Copy link
Author

For when you have time:

The existing /core/status/ (core/views.py) code checks certificate validity. This part of the code has no tests currently. I included this metric in the new Prometheus code as it could be useful. Now I do not know how the test case should be handled. There is no certificate to read in the upstream repo. So when testing I get a file not found error during github ci runs but not locally.
How should I handle this? Here are a few options I can think of.

  1. I can generate temporary certificates for testing and use them. The downsides:
  • Now I have possible additional dependencies. (should I use the openssl binary on the system if it exists or use a python library?)
  1. I can add code to detect testing and skip that code path. The downsides:
  • Do not have complete code coverage.
  • There is testing code in production code.
  1. I can ignore the bit of existing code I modified and not do the certificate check in the Prometheus code. The downsides:
  • Old modified code has no tests
  • No possibly useful certificate check in Prometheus code.

Thanks!

@wh1te909
Copy link
Member

Look into python mock, check codebase is full of them. Just mock the return values of the functions that check certs.

@ykuksenko ykuksenko force-pushed the prom_metrics branch 4 times, most recently from 2e02f78 to c63839c Compare April 1, 2023 20:22
---
Metrics for the prometheus endpoint are slightly different from the json one.
- I do not want to break things for those that use the old endpoint.
- Some metrics are hard to utilize in Prometheus without a format
  change. (example: certificate timestamp)
  - Usually blackbox_exporter would cover certificate monitoring, but
    since people use TRMM with multiple proxies both internal monitoring
    and blackbox_exporter may be useful.
- Memory and disk space was not transfered because they are system wide
  and Prometheus users likely will have node_exporter installed for
  that.
- Prometheus format has slightly more detailed metrics.

---
Instructions:
1. Setup MON_TOKEN variable in
   `/rmm/api/tacticalrmm/tacticalrmm/local_settings.py` for your bearer_token.
   This is the same as for the json endpoint. See [Tips and Tricks](https://docs.tacticalrmm.com/tipsntricks/#monitor-your-trmm-instance-via-the-built-in-monitoring-endpoint).
2. Test with curl command:
   `curl -s -H "Authorization: Bearer $MON_TOKEN" https://api.trmm.example.com/core/status/`
3. Setup Prometheus job with:
```
- job_name: trmm
  scrape_interval: 60s
  metrics_path: /core/status/
  scheme: https
  bearer_token: $MON_TOKEN
  static_configs:
  - targets:
    - api.trmm.example.com
```
@ykuksenko
Copy link
Author

I did look at mocking/patching the certificate loading bits and was successful but found that supplying a small fake cert tests additional untested areas so I used that route.

I have been trying to get the tests to cover the impossible to reach else case in core/views.py but have not been able to find a way to patch the decorator before it gets loaded.
Since the decorator is only used in one spot I think merging it into the view instead may be a good solution. Is there another way I may not be aware of to deal with this situation?

Thank you for the guidance.

@wh1te909
Copy link
Member

wh1te909 commented Apr 3, 2023

you're not hitting that line in the tests cuz as you wrote in the comment, it's not possible to get there. so just remove that last else statement from the view. it's never gonna hit there anyway, and your tests already prove that lol

@wh1te909 wh1te909 force-pushed the develop branch 2 times, most recently from 942055f to 903a2d6 Compare June 25, 2023 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants