BUG: Enable prometheus metrics #3675

Open
lyz-code opened this issue Apr 4, 2024 · 14 comments
Labels
bug Things that should work, but don’t

Comments

@lyz-code

lyz-code commented Apr 4, 2024

Describe the bug
I've seen that Prometheus metrics have been available for a while but I'm not able to make them work.

To Reproduce
Steps to reproduce the behavior:

  1. Set PROMETHEUS_ENABLED=true in your aleph.env file and restart Aleph
  2. docker exec -it aleph_api_1 bash
  3. curl http://localhost:9100
  4. See error curl: (7) Failed to connect to localhost port 9100: Connection refused

Expected behavior
Prometheus metrics are served on port 9100 and can be fetched

Aleph version
3.15.5

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
I'm not able to unset the command directive in the docker-compose file; maybe that's preventing the Prometheus metrics server from being loaded.

@lyz-code lyz-code added bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team labels Apr 4, 2024
@tillprochaska
Contributor

Hi, sorry for the confusion. This is indeed related to an incorrect default command specified in the Dockerfile. I’ve fixed this in 0a154ff, but the fix hasn’t been released yet.

In the meantime, can you try setting the following command in docker-compose.yml and let me know if that works for you?

gunicorn --config /aleph/gunicorn.conf.py --workers 6 --log-level debug --log-file -

This should be roughly equivalent to the command that was previously specified in docker-compose.yml, except that a separate Gunicorn configuration file is loaded. This file makes sure that Gunicorn binds to port 9100 when Prometheus metrics are enabled.

(If you have adjusted some of the Gunicorn configuration values such as the number of workers or the log level, that’s fine -- the only thing that’s important is that you specify the --config flag.)
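
For reference, the override in docker-compose.yml could look roughly like this (just a sketch; the service name and the worker count are taken from the existing setup and may differ in yours):

api:
  command: >-
    gunicorn --config /aleph/gunicorn.conf.py
    --workers 6 --log-level debug --log-file -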

Sorry again for the inconvenience! (We also have documentation about this feature in the works.)

@tillprochaska
Contributor

@lyz-code Here’s a link to the WIP documentation, but please take it with a grain of salt, as it is still a work in progress: https://github.com/alephdata/aleph/blob/docs/tech-docs/docs/src/pages/developers/how-to/operations/prometheus/index.mdx

If you run into other problems, please let me know. I’m happy to help and will make sure to update the documentation accordingly.

@lyz-code
Author

lyz-code commented Apr 4, 2024

Hi @tillprochaska, first of all, thank you so much for the Prometheus work; it looks very promising. I haven't seen many applications with such detailed app metrics, so congratulations.

I've followed your instructions and now I'm seeing the following error on the API:

Error: '/aleph/gunicorn.conf.py' doesn't exist

I didn't set any Gunicorn configuration myself.

@tillprochaska
Contributor

@lyz-code I think you’re onto something. It seems there was a mistake in our release process. I’ll let you know when I know more.

@stchris stchris removed the triage These issues need to be reviewed by the Aleph team label Apr 9, 2024
@tillprochaska
Contributor

tillprochaska commented Apr 10, 2024

Hi @lyz-code, sorry, just a quick update: This is indeed an issue with the 3.15.5 release. While we did include the Prometheus feature in the release candidates for 3.15.5, we made a mistake when releasing 3.15.5 and so it’s not actually included in that release. We’ll try to do a proper, new release soon.

@lyz-code
Author

lyz-code commented Apr 25, 2024

Hi @tillprochaska, I've seen that 3.15.6 didn't fix the bug. I know you didn't say it would, but I wanted to try :P. FYI, I'm seeing another error when spawning the exporter on the latest version.

exporter_1       | [2024-04-25 09:16:33 +0000] [8] [ERROR] Exception in worker process
exporter_1       | Traceback (most recent call last):
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/arbiter.py", line 609, in spawn_worker
exporter_1       |     worker.init_process()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 134, in init_process
exporter_1       |     self.load_wsgi()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 146, in load_wsgi
exporter_1       |     self.wsgi = self.app.wsgi()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/base.py", line 67, in wsgi
exporter_1       |     self.callable = self.load()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/wsgiapp.py", line 58, in load
exporter_1       |     return self.load_wsgiapp()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
exporter_1       |     return util.import_app(self.app_uri)
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/util.py", line 371, in import_app
exporter_1       |     mod = importlib.import_module(module)
exporter_1       |   File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
exporter_1       |     return _bootstrap._gcd_import(name[level:], package, level)
exporter_1       |   File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
exporter_1       |   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
exporter_1       |   File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
exporter_1       |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
exporter_1       |   File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
exporter_1       |   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
exporter_1       |   File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
exporter_1       | ModuleNotFoundError: No module named 'aleph.metrics'

I've also seen that the port suggested in the docker-compose file for the aleph_exporter is 9100. That one is usually taken by node_exporter, so maybe it's better to use another one by default.
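
For what it's worth, remapping only the host side of the port in docker-compose.yml avoids the clash without any changes inside the container. Roughly like this (the service name is taken from the Compose logs above and may differ in other setups):

exporter:
  ports:
    - "9101:9100"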

@tillprochaska
Contributor

I've seen that 3.15.6 didn't fix the bug. I know you didn't say it would, but I wanted to try

Yes, you’re right! 3.15.6 is a security patch release, so we decided to not include anything else besides these patches. I’ll post an update here once the Prometheus feature is properly released.

@Rosencrantz Rosencrantz added this to the 3.15.7 milestone May 7, 2024
@tillprochaska
Contributor

Sorry for the slow response. I’ve published a release candidate for a new release that should fix this issue. A final release will hopefully follow soon.

Note that if you want to test this release candidate you might need to adjust your docker-compose.yml file again to remove the command override for the api service. Also see the Compose config at the 3.17.0-rc1 tag.

@tillprochaska
Contributor

This should be resolved in the latest release (3.17.0). Keeping this open as I haven’t yet thought about your suggestion regarding the default port:

I've also seen that the port suggested in the docker-compose file for the aleph_exporter is 9100. That one is usually taken by node_exporter, so maybe it's better to use another one by default.

@lyz-code
Author

Hi @tillprochaska, 3.17.0 is able to start (•‿•) but I'm seeing unexpected metrics results:

$: docker exec -it aleph_exporter_1 bash
root@645beaceaa79:/aleph# curl http://localhost:9100/metrics

# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="10",version="3.8.10"} 1.0
# HELP aleph_system_info Aleph system information
# TYPE aleph_system_info gauge
aleph_system_info{aleph_version="3.17.0",ftm_version="3.5.9"} 1.0
# HELP aleph_users Total number of users
# TYPE aleph_users gauge
aleph_users 11.0
# HELP aleph_collections Total number of collections by category
# TYPE aleph_collections gauge
aleph_collections{category="casefile"} 7.0
aleph_collections{category="leak"} 2.0
# HELP aleph_collection_users Total number of users that have created at least one collection
# TYPE aleph_collection_users gauge
aleph_collection_users{category="casefile"} 4.0
aleph_collection_users{category="leak"} 2.0
# HELP aleph_entitysets Total number of entity set by type
# TYPE aleph_entitysets gauge
# HELP aleph_entityset_users Number of users that have created at least on entity set of the given type
# TYPE aleph_entityset_users gauge
# HELP aleph_bookmarks Total number of bookmarks
# TYPE aleph_bookmarks gauge
aleph_bookmarks 0.0
# HELP aleph_bookmark_users Number of users that have created at least one bookmark
# TYPE aleph_bookmark_users gauge
aleph_bookmark_users 0.0
# HELP aleph_active_datasets Total number of active datasets
# TYPE aleph_active_datasets gauge
aleph_active_datasets 0.0
# HELP aleph_tasks Total number of pending or running tasks in a given stage
# TYPE aleph_tasks gauge

As you can see, some metrics are missing, such as aleph_tasks, aleph_entitysets or aleph_entityset_users. Overall there seem to be fewer metrics than the ones listed in the PR. Am I doing something wrong?

Also, have you already created a Grafana dashboard to visualise the metrics? If not, shall we track this in this issue or should I open a new one?

Thanks as always

@tillprochaska
Contributor

@lyz-code Thanks for the feedback!

The metrics you’re looking at are only the metrics exported by the separate exporter. However, the api, ingest-file, and worker containers also export metrics. The api container exports metrics related to the Aleph API, the ingest-file container exports metrics related to ingest tasks, and the worker exports metrics related to other background tasks.

docker compose exec api curl http://localhost:9100/metrics, docker compose exec ingest-file curl http://localhost:9100/metrics etc. should give you those other metrics.
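
If it helps, a minimal prometheus.yml scrape_configs sketch for this setup could look roughly like the following, assuming Prometheus runs on the same Docker network and can reach the Compose services by name (job names and targets are just examples):

scrape_configs:
  - job_name: aleph-api
    static_configs:
      - targets: ["api:9100"]
  - job_name: aleph-worker
    static_configs:
      - targets: ["worker:9100"]
  - job_name: aleph-ingest-file
    static_configs:
      - targets: ["ingest-file:9100"]
  - job_name: aleph-exporter
    static_configs:
      - targets: ["exporter:9100"]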

Also, have you already created a grafana dashboard to visualise the metrics? If not, shall we track this in this issue or should I open a new one?

I fiddled around a little bit with a Grafana dashboard based on these metrics, but we’re not using Grafana internally (at least for now), so there are no immediate plans to publish/maintain an official dashboard. You can however open a separate feature request issue for this. And if you happen to build a dashboard yourself, there might be other Aleph admins who find this useful as well.

@lyz-code
Author

Hi @tillprochaska, thanks for the quick answer. Would it be possible for the aleph_exporter to extract the metrics from the rest of the containers? Right now our production architecture is as follows:

  • Aleph is deployed on a different instance from the one where Prometheus is deployed
  • Our Aleph docker-compose setup uses two networks, aleph and proxy. Only the ui has access to proxy, so we reduce the attack surface.
  • On Aleph's host we only expose port 443 of the proxy and, now, port 9101 of the aleph_exporter

The only way for us to extract the metrics from the containers would be to open a port per container (api, worker, ingest-file) so that the external Prometheus can scrape the metrics. I feel this is not desirable because we'd be exposing internal components of the application to another instance.

@tillprochaska
Contributor

@lyz-code Hey, I can’t give you a definitive answer here, but as far as I know the recommended solution is to run Prometheus in the same network as the instances you’re monitoring.

You might be able to run a Prometheus instance on the same host to aggregate metrics from all Aleph containers, then make use of federation to allow your main Prometheus instance to scrape them. PushProx may be another solution. However, I do not have any personal experience with any of these approaches.
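
For completeness, an (untested) federation scrape job on your main Prometheus instance could look roughly like this; the hostname, port and match expression are placeholders:

scrape_configs:
  - job_name: federate-aleph
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~"aleph.*"}'
    static_configs:
      - targets: ["aleph-host.example.com:9090"]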

It’s unfortunately unlikely that we will implement aggregation of metrics from other containers in the exporter ourselves, as this would add lots of complexity and likely lead to problems for example when running multiple worker instances or automatically scaling the number of worker instances.

@lyz-code
Author

Thanks @tillprochaska, what you say makes sense. I'm fine with closing the issue (•‿•)
