Add a health check on the process that allows to test reachability #384

Ongy · 2024-12-10T14:45:51Z

Describe the feature

There should be an additional endpoint provided by the process (e.g. /healthz) that can be probed to detect general functioning.

Describe the reason for such feature

This allows us to employ better monitoring, for alerting when the server side goes offline for some reason. Additionally it allows to plug into load balancing services that are aware of the health of their backends to provide dynamic failover.

Describe alternatives you've considered

None.

erebe · 2024-12-11T17:07:25Z

Hello,
It can be possible, but for now there is no such metric to tell if the server is correctly running, beside that the listening port is open, which can be tested without an http healthcheck endpoint.

There is no limit of nb of connection or whatsoever for now.
Have you seen a case where the server was not responding ?

Ongy · 2024-12-11T17:25:43Z

The main reason I'd like a dedicated endpoint is to avoid side effects of the probing.

I think both the client side and the server side return some non-200 HTTP Status on a GET (instead of upgrade/connect) and log error messages.
When we tried t use the kubernetes TCP-probe on the client, it sent the 0.0.0.0 connection requests to the server that I mentioned in a PR previously.

We've not seen it not react at all, but the cloud native tooling is better at probing HTTP endpoints than just probing ports, and we've run into a situation where it failed due to our own mistakes.

erebe · 2024-12-12T19:53:21Z

I understand, at some point I thought of adding a Prometheus /metrics endpoint to get some metrics.
I would rather put this /healthz and /metrics on a separate port, because otherwise it would force people to use a reverse proxy. Starting the server without a reverse proxy would expose those 2 path to the wild.

That something I can plan to add in the future, but for the moment I don't have much free time on hand I can't commit on anything

Ongy · 2024-12-13T06:39:36Z

separate port is fine, though I'd like to have it in the same main execution logic, so in case anything's stuck it's stuck as well.

WRT. time: This is half a request, half coordination if I want to start something. so there's a decent chance I'll provide at least /healthz, maybe /metrics if I can think of good ones.

Ongy added the enhancement label Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a health check on the process that allows to test reachability #384

Add a health check on the process that allows to test reachability #384

Ongy commented Dec 10, 2024

erebe commented Dec 11, 2024

Ongy commented Dec 11, 2024

erebe commented Dec 12, 2024 •

edited

Loading

Ongy commented Dec 13, 2024

Add a health check on the process that allows to test reachability #384

Add a health check on the process that allows to test reachability #384

Comments

Ongy commented Dec 10, 2024

Describe the feature

Describe the reason for such feature

Describe alternatives you've considered

erebe commented Dec 11, 2024

Ongy commented Dec 11, 2024

erebe commented Dec 12, 2024 • edited Loading

Ongy commented Dec 13, 2024

erebe commented Dec 12, 2024 •

edited

Loading