Ideas to upper bound prometheus-server's memory consumption #2222
Comments
I am curious why this only seems to affect the pangeo-hubs cluster and not, say, the 2i2c cluster.
@yuvipanda I suspect a basic relation between the WAL and memory during startup, where the WAL size would depend on the amount of metrics collected, I assume. The amount of metrics is coupled to what's being scraped, and the amount of data scraped grows with the number of endpoints scraped, e.g. one node-exporter per node, including one per dask worker node.
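One rough way to see how the scraped endpoints translate into in-memory series (and therefore WAL size) is to ask Prometheus itself for a per-job series count. This is only a sketch, not something from this issue: it assumes `promtool` is shipped in the prometheus-server image and that the namespace, deployment, and container names match the `kubectl exec` command quoted later in this issue.

```
# Count active time series per scrape job; high counts point at the targets
# that contribute most to head memory and WAL replay cost.
# Assumes the server listens on localhost:9090 inside the pod.
kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- \
  promtool query instant http://localhost:9090 'count({__name__=~".+"}) by (job)'
```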
I think the approach of limiting the amount of metrics ingested is relevant, but I'll close this issue now; the other ideas were explored a bit.
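For the record, "limiting the amount of metrics ingested" usually maps to drop rules applied at scrape time, so unwanted series never reach the head or the WAL at all. A minimal sketch using plain Prometheus scrape-config syntax; the job name and metric regex below are illustrative placeholders, not taken from our actual config:

```yaml
scrape_configs:
  - job_name: node-exporter          # hypothetical job name
    metric_relabel_configs:
      # Drop metrics we never chart or alert on, before they are written
      # to the head/WAL; the regex is an example, not a recommendation.
      - source_labels: [__name__]
        regex: "node_scrape_collector_.*"
        action: drop
```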
I just slowly incremented prometheus-server to 20 GB of memory requests for the `pangeo-hubs` cluster. It appears that 18 GB wasn't sufficient, because memory peaked at close to 19 GB before it fell down to ~3-4 GB when `Head GC completed` was logged ~5 minutes after startup.

This prometheus-server had a `/data` folder mounted from the attached PVC that was 5.8 GB:

```
kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data
5.8G    /data
```
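Since the startup cost is driven by the WAL specifically rather than the whole TSDB, it may be worth measuring it on its own. A small sketch, assuming the default layout where the WAL lives in a `wal/` subdirectory of the data path:

```
# Size of the write-ahead log alone, versus the full data directory above.
kubectl exec -n support deploy/support-prometheus-server -c prometheus-server -- du -sh /data/wal
```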
The problem we have is that the write-ahead log (WAL) is read from disk during startup to rebuild the in-memory state of all collected metrics, as I understand it, and that takes a lot of memory. Actually, the real problem is that we can't know what this memory requirement is ahead of time, because it grows over time as more metrics are collected.
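Since we can't predict the requirement, the closest we can get is observing it during a restart: tail the startup logs for the WAL replay and `Head GC completed` messages while sampling the pod's actual memory usage. A sketch assuming metrics-server is installed (for `kubectl top`) and the same namespace, deployment, and container names as above:

```
# Follow startup logs for WAL replay / head GC related messages...
kubectl logs -n support -f deploy/support-prometheus-server -c prometheus-server | grep -iE 'wal|head gc'

# ...while sampling the pod's real memory usage in another terminal.
kubectl top pod -n support | grep prometheus-server
```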
Ideas
Example of logs from a successful startup
Related