Hi, I'm struggling to find coherent guidance on how to recover from the failure mode described in #10272 without losing all historical data.
First, let me describe my case. prometheus --version:
prometheus, version 2.37.0 (branch: HEAD, revision: b41e075)
build user: root@0ebb6827e27f
build date: 20220714-15:13:18
go version: go1.18.4
platform: linux/amd64
The config is nearly stock; we've only set storage.tsdb.path, storage.tsdb.retention.time=90d, and storage.tsdb.retention.size=80GB per our specific constraints. Storage is backed by an AWS EBS volume.
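For reference, a minimal sketch of how it's launched (paths are illustrative; only the three flags above deviate from the defaults):

```bash
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=80GB
```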
Now, just like in #10272, this installation didn't have its per-process open-FD limit raised — eventually triggering too many open files during compaction.
Following that, Prometheus spiraled into a loop:

1. get auto-restarted after the panic,
2. replay the WAL, which keeps growing,
3. trigger compaction,
4. panic with too many open files.
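For what it's worth, this is how we're checking and raising the limit now (a sketch; the systemd unit name and the limit value are assumptions about a typical setup):

```bash
# Effective limit of the running process (assumes a single prometheus process)
grep 'open files' /proc/$(pgrep -x prometheus)/limits

# Raise it if the service is managed by systemd (unit name assumed):
sudo systemctl edit prometheus    # add under [Service]: LimitNOFILE=65536
sudo systemctl daemon-reload
sudo systemctl restart prometheus
```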
I'm attaching a log, slightly edited for size — prometheus.log.gz — showing this loop, plus the executions before and after it. The last successful checkpoint was for WAL segment 36244 at 2024-02-06T13:00:15.534Z.
We still have the file checkpoint.00036244/00000000; it's 15M in size. However, in the absence of a well-defined recovery procedure, our DevOps engineer decided to drop a prefix of the older WAL segments in order to recover the system from ENOSPC exhaustion (and, for example, to be able to raise the FD limit at all). As a result, we've only preserved segments 00036554–00037208, 77GB in total.
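To be concrete about what "dropping a prefix" means here, this is roughly the operation that was performed (destructive — documenting it, not recommending it; paths are illustrative):

```bash
# Remove WAL segments numbered below the oldest one we decided to keep,
# to free space on the volume. checkpoint.00036244/ was left in place.
cd /data/prometheus/wal
for seg in 000*; do
  if [[ "$seg" < "00036554" ]]; then
    rm -f "$seg"
  fi
done
```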
Of course, dropping the wal directory (or rather, renaming it away) makes the service start — but Prometheus then misses all previously recorded samples in all time series. Isn't that data supposed to be persisted somewhere else (the "blocks"?) rather than in the WAL? Why does removing the WAL also remove samples with timestamps far earlier than those covered by the WAL segments?
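To double-check that assumption, here's how I've been inspecting what's still on disk (a sketch; the data path is illustrative, and I'm assuming promtool's tsdb subcommands behave as documented):

```bash
# Persisted blocks are the ULID-named directories that sit next to wal/
ls /data/prometheus/

# Each block records the time range it covers in its meta.json
grep -E '"(minTime|maxTime)"' /data/prometheus/01*/meta.json

# promtool can summarize the blocks it finds in a data directory
promtool tsdb list /data/prometheus
```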
Understandably, truncating the 120GB WAL by hand may have irreversibly corrupted it — I have no expectation of recovering data from the surviving portion (although it would be wonderful if Prometheus had tooling for that).
But I do expect to recover the samples from before the last checkpoint. It isn't obvious how to do that — any hints?
I've checked the storage operations guide in the official docs.
I've also checked the TSDB blog series by @codesome, which is certainly insightful, but doesn't give operational guidance on recovery.
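In the meantime, before experimenting any further, I've frozen a copy of everything that survived — the persisted blocks, the last checkpoint, and the remaining WAL segments — roughly like this (destination path is illustrative):

```bash
systemctl stop prometheus                        # make sure nothing writes to the directory
rsync -a /data/prometheus/ /backup/prometheus-$(date +%F)/
```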