Unreliable backup command #25592

kernstock · 2024-11-27T09:07:39Z

The CLI backup command is not reliable and does not provide a usable UI, for the following reasons:

It does not implement the output options that backup scripts commonly rely on; it has a single output mode: everything. This makes it impractical for an administrator to verify a backup's completeness and integrity, as they would have to read the lengthy output of every backup created. For example, a quiet mode that only produces output on warnings or errors could be used in backup scripts and cron jobs to notify the admin when something goes wrong.
Backing up frequently creates a lot of data, as each snapshot is chunked independently and data shared between subsequent snapshots cannot be reused. In instances with many or large buckets, this fills storage quickly and forces admins to keep only a small number of snapshots whose contents are heavily redundant. In the extreme, only a single snapshot is kept and removed before the next backup; if that backup goes wrong and the instance fails, all data is lost.
InfluxDB does not enter a backup mode that locks bucket access during a backup. It therefore keeps accepting read and write requests during the backup (as well as during a restore). Many databases implement such a mode for a reason: to prevent inconsistent data from being backed up. To work around this, an admin would have to ensure that all access to the instance is blocked during backup/restore, which can be quite a task when the instance is accessible from a large local network and not all access goes through a single entry point such as a web server.
In our instance (with a data dir of ~50GB) the backup is really slow (see logs below).
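Until the CLI offers something like the quiet mode suggested in the first point, a wrapper script can approximate it. The following is a minimal sketch, not part of the influx CLI: the names `run_quiet` and `notable_lines` are made up for illustration, and it assumes that routine progress lines contain ` INFO: ` as in the logs below. It stays silent when the command succeeds with only INFO lines and otherwise returns the full captured output, so a cron job only produces mail when something needs attention.

```python
import subprocess


def notable_lines(output):
    """Return the lines that are not routine 'INFO:' progress messages."""
    return [ln for ln in output.splitlines() if ln.strip() and " INFO: " not in ln]


def run_quiet(cmd):
    """Run cmd and return (exit_code, report).

    report is empty when the command exits with 0 and prints only INFO lines;
    otherwise it holds the full captured output so that nothing is lost.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture both streams in their original order
            text=True,
        )
    except OSError as exc:
        # The binary is missing or not executable; report it like a failure.
        return 127, str(exc)
    if proc.returncode == 0 and not notable_lines(proc.stdout):
        return 0, ""
    return proc.returncode, proc.stdout
```

A backup script could call `run_quiet(["influx", "backup", "/path/to/backup"])` (the target path is a placeholder), print the report if it is non-empty, and exit with the returned code.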

The specific issue that we ran into: After a server reboot, our instance was unresponsive, with no clue in the logs as to what was wrong. When trying a test restore to a blank instance (which we had tested earlier this year), we saw that about 50% of the weekly snapshots were corrupt (the data chunks in .tar.gz as well as the boltDB and SQLite dumps were present, but the manifest files were somehow missing). The backup output, piped from the server backup script to a log file, only showed an error message that the InfluxDB API could not be reached; we had forgotten to pipe the output in append mode and rotate the log file. The most recent consistent weekly snapshot was 8 weeks old. As the instance could not be repaired with a reasonable amount of work, we decided to restore it and accept the loss of 8 weeks of data.
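A check like the following, run after every backup, would have flagged the broken snapshots much earlier. It is only a sketch under assumptions: it assumes each snapshot lives in its own directory under a common root and that a complete snapshot contains at least one non-empty file ending in `.manifest`; the function names are hypothetical. It does not validate the manifest contents or the data chunks.

```python
from pathlib import Path


def missing_manifest(backup_dir):
    """Return True if backup_dir holds no non-empty '*.manifest' file."""
    manifests = [
        p for p in Path(backup_dir).glob("*.manifest")
        if p.is_file() and p.stat().st_size > 0
    ]
    return not manifests


def incomplete_backups(root):
    """List the snapshot directories directly under root that lack a manifest."""
    return sorted(
        d.name for d in Path(root).iterdir()
        if d.is_dir() and missing_manifest(d)
    )
```

A non-empty result from `incomplete_backups` can be treated as a failure by the backup script, independent of whatever the backup command itself printed.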

Environment info:

InfluxDB version: InfluxDB v2.7.10 (git: f302d97) build_date: 2024-08-16T20:19:28Z
InfluxDB CLI version: Influx CLI dev (git: a79a2a1b825867421d320428538f76a4c90aa34c) build_date: 2024-04-16T14:34:32Z
System info: Linux 5.15.0-126-generic x86_64

Logs:

This is the routine output of a recent backup:

2024/11/23 02:30:03 INFO: Downloading metadata snapshot
2024/11/23 02:30:03 INFO: Backing up TSM for shard 5031
2024/11/23 02:30:03 INFO: Backing up TSM for shard 5032
2024/11/23 02:30:03 INFO: Backing up TSM for shard 4980
[This goes on...]
2024/11/23 03:26:55 INFO: Backing up TSM for shard 6829
2024/11/23 03:26:56 INFO: Backing up TSM for shard 6886
2024/11/23 03:27:11 INFO: Backing up TSM for shard 6887
2024/11/23 03:29:32 INFO: Backing up TSM for shard 6952
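To put a number on "really slow", the first and last timestamps of the log above can be subtracted. This is a rough sketch: the last log line marks the start of the final shard, not the end of the backup, so the result is a lower bound on the run time, and the throughput figure assumes that the whole ~50GB data dir (taken as 50e9 bytes) was transferred.

```python
from datetime import datetime


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps that start two influx log lines."""
    fmt = "%Y/%m/%d %H:%M:%S"
    start = datetime.strptime(first_line[:19], fmt)  # timestamp is the first 19 chars
    end = datetime.strptime(last_line[:19], fmt)
    return int((end - start).total_seconds())


secs = elapsed_seconds(
    "2024/11/23 02:30:03 INFO: Downloading metadata snapshot",
    "2024/11/23 03:29:32 INFO: Backing up TSM for shard 6952",
)
print(secs)                           # 3569, i.e. 59 min 29 s
print(round(50e9 / secs / 1e6, 1))    # about 14.0 MB/s under the assumptions above
```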
