System-wide hang after a few weeks #4131
Comments
That error is rather unusual. I haven't seen that specific error somewhere else so far.
A kernel crash is a problem of the underlying system, not of restic.
That indeed looks like a kernel crash.
Several terabytes will definitely require quite a bit of memory. But that would normally just result in restic getting killed by the kernel's OOM killer...
@MichaelEischer I'm just curious: why does restic need more memory for larger backups? Is it because of the metadata or because of the size of the files? 🤔
@AlBundy33 Restic uses an index which keeps information about each stored file chunk and which is used to deduplicate data. For performance and simplicity, that index is fully loaded into memory. As each file chunk is listed in the index, it will grow as more data is added to a repository. The file metadata inside a folder also requires additional memory, but unless you have a folder which directly contains hundreds of thousands of files (not recursively), the main memory user will be the index.
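To make the "one index entry per chunk" point concrete, here is a toy sketch in POSIX shell. It is not how restic works internally: restic uses content-defined chunking and its own index format, whereas this sketch assumes fixed-size chunks and a hypothetical helper named `count_chunks`. It only shows that the number of distinct chunk hashes to remember grows with the amount of distinct data.

```shell
#!/bin/sh
# Toy illustration only (assumption: fixed-size chunks; restic's real
# chunker is content-defined and its index format is different).
# Cuts a file into chunks and counts how many distinct chunk hashes a
# deduplicating index would have to remember.

# count_chunks FILE CHUNK_BYTES  ->  prints "TOTAL UNIQUE"
count_chunks() {
    file=$1
    chunk_bytes=$2
    tmpdir=$(mktemp -d) || return 1
    # Write each chunk to its own temporary file.
    split -b "$chunk_bytes" "$file" "$tmpdir/chunk."
    total=$(find "$tmpdir" -type f | wc -l | tr -d ' ')
    # Hash every chunk; identical chunks collapse to one hash.
    unique=$(find "$tmpdir" -type f -exec sha256sum {} + \
        | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
    rm -rf "$tmpdir"
    echo "$total $unique"
}
```

For example, `count_chunks some-file 1048576` would report how many 1 MiB chunks the file splits into and how many of them are distinct. Only the distinct ones would need an index entry, which is why a repository holding several terabytes of mostly unique data ends up with a large in-memory index.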
So far it hasn't crashed a third time, I'll try to update and see if I can mount some other way if it does crash again...
So, I ran into a restic hang recently (version 0.15.0) and narrowed it down to a restic interaction with Transparent Huge Pages. I'm not 100% up to speed on THP support in Go, but it seems like restic wants to use a hugepage and that amount of contiguous memory is not immediately available. The system just... hangs a bit while kcompactd and kswapd try to rearrange memory enough to make this possible. So, the behavior I see is that khugepaged and kcompactd slam up to 100% CPU usage. If I close some stuff down, I can get it to immediately wake up and restic can continue. Otherwise, allocations all over the place suffer dramatically while the kernel attempts to resolve the issue.
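For anyone wanting to check whether THP is in play on their own machine, a minimal sketch follows. The sysfs and procfs paths are the standard Linux ones; the helper name `thp_active_mode` is made up for this example, and whether changing the setting helps with restic is an assumption that nobody in this thread has confirmed.

```shell
#!/bin/sh
# The THP sysfs files list all modes and mark the active one in brackets,
# e.g. "always [madvise] never". This helper prints just the active mode.
# (thp_active_mode is a hypothetical name used only in this sketch.)
thp_active_mode() {
    sed -n 's/.*\[\([a-z+_-]*\)\].*/\1/p' "$1"
}

# Usage on a Linux host with THP support:
#   thp_active_mode /sys/kernel/mm/transparent_hugepage/enabled
#   thp_active_mode /sys/kernel/mm/transparent_hugepage/defrag
#
# Counters that show THP allocation and compaction activity over time:
#   grep -E '^(thp_|compact_)' /proc/vmstat
#
# One possible experiment (an assumption, not a confirmed fix): switch THP
# off until the next reboot and see whether the stalls go away.
#   echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

If the compaction counters in `/proc/vmstat` climb sharply while khugepaged and kcompactd are busy, that would be consistent with the behavior described above.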
I've had two more cases of the machine I'm trying to use for restic hanging during the backup. No smoking guns so far (no obvious ramping up of CPU load or memory usage leading up to the hang, nothing in the logs, etc). I didn't have a monitor hooked up so no idea if it was a kernel crash these times. Trying some of the suggestions above now.
Maybe some of the suggestions from https://forum.restic.net/t/server-unresponsive-during-restic-backups/5739/24 help?
Yeah, I'll try those too. I tried updating to the latest restic (downloaded the binary directly instead of using
That is very odd. Apparently the syscall to close one of the temporary files used by restic has hung for over 2 minutes. Either the system is swapping itself to death or something else is very wrong. Do you also have the start of the stacktrace output? I wonder why the stacktrace is printed at all.
I unfortunately wasn't logging stderr. I will fix that before I restart it. (Sorry, only get to work on this in my spare time.) For the earlier crash (where I didn't have a stack trace) I did have enough logs to know the machine was only at about 50% memory usage, so I doubt it was swapping to death. In any case I've now disabled the swap, to take that out of the equation.
Out of curiosity, is the restic cache enabled during your normal use of restic?
So, it happened again (and hung a while ago, I just didn't look into it until just now). I ran it like this:
The contents of
...though its timestamp is over a week before the system hung. I'm guessing something about my command above failed to report all the logs? What should I do to get more useful logging?
Oh and I have no swap at all on the machine, and nothing else was happening (it's just a stock Debian install with only restic running). Not sure what restic cache is.
If swap is already off, consider adding |
You could also try memtest, prime95 and a local repo to ensure that this is not a hardware or network issue.
Trying user mode memtester and mprime torture test now (memtest86+ would be trickier since this is a headless machine in a rack). If the machine survives a few days of that I'll try a local repo to see if that does anything different.
Well I ran memtester and the mprime torture test for about 28 days and they had no issues. Next step is checking a local repo.
Ok I'm trying a local backup. I have a 2TB local drive, and did:

    export RESTIC_REPOSITORY="/home/ianh/restic-local-backup"
    export RESTIC_PASSWORD_FILE="/home/ianh/restic/password.key"
    mkdir -p $RESTIC_REPOSITORY
    restic init
    GOMAXPROCS=1 nice ionice restic backup --read-concurrency 1 /mnt/scratch 2>&1 | tee -a log.current

We'll see what happens. This is backing up a subset of the files the previous attempts tried to back up because of the limited space on this volume. The source is still coming from a CIFS share.
Well it did the 800GB backup mostly without issue (well it said "no such file or directory" for two of the files but let's ignore that for now while our bar for "issue" is "entire machine hangs") while backing up from the CIFS share to a local disk... Gonna try some different files in case it's something about the files I'm backing up.
Yeah no problems there either (until it ran out of disk space). This suggests the problem may be specific to the AWS backend. I'll try using the Backblaze backend next...
Well here's an interesting data point. I'm pretty sure after the last backup attempt (which was not yet the one to Backblaze) I left the machine idle, and today I find it's gone offline again. It was online for literally months including several weeks of extensive activity without any trouble, then it hung a day or so after doing a backup. I have no explanation...
Maybe it's a CIFS problem after all, let's try extensive activity involving that drive but no restic...
Well I ran |
Going to leave the machine idle for a while now and see if it hangs like it did last month.
Well, it did. So I guess restic isn't to blame.
@Hixie Good luck with hunting the problem down, and thanks for keeping us up to date.
Sorry to reply in this old issue but did you ever solve this problem @Hixie? This seems almost exactly like an issue I'm having. That is, my system completely freezes periodically with no errors logged to the journal and keyboard input does not work. After basically replacing the entire system piece by piece (thinking it was hardware related) I'm now seeing a heavy correlation between my restic runs and the freezes. I am backing up to Backblaze.
Every few months I have system hangs, but I haven't been able to pin the problem on restic. I've seen it happen with the machine idle and I've seen it happen while rsync is running.
My guess would be that your system runs out of memory. There's a certain chance that a Linux system may freeze in that case. Depending on how you run restic, it might be possible to e.g. limit the amount of memory a systemd unit is allowed to use.
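As a rough sketch of that last suggestion: on a system using systemd with cgroup v2, `systemd-run --scope -p MemoryMax=...` starts a command inside a transient scope with a hard memory cap. The wrapper name `run_with_memory_cap`, the `DRY_RUN` switch, and the 2G figure are all assumptions made for this example, and depending on the setup the command may need root or `--user`.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command in a transient systemd scope with a
# hard memory cap. If the command exceeds the cap, it gets OOM-killed
# inside its own cgroup instead of dragging the whole machine down.
#
# run_with_memory_cap CAP COMMAND [ARGS...]
# Set DRY_RUN to any non-empty value to print the command instead of running it.
run_with_memory_cap() {
    cap=$1
    shift
    if [ -n "${DRY_RUN:-}" ]; then
        echo "systemd-run --scope -p MemoryMax=$cap $*"
    else
        systemd-run --scope -p "MemoryMax=$cap" "$@"
    fi
}

# Example (the 2G value is an arbitrary placeholder, not a recommendation):
#   run_with_memory_cap 2G restic backup /mnt/scratch
```

If restic already runs from a systemd service, setting `MemoryMax=` in that unit's `[Service]` section should have the same effect without a wrapper.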
(My case is definitely not memory pressure, FWIW; I verified this in various ways, some of which are documented in this thread. But I also don't believe my case is related to restic; I suspect some sort of hardware issue or a kernel bug related to network file systems.)
I use Netdata to monitor my system and memory usage has never been particularly high at the time of or shortly before it froze. Nor has CPU or any other metric that I've noticed. |
Output of restic version
How did you run restic exactly?
The files being backed up (/mnt/*) are CIFS mounts from a Drobo NFS. The Drobo and the NUC running restic have fixed IPs. Both are protected by separate UPSes. Both are connected to the network via gigabit Ethernet (no wifi).
What backend/server/service did you use to store the repository?
Backblaze via the s3 restic backend.
Expected behavior
System should back up successfully.
Actual behavior
I've done this twice so far, once on a headless raspberry pi, and once on a headless NUC (the details for the latter are above). In both cases they were running some variant of Linux, accessed over SSH.
In both cases, everything seemed to be going great for the first few days (I have terabytes of data to back up). However, in both cases, after a few weeks, I went to check on the computer and could not ssh into it. In the case of the raspberry pi, I was never able to figure out what happened. The microSD card was trashed, and I could never get anything out of it (e.g. logs) to determine what happened. The problem occurred during a heat wave and I originally just assumed that, combined with the low capabilities of the pi and other software running on it at the time, was the problem.
More recently though I got a new NUC, freshly installed, no other tasks running on it at all. I set it up to do the backup as described above. I cleared the backblaze bucket first so it was a fresh backup too. At first everything seemed great, but when I checked up on it yesterday, the machine had hard hung. It was no longer doing DHCP or ARP (and its lease had expired). I connected a display and keyboard to the NUC to see what was happening; the keyboard did not respond (capslock did not toggle its LED, for example), and on the screen there was what I presume was a kernel crash. I regret that I did not think to take a photo, thinking it would be visible in the logs.
Unfortunately when I examined the logs I found nothing. Based on my DHCP logs, the machine disconnected from the network around 9am (I have short leases of around 15 minutes), but the latest entries in the log files all dated from hours earlier or were mundane entries like renewing DHCP. Whatever caused the problem prevented the last few seconds of logs from being written to disk; some of the log files ended with NULs (upon rebooting, there was a message about the filesystem having to replay the journal). Also I was not, at the time, logging restic output to disk.
Steps to reproduce the behavior
I don't know. So far I have had a 100% success rate at reproducing this by just using restic normally to back up several terabytes and then just waiting a few weeks (about 50% through the total backup, very roughly), but obviously two data points do not make a trend and in particular the first hang happened among many confounding circumstances.
Do you have any idea what may have caused this?
No.
Do you have an idea how to solve the issue?
No.
Thoughts
I'm running restic again now with the log being tee'd to disk and with data from /proc/meminfo and /proc/pressure/* being logged every minute in case that shows a trend. I am filing this issue in part in the hope that you will recognise the symptoms as those of some misconfiguration error I've made, and in part in the hope that you will suggest other things I could add to my "log system state every minute" script so that we can debug the cause if it happens again.
FWIW, currently I'm running:
...every minute and the first and latest log entries are:
I suppose the problem could be something like a bug in the CIFS kernel module where after some significant amount of network traffic, it crashes.
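For readers who want to set up similar per-minute state logging, here is a minimal sketch. It is a separate illustration, not the script referred to above. The function name `snapshot_state`, the choice of `/proc/meminfo` fields, and the log file name are assumptions. The `sync` after each write is there because the last log entries before the hang were reported as never reaching the disk.

```shell
#!/bin/sh
# Hypothetical state logger: prints a timestamped snapshot of selected
# memory figures and the first line of each pressure-stall (PSI) file.
#
# snapshot_state [PROC_DIR]   (PROC_DIR defaults to /proc)
snapshot_state() {
    proc=${1:-/proc}
    printf '=== %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    grep -E '^(MemTotal|MemAvailable|SwapFree|Dirty|Writeback):' "$proc/meminfo"
    for f in "$proc"/pressure/*; do
        # Skip if the kernel has no PSI support (the glob stays unexpanded).
        [ -r "$f" ] || continue
        printf '%s: ' "${f##*/}"
        head -n 1 "$f"
    done
}

# Example loop, flushing to disk after every snapshot:
#   while :; do snapshot_state >> state.log; sync; sleep 60; done
```

Writing the log to a different machine, for example over SSH or syslog, would be a further option if local writes still get lost when the machine hangs.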
Did restic help you today? Did it make you happy in any way?
I love the potential of restic! I really hope this is just an aberration because the promise of easy backups is extremely attractive. So far all my interactions with y'all have been super positive.