System-wide hang after a few weeks #4131

Closed
Hixie opened this issue Jan 5, 2023 · 32 comments

Labels
state: need feedback (waiting for feedback, e.g. from the submitter)

Comments

Hixie commented Jan 5, 2023

Output of restic version

ianh@feral:~$ restic version
restic 0.11.0 compiled with go1.15.9 on linux/amd64
ianh@feral:~$ uname -a
Linux feral 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
ianh@feral:~$ cat /proc/cpuinfo | grep model
model           : 122
model name      : Intel(R) Celeron(R) N4000 CPU @ 1.10GHz
model           : 122
model name      : Intel(R) Celeron(R) N4000 CPU @ 1.10GHz
ianh@feral:~$ cat /proc/meminfo | head -3
MemTotal:        3841828 kB
MemFree:          124884 kB
MemAvailable:    3083456 kB
ianh@feral:~$ df -hal
Filesystem      Size  Used Avail Use% Mounted on
sysfs              0     0     0    - /sys
proc               0     0     0    - /proc
udev            1.9G     0  1.9G   0% /dev
devpts             0     0     0    - /dev/pts
tmpfs           376M  816K  375M   1% /run
/dev/mmcblk0p2   55G  1.4G   51G   3% /
securityfs         0     0     0    - /sys/kernel/security
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
cgroup2            0     0     0    - /sys/fs/cgroup
pstore             0     0     0    - /sys/fs/pstore
efivarfs           0     0     0    - /sys/firmware/efi/efivars
none               0     0     0    - /sys/fs/bpf
systemd-1          -     -     -    - /proc/sys/fs/binfmt_misc
hugetlbfs          0     0     0    - /dev/hugepages
mqueue             0     0     0    - /dev/mqueue
debugfs            0     0     0    - /sys/kernel/debug
tracefs            0     0     0    - /sys/kernel/tracing
configfs           0     0     0    - /sys/kernel/config
fusectl            0     0     0    - /sys/fs/fuse/connections
/dev/mmcblk0p1  511M  5.8M  506M   2% /boot/efi
/dev/nvme0n1p1  1.8T   23G  1.7T   2% /home
tmpfs           376M     0  376M   0% /run/user/1000
binfmt_misc        0     0     0    - /proc/sys/fs/binfmt_misc

How did you run restic exactly?

export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export RESTIC_REPOSITORY="s3:s3.us-west-000.backblazeb2.com/..."
export RESTIC_PASSWORD_FILE=/home/ianh/restic/password.key
restic init
# the output is lost to time, but matched the usual output seen in tutorials
restic backup /mnt/*
# the output is what I would expect (i.e. just listing files it is backing up), although occasionally I do see lines like:
Save(<data/...>) returned error, retrying after 619.645675ms: client.PutObject: An internal error occurred. Please retry your upload.

The files being backed up (/mnt/*) are CIFS mounts from a Drobo NAS. The Drobo and the NUC running restic have fixed IPs. Both are protected by separate UPSes. Both are connected to the network via gigabit Ethernet (no wifi).

What backend/server/service did you use to store the repository?

Backblaze via the s3 restic backend.

Expected behavior

System should back up successfully.

Actual behavior

I've done this twice so far, once on a headless raspberry pi, and once on a headless NUC (the details for the latter are above). In both cases they were running some variant of Linux, accessed over SSH.

In both cases, everything seemed to be going great for the first few days (I have terabytes of data to back up). However, in both cases, after a few weeks, I went to check on the computer and could not ssh into it. In the case of the raspberry pi, I was never able to figure out what happened. The microSD card was trashed, and I could never get anything out of it (e.g. logs) to determine what happened. The problem occurred during a heat wave, and I originally just assumed the heat, combined with the low capabilities of the pi and the other software running on it at the time, was the problem.

More recently though I got a new NUC, freshly installed, no other tasks running on it at all. I set it up to do the backup as described above. I cleared the backblaze bucket first so it was a fresh backup too. At first everything seemed great, but when I checked up on it yesterday, the machine had hard hung. It was no longer doing DHCP or ARP (and its lease had expired). I connected a display and keyboard to the NUC to see what was happening; the keyboard did not respond (capslock did not toggle its LED, for example), and on the screen there was what I presume was a kernel crash. I regret that I did not think to take a photo, thinking it would be visible in the logs.

Unfortunately when I examined the logs I found nothing. Based on my DHCP logs, the machine disconnected from the network around 9am (I have short leases of around 15 minutes), but the latest entries in the log files all dated from hours earlier or were mundane entries like renewing DHCP. Whatever caused the problem prevented the last few seconds of logs from being written to disk; some of the log files ended with NULs (upon rebooting, there was a message about the filesystem having to replay the journal). Also I was not, at the time, logging restic output to disk.

Steps to reproduce the behavior

I don't know. So far I have had a 100% success rate at reproducing this by using restic normally to back up several terabytes and then waiting a few weeks (both hangs happened, very roughly, about 50% of the way through the total backup), but obviously two data points do not make a trend, and in particular the first hang happened amid many confounding circumstances.

Do you have any idea what may have caused this?

No.

Do you have an idea how to solve the issue?

No.

Thoughts

I'm running restic again now with the log being teed to disk and with data from /proc/meminfo and /proc/pressure/* being logged every minute in case that shows a trend. I am filing this issue in part in the hope that you will recognise the symptoms as those of some misconfiguration error I've made, and in part in the hope that you will suggest other things I could add to my "log system state every minute" script so that we can debug the cause if it happens again.

FWIW, currently I'm running:

echo ""
date
cat /proc/meminfo | grep MemFree
grep full /proc/pressure/*

...every minute and the first and latest log entries are:

Wed 04 Jan 2023 04:14:55 PM PST
MemFree:          112780 kB
/proc/pressure/io:full avg10=3.92 avg60=3.93 avg300=3.99 total=75934965
/proc/pressure/memory:full avg10=0.00 avg60=0.04 avg300=0.00 total=601574

[...]

Thu 05 Jan 2023 01:20:51 PM PST
MemFree:          115972 kB
/proc/pressure/io:full avg10=3.23 avg60=3.20 avg300=3.20 total=7854277124
/proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=73134839
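
(How this gets scheduled isn't shown above; a sketch of one way to run such a script every minute is a cron entry along these lines, with illustrative paths:

* * * * * /home/ianh/restic/log-system-state.sh >> /home/ianh/restic/system-state.log 2>&1

The 2>&1 ensures any errors from the script itself also end up in the log.)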

I suppose the problem could be something like a bug in the CIFS kernel module where after some significant amount of network traffic, it crashes.

Did restic help you today? Did it make you happy in any way?

I love the potential of restic! I really hope this is just an aberration because the promise of easy backups is extremely attractive. So far all my interactions with y'all have been super positive.

AlBundy33 commented Jan 7, 2023

I have no idea, but you could try

  • the latest version of restic
  • mounting your source in another way (maybe nfs or sftp)
  • backing up to another target (maybe nfs, sftp, local or usb)

could it be related to #2659 or #4130?

@MichaelEischer
Member

Save(<data/...>) returned error, retrying after 619.645675ms: client.PutObject: An internal error occurred. Please retry your upload.

That error is rather unusual. I haven't seen that specific error anywhere else so far.

and on the screen there was what I presume was a kernel crash.

A kernel crash is a problem of the underlying system, not of restic.

Whatever caused the problem prevented the last few seconds of logs from being written to disk; some of the log files ended with NULs (upon rebooting, there was a message about the filesystem having to replay the journal).

That indeed looks like a kernel crash.

So far I have had a 100% success rate at reproducing this by just using restic normally to back up several terabytes

Several terabytes will definitely require quite a bit of memory. But that would normally just result in restic getting killed by the kernel's oom killer...
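
One quick check after such a hang, in case it was the OOM killer: the kernel logs OOM kills very visibly. Something along these lines (standard journalctl usage, nothing restic-specific) inspects the previous boot's kernel messages:

journalctl -k -b -1 | grep -iE 'out of memory|oom'

If restic had been OOM-killed, a "Killed process ... (restic)" line would normally show up there.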

@AlBundy33

@MichaelEischer I'm just curious - why does restic need more memory for larger backups? Is it because of the metadata or because of the size of the files? 🤔

@MichaelEischer
Member

@AlBundy33 Restic uses an index which keeps information about each stored file chunk and which is used to deduplicate data. For performance / simplicity that index is fully loaded into memory. As each file chunk is listed in the index, it will grow as more data is added to a repository.

The file metadata inside a folder also requires additional memory, but unless you have a folder that directly contains hundreds of thousands of files (not counting subfolders recursively), the main memory user will be the index.
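
To illustrate why the index grows with repository size, here is a minimal sketch of the idea in Go (this is not restic's actual data structure; real index entries also track pack files, offsets, lengths and blob types):

package main

import (
	"crypto/sha256"
	"fmt"
)

// location stands in for what an index entry records about a chunk:
// where it already lives in the repository.
type location struct {
	packID, offset, length uint32
}

func main() {
	// One fixed-size entry per unique chunk ever stored, so memory
	// use scales with total repository size, not with the size of
	// the current backup.
	index := make(map[[32]byte]location)

	chunk := []byte("example chunk data")
	id := sha256.Sum256(chunk)

	if _, known := index[id]; known {
		fmt.Println("duplicate chunk: reference the existing copy, upload nothing")
	} else {
		index[id] = location{packID: 1, offset: 0, length: uint32(len(chunk))}
		fmt.Println("new chunk: upload it and record it in the index")
	}
}

A multi-terabyte repository contains millions of such chunks, and every one of them costs a fixed amount of index memory, which is where the usage described above comes from.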

Hixie commented Jan 8, 2023

So far it hasn't crashed a third time, I'll try to update and see if I can mount some other way if it does crash again...

MichaelEischer added the state: need feedback label Jan 11, 2023

jboyens commented Jan 23, 2023

So, I ran into a restic hang recently (version 0.15.0) and traced it down to an interaction between restic and Transparent Huge Pages (THP).

I'm not 100% up to speed on THP support in Go, but it seems like restic wants to use a hugepage and that amount of contiguous memory is not immediately available. The system just... hangs a bit while kcompactd and kswapd try to rearrange memory enough to make this possible.

So, the behavior I see is that khugepaged and kcompactd slam up to 100% CPU usage. If I close some stuff down, I can get it to immediately wake up and restic can continue. Otherwise, allocations all over the place suffer dramatically while the kernel attempts to resolve the issue.
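
For anyone wanting to check whether they are hitting the same thing, these are the standard sysfs knobs (kernel-level, nothing restic-specific); switching THP to madvise is an experiment to try, not a blanket recommendation:

# the bracketed value is the active policy
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# as root, effective until reboot: only use THP where a program
# explicitly asks via madvise(), avoiding synchronous compaction stalls
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag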

Hixie commented Jan 29, 2023

I've had two more cases of the machine I'm trying to use for restic hanging during the backup. No smoking guns so far (no obvious ramping up of CPU load or memory usage leading up to the hang, nothing in the logs, etc). I didn't have a monitor hooked up so no idea if it was a kernel crash these times. Trying some of the suggestions above now.

@MichaelEischer
Member

Maybe some of the suggestions from https://forum.restic.net/t/server-unresponsive-during-restic-backups/5739/24 help?

Hixie commented Jan 30, 2023

Yeah, I'll try those too.

I tried updating to the latest restic (downloaded the binary directly instead of using apt), and after a few hours the machine stopped responding to ssh (but still responds to pings for now). The last bit of output on the ssh connection was:

        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc006687fe8 sp=0xc006687fe0 pc=0x466e21
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5

goroutine 104 [syscall, 2 minutes]:
syscall.Syscall(0xc00efe15c0?, 0x40?, 0xc0062a1d00?, 0x80b9ca?)
        /usr/local/go/src/syscall/syscall_linux.go:68 +0x27 fp=0xc0062a1c18 sp=0xc0062a1ba8 pc=0x47ccc7
syscall.Close(0xc0028d4480?)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:295 +0x2a fp=0xc0062a1c48 sp=0xc0062a1c18 pc=0x47a1ea
internal/poll.(*FD).destroy(0xc0148eb560)
        /usr/local/go/src/internal/poll/fd_unix.go:84 +0x51 fp=0xc0062a1c70 sp=0xc0062a1c48 pc=0x49f2f1
internal/poll.(*FD).decref(0x10?)
        /usr/local/go/src/internal/poll/fd_mutex.go:213 +0x53 fp=0xc0062a1c90 sp=0xc0062a1c70 pc=0x49e153
internal/poll.(*FD).Close(0xc0148eb560)
        /usr/local/go/src/internal/poll/fd_unix.go:107 +0x4f fp=0xc0062a1cb8 sp=0xc0062a1c90 pc=0x49f38f
os.(*file).close(0xc0148eb560)
        /usr/local/go/src/os/file_unix.go:252 +0xad fp=0xc0062a1d10 sp=0xc0062a1cb8 pc=0x4ae60d
os.(*File).Close(...)
        /usr/local/go/src/os/file_posix.go:25
github.com/restic/restic/internal/repository.(*Repository).savePacker(0xc0003c61a0, {0x12d7850, 0xc0028d4480}, 0x1, 0xc00f07a618)
        /restic/internal/repository/packer_manager.go:177 +0x6e6 fp=0xc0062a1ed0 sp=0xc0062a1d10 pc=0x826ac6
github.com/restic/restic/internal/repository.newPackerUploader.func1()
        /restic/internal/repository/packer_uploader.go:37 +0xff fp=0xc0062a1f78 sp=0xc0062a1ed0 pc=0x826fbf
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x64 fp=0xc0062a1fe0 sp=0xc0062a1f78 pc=0x74b2e4
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0062a1fe8 sp=0xc0062a1fe0 pc=0x466e21
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5

goroutine 109 [semacquire, 2 minutes]:
runtime.gopark(0xae601d9c9766f9cd?, 0xc0002554e8?, 0xc0?, 0x89?, 0x761065?)
        /usr/local/go/src/runtime/proc.go:363 +0xd6 fp=0xc0002554a8 sp=0xc000255488 pc=0x438876
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:369
runtime.semacquire1(0xc00038088c, 0x30?, 0x3, 0x1)
        /usr/local/go/src/runtime/sema.go:150 +0x1fe fp=0xc000255510 sp=0xc0002554a8 pc=0x448d9e
sync.runtime_SemacquireMutex(0xf?, 0x1?, 0xc0002555c8?)
        /usr/local/go/src/runtime/sema.go:77 +0x25 fp=0xc000255540 sp=0xc000255510 pc=0x462fa5
sync.(*Mutex).lockSlow(0xc000380888)
        /usr/local/go/src/sync/mutex.go:171 +0x165 fp=0xc000255590 sp=0xc000255540 pc=0x470e25
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:90
github.com/restic/restic/internal/repository.(*packerManager).SaveBlob(0xc000380870, {0x12d7850, 0xc0028d4500}, 0x0?, {0x27, 0x61, 0xcb, 0xcf, 0x2, 0x5e, ...}, ...)
        /restic/internal/repository/packer_manager.go:68 +0xaf fp=0xc000255660 sp=0xc000255590 pc=0x825d2f
github.com/restic/restic/internal/repository.(*Repository).saveAndEncrypt(0xc0003c61a0, {0x12d7850, 0xc0028d4500}, 0x1, {0xc021e66000, 0x1351ff, 0x800000}, {0x27, 0x61, 0xcb, ...})
        /restic/internal/repository/repository.go:432 +0x358 fp=0xc000255790 sp=0xc000255660 pc=0x82b618
github.com/restic/restic/internal/repository.(*Repository).SaveBlob(0xc0003c61a0, {0x12d7850, 0xc0028d4500}, 0x1, {0xc021e66000, 0x1351ff, 0x800000}, {0x0, 0x0, 0x0, ...}, ...)
        /restic/internal/repository/repository.go:843 +0x27e fp=0xc000255d08 sp=0xc000255790 pc=0x82e57e
github.com/restic/restic/internal/archiver.(*BlobSaver).saveBlob(0xc021b41ab8?, {0x12d7850?, 0xc0028d4500?}, 0x1?, {0xc021e66000?, 0x1351ff, 0x800000?})
        /restic/internal/archiver/blob_saver.go:68 +0x85 fp=0xc000255dd8 sp=0xc000255d08 pc=0x797fe5
github.com/restic/restic/internal/archiver.(*BlobSaver).worker(0x0?, {0x12d7850, 0xc0028d4500}, 0xc00010b260)
        /restic/internal/archiver/blob_saver.go:95 +0x131 fp=0xc000255f48 sp=0xc000255dd8 pc=0x798271
github.com/restic/restic/internal/archiver.NewBlobSaver.func1()
        /restic/internal/archiver/blob_saver.go:33 +0x29 fp=0xc000255f78 sp=0xc000255f48 pc=0x797e29
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x64 fp=0xc000255fe0 sp=0xc000255f78 pc=0x74b2e4
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000255fe8 sp=0xc000255fe0 pc=0x466e21
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5

goroutine 112 [select, 2 minutes]:
runtime.gopark(0xc002997b70?, 0x2?, 0x10?, 0x46?, 0xc002997b4c?)
        /usr/local/go/src/runtime/proc.go:363 +0xd6 fp=0xc0001739d8 sp=0xc0001739b8 pc=0x438876
runtime.selectgo(0xc000173b70, 0xc002997b48, 0xf7578b007651d0bd?, 0x1, 0x40d5bf?, 0x1)
        /usr/local/go/src/runtime/select.go:328 +0x7bc fp=0xc000173b18 sp=0xc0001739d8 pc=0x447cdc
github.com/restic/restic/internal/archiver.(*BlobSaver).Save(0x7a6380f7?, {0x12d7850?, 0xc0028d4500?}, 0x30?, 0x106e1a0?, 0x800001?)
        /restic/internal/archiver/blob_saver.go:47 +0x98 fp=0xc000173ba0 sp=0xc000173b18 pc=0x797ed8
github.com/restic/restic/internal/archiver.(*BlobSaver).Save-fm({0x12d7850?, 0xc0028d4500?}, 0x2c?, 0x800000?, 0xc007d544e0?)
        <autogenerated>:1 +0x45 fp=0xc000173be0 sp=0xc000173ba0 pc=0x79eaa5
github.com/restic/restic/internal/archiver.(*FileSaver).saveFile(0xc0003809c0, {0x12d7850, 0xc0028d4500}, 0xc0000f4600?, {0xc007f34800, 0x77}, {0xc007f34680, 0x77}, {0x12dbf10, 0xc000012148}, ...)
        /restic/internal/archiver/file_saver.go:207 +0x8de fp=0xc000173d60 sp=0xc000173be0 pc=0x79939e
github.com/restic/restic/internal/archiver.(*FileSaver).worker(0xc0003809c0, {0x12d7850, 0xc0028d4500}, 0xc00010b380)
        /restic/internal/archiver/file_saver.go:264 +0x105 fp=0xc000173f48 sp=0xc000173d60 pc=0x799e25
github.com/restic/restic/internal/archiver.NewFileSaver.func2()
        /restic/internal/archiver/file_saver.go:54 +0x29 fp=0xc000173f78 sp=0xc000173f48 pc=0x7987a9
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x64 fp=0xc000173fe0 sp=0xc000173f78 pc=0x74b2e4
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000173fe8 sp=0xc000173fe0 pc=0x466e21
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/build/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5
                  

@MichaelEischer
Member

goroutine 104 [syscall, 2 minutes]:

That is very odd. Apparently the syscall to close one of the temporary files used by restic has hung for over 2 minutes. Either the system is swapping itself to death or something else is very wrong.

Do you also have the start of the stacktrace output? I wonder why the stacktrace is printed at all.

Hixie commented Feb 1, 2023

I unfortunately wasn't logging stderr. I will fix that before I restart it. (Sorry, only get to work on this in my spare time.)

For the earlier crash (where I didn't have a stack trace), I had enough logs to know the machine was only at about 50% memory usage, so I doubt it was swapping to death. In any case I've now disabled swap, to take that out of the equation.

borkd commented Feb 22, 2023

Out of curiosity: is the restic cache enabled during your normal use of restic?

Hixie commented Feb 27, 2023

So, it happened again (it actually hung a while ago; I just didn't look into it until now). I ran it like this:

(cd restic; . restic-config.sh; GOMAXPROCS=1 nice ionice restic backup --read-concurrency 1 /mnt/foo 2>&1 | tee -a log.current)
# restic/ is where I have restic-config.sh; it's just a random, mostly empty directory
# /mnt/foo is what I'm backing up
# restic-config.sh sets AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, RESTIC_REPOSITORY (to a string beginning with "s3:s3.us-west-000.backblazeb2.com/"), and RESTIC_PASSWORD_FILE

The contents of log.current after that were:

using parent snapshot b69fc099

Files:          71 new,    44 changed, 598893 unmodified
Dirs:            5 new,    42 changed, 56477 unmodified
Added to the repository: 58.376 MiB (58.385 MiB stored)

processed 599008 files, 20.688 GiB in 4:19
snapshot 3e44e9e0 saved

...though its timestamp is over a week before the system hung. I'm guessing something about my command above failed to report all the logs?

What should I do to get more useful logging?

Hixie commented Feb 27, 2023

Oh and I have no swap at all on the machine, and nothing else was happening (it's just a stock debian install with only restic running). Not sure what restic cache is.

borkd commented Feb 27, 2023

If swap is already off, consider adding --no-cache to the restic invocation, at least until you finish troubleshooting. A trashed microSD card and system unresponsiveness could be the effect of excessive media wear. Exploring what other software is running and persisting data might be useful. If the only candidate is indeed restic, then moving the cache elsewhere or disabling it could help. A disabled cache may incur extra transactions to/from S3, though, so you need to decide if that's acceptable. Check the online documentation if you want to know more about the restic cache.
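
For example (these are restic's documented --no-cache and --cache-dir options; paths illustrative):

# disable the on-disk cache entirely
restic backup --no-cache /mnt/foo

# or move the cache off the suspect medium
restic backup --cache-dir /home/ianh/restic-cache /mnt/foo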

@AlBundy33

You could also try memtest, prime95, and a local repo to ensure that this is not a hardware or network issue.

Hixie commented Feb 27, 2023

Trying user-mode memtester and the mprime torture test now (memtest86+ would be trickier since this is a headless machine in a rack). If the machine survives a few days of that I'll try a local repo to see if that does anything different.

Hixie commented Mar 26, 2023

Well I ran memtester and the mprime torture test for about 28 days and they had no issues. Next step is checking a local repo.

Hixie commented Apr 22, 2023

Ok I'm trying a local backup. I have a 2TB local drive, and did:

export RESTIC_REPOSITORY="/home/ianh/restic-local-backup"
export RESTIC_PASSWORD_FILE="/home/ianh/restic/password.key"
mkdir -p $RESTIC_REPOSITORY
restic init
GOMAXPROCS=1 nice ionice restic backup --read-concurrency 1 /mnt/scratch 2>&1 | tee -a log.current

We'll see what happens. This is backing up a subset of the files the previous attempts tried to back up because of the limited space on this volume. The source is still coming from a CIFS share.

Hixie commented Apr 23, 2023

Well, it did the 800GB backup from the CIFS share to a local disk mostly without issue (it said "no such file or directory" for two of the files, but let's ignore that while our bar for "issue" is "entire machine hangs")... Gonna try some different files in case it's something about the files I'm backing up.

Hixie commented Apr 24, 2023

Yeah, no problems there either (until it ran out of disk space). This suggests the problem may be specific to the S3 backend. I'll try using the Backblaze backend next...

Hixie commented Apr 27, 2023

Well here's an interesting datapoint. I'm pretty sure after the last backup attempt (which was not yet the one to backblaze) I left the machine idle, and today I find it's gone offline again.

It was online for literally months including several weeks of extensive activity without any trouble, then it hung a day or so after doing a backup. I have no explanation...

Hixie commented Apr 27, 2023

Maybe it's a CIFS problem after all, let's try extensive activity involving that drive but no restic...

Hixie commented May 24, 2023

Well I ran while (true); do find -type f -print0 | xargs -0 hd; done in the directory with the CIFS mount for about 4 weeks straight and nothing bad happened...

Hixie commented May 24, 2023

Going to leave the machine idle for a while now and see if it hangs like it did last month.

Hixie commented Jun 10, 2023

Well, it did. So I guess restic isn't to blame.

Hixie closed this as completed Jun 10, 2023
@MichaelEischer
Member

@Hixie Good luck with hunting the problem down, and thanks for keeping us up to date.

PHLAK commented Nov 17, 2023

Sorry to reply in this old issue but did you ever solve this problem @Hixie?

This seems almost exactly like an issue I'm having. That is, my system completely freezes periodically, with no errors logged to the journal, and keyboard input does not work. After basically replacing the entire system piece by piece (thinking it was hardware related), I'm now seeing a strong correlation between my Restic runs and the freezes. I am backing up to Backblaze.

Hixie commented Nov 17, 2023

Every few months I have system hangs, but I haven't been able to pin the problem on restic. I've seen it happen with the machine idle, and I've seen it happen while rsync is running.

@MichaelEischer
Member

My guess would be that your system runs out of memory. There's a certain chance that a Linux system may freeze in that case. Depending on how you run restic, it might be possible to e.g. limit the amount of memory a systemd unit is allowed to use.
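
For example, something along these lines (the 2G limit is illustrative and assumes cgroup v2; environment setup omitted):

systemd-run --unit=restic-backup -p MemoryMax=2G -p MemorySwapMax=0 \
    restic backup /mnt/foo

That way a runaway allocation gets restic killed by the kernel instead of dragging the whole system down.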

Hixie commented Nov 17, 2023

(My case is definitely not memory pressure, FWIW; I verified this in various ways, some of which are documented in this thread. But I also don't believe my case is related to restic; I suspect some sort of hardware issue or a kernel bug related to network file systems.)

PHLAK commented Nov 17, 2023

My guess would be that your system runs out of memory.

I use Netdata to monitor my system and memory usage has never been particularly high at the time of or shortly before it froze. Nor has CPU or any other metric that I've noticed.
