Bug: vpn constantly restarts due to being unhealthy #2154
Comments
have the same problem, did you find the solution? |
No I can't figure it out. My system is just down till there is a fix |
I'm having the same problem as well, also with openvpn through proton. Had previously worked without issue for at least the past year and stopped working some time around the last week or so. I tried rolling my openvpn credentials to see if I could revive gluetun, but no luck. I'm using a mix of openvpn and wireguard tunnels through proton in other applications and haven't had any problems so I think this is something unique to gluetun. |
Same problem for me here. I've tried multiple versions and they all result in: INFO [healthcheck] program has been unhealthy for 26s: restarting VPN (see https://github.com/qdm12/gluetun-wiki/blob/main/faq/healthcheck.md) Has something external used in these healthchecks changed? |
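To see how often the restarts quoted above actually happen, the log can be scanned for that healthcheck message. The following is a minimal sketch, not part of Gluetun: the function names are hypothetical, and it assumes each log line starts with an ISO 8601 timestamp with second precision, which may not match your log format.

```python
import re
from datetime import datetime

# Matches the healthcheck message quoted in this thread. The leading
# timestamp token is an ASSUMPTION about how log lines are prefixed.
RESTART_RE = re.compile(
    r"^(?P<ts>\S+)\s.*\[healthcheck\] program has been unhealthy "
    r"for (?P<secs>\d+)s: restarting VPN"
)


def restart_times(lines):
    """Return the timestamp of every 'restarting VPN' healthcheck line."""
    times = []
    for line in lines:
        match = RESTART_RE.match(line)
        if match is None:
            continue
        # Older Python versions reject a trailing 'Z', so normalise it first.
        stamp = match.group("ts").replace("Z", "+00:00")
        times.append(datetime.fromisoformat(stamp))
    return times


def gaps_in_seconds(times):
    """Seconds elapsed between consecutive restarts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Feeding it the saved container log (for example `gaps_in_seconds(restart_times(open("gluetun.log")))`, where the file name is a placeholder) gives a quick picture of whether restarts come every few seconds or every few hours.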
I tried this fix (both the :test docker tag and :latest) but it changed absolutely nothing for me. My container is still spam restarting with these exact same errors. Nothing external changed for me, I didn’t even update anything. My whole docker network just came crumbling down one day when my gluetun container started flipping out. I don’t get any network connection at all from my containers, and I’ve verified that it’s not a DOT problem, and I’ve also tried a new VPN config. Edit: I’ve just switched to linuxserver.io’s wireguard container. I’d recommend the same to anyone else who’s having this problem if they’re willing to read the docs. It works just as well as gluetun. I hate to shill another project in the issues, but for some reason it doesn’t seem to be a priority fixing this. |
I tried it on latest as well, problem still occurs. |
I downgraded to :v3 and it's working again for me. Obviously not ideal, but it works. |
FWIW, I'm still using gluetun but switched my protocol from OpenVPN to Wireguard and it's been stable ever since. That may be preferable for most people, compared to changing to a totally different container. I'd prefer to use OpenVPN over Wireguard for this application, since with Wireguard I have to specify one specific server to establish the tunnel, and that server may be down or heavily loaded. With OpenVPN I can just specify the VPN server city and get auto-assigned to an available VPN server. For now I'm biding my time on Wireguard and plan to switch back to OpenVPN when this bug is hopefully fixed soon. |
@vogtrj I'm using Wireguard and still have the issue. Glad it's working for you though. |
Message for everyone: I've spent one hour writing an answer to each of your comments, your username is tagged below, please read it. And in general, applicable to all of you:
And
shows you did not update your servers. This is likely the only problem, update your servers. It's a single command to run, please run it. @vogtrj your provider is different, please create another issue. Also make sure you've updated your servers, as well.
Not really a bug, so it won't be fixed. It's a VPN server issue or your servers data is outdated and you can update it. @misku
No, there would be no point. If network doesn't work for 6 seconds, failing each request sent every second, then it likely won't recover, and it's better to just restart the VPN internally. This was complex and implemented to resolve many problems (years ago) of VPN connection being 'up' but not working. For debugging (although I'm not sure what you can debug, since the VPN connection is dead anyway), you can change the VPN auto healing parameters, again, read the healthcheck page mentioned above. Your issue might be different (on top of being PIA instead of PureVPN). Please create a separate issue if you want to continue discussing your problem. But from a quick look, it may be related to the DNS over TLS setup downloading files
It seems to go unhealthy after or while downloading files
No. Update your servers data.
The reasons are:
Don't use the |
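The restart rule described in the reply above (one check per second, restart once checks have failed for about 6 seconds) can be sketched as a small function. This only illustrates the rule as stated in the comment; the function name and the exact threshold handling are assumptions, and Gluetun's real implementation may differ.

```python
def first_restart_tick(check_results, threshold=6):
    """Return the index of the check that would trigger a restart, or None.

    check_results: iterable of booleans, one per one-second check
                   (True means the check succeeded).
    threshold:     consecutive failures tolerated before restarting; 6 mirrors
                   the "6 seconds" mentioned above and is an assumption.
    """
    consecutive_failures = 0
    for tick, ok in enumerate(check_results):
        if ok:
            # Any success resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= threshold:
            return tick
    return None
```

Under this rule a brief outage of a few seconds never triggers a restart, while an unbroken run of failures does, which is why a link that drops for longer than the threshold shows up as repeated VPN restarts.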
@qdm12 thanks for your detailed response. As I mentioned in my initial post, I've tried this with a manual configuration downloaded from PureVPN directly as well which resulted in the same problem. Nevertheless, I'll try to update the servers again and let you know. |
Didn't change anything. I ultimately ran into some errors while updating servers, but I do have servers in servers.json so assuming they are updated.
This one happened last time. Guessing it's because of restarting/rerunning the command a couple of times.
|
@qdm12 Thank you for taking the time to answer and updating the healthcheck page. Just wanted to let you know I did update the servers data before posting here 😄 If |
Sorry in advance for this wall of text, I hope this helps anyone at all: tldr: For me, UDP-based VPNs (both Wireguard and OpenVPN) experience this issue, but TCP-based OpenVPN works without connection restarts. I've been experiencing this issue for a while, and I've been trying to troubleshoot it. Because health checks can fail for any number of reasons, a lot of people in this thread might be seeing different issues. Possible causes include: intermittently failing DNS lookups causing healthchecks to fail (perhaps a local PiHole rate-limiting your lookups with its default settings?); issues with UDP traffic and UDP traffic only (which is what I've finally confirmed happened in my case); the VPN service or the ISP not being stable; or being rate limited by the site the health check uses (cloudflare.com by default) because it is being hit so often. General shape of my version of the issue:
Tests performed with no changes:
TCP worked: Because OpenVPN supports TCP as another protocol option, I created a TCP-specific Proton VPN configuration for the same server and otherwise exact same settings, and the connectivity no longer fails; over 5 days under high load with no disconnections and no further failures whatsoever. In my case, it appears that something within the networking chain specifically is getting tripped up by UDP and UDP alone, but it's not clear at all what was involved because the exact same set of software is working flawlessly with UDP + Wireguard on another server with the exact same software configuration, on this exact same ISP to this exact same VPN server, in this immediate area, with zero disconnects for weeks. I'm honestly baffled, but I hope this wall of text gives others ideas on things they might not have checked yet such as testing Proton VPN's integrated DNS servers located at 10.8.8.1 / 10.7.7.1, or testing OpenVPN with TCP protocol if they're currently using UDP. |
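Since the write-up above lists failing DNS lookups and failing connections as separate possible causes, it can help to test the two steps separately from inside the affected network namespace. The following is a rough sketch under stated assumptions: `diagnose` is a hypothetical helper, the default target simply reuses cloudflare.com:443 from the discussion, and a successful TCP check says nothing about whether the UDP tunnel itself is healthy.

```python
import socket


def diagnose(host="cloudflare.com", port=443, timeout=3.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Classify a failed check as a DNS problem or a TCP connect problem.

    `resolve` and `connect` are injectable so the function can be exercised
    without network access; by default they are the standard library calls.
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except OSError as err:  # socket.gaierror is a subclass of OSError
        return f"dns failure: {err}"

    # Each entry is (family, type, proto, canonname, sockaddr);
    # keep only (ip, port) from the first sockaddr.
    address = infos[0][4][:2]
    try:
        conn = connect(address, timeout=timeout)
    except OSError as err:  # includes timeouts and refused connections
        return f"tcp failure to {address[0]}: {err}"
    conn.close()
    return "ok"
```

Running `diagnose()` repeatedly while the problem is occurring shows whether name resolution or the connection itself is what breaks; swapping in a different resolver address for comparison is left to the container's DNS settings.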
@kainzilla Thank you for sharing that, I think that must be my issue! I'm using ProtonVPN over Wireguard with UDP. I'll try changing to TCP and report back. |
Similar, I've been using UDP but do have TCP options. Will test and see if there is a difference. |
PureVPN uses openvpn tcp I think, definitely openvpn. |
I have the same issue. I was very happy with qmcgaw/gluetun, and it was working flawlessly for many months. Now the healthcheck is consistently failing and not allowing the tunnel to come up. I use Privado as a VPN, and it requires UDP (so I can't test or use TCP). I've tried to roll back to v3.37.0, and even v3.32 but those have the same issue. I've even tried it on a fresh k3s install on a separate system, and it's the same. I'm quite perplexed. I have checked all the items that the healthcheck wiki says to check, and they are all fine. It would be great if the healthcheck gave a bit more information on what is failing, instead of just saying " program has been unhealthy". For now, I'm using a completely different VPN solution. Hopefully this can get resolved. I really liked this solution. |
@kainzilla Awesome write up, thank you for this. I'll link this in the wiki. This is very scientific and really narrows down the problem source area to, well, UDP. Actually a few things crossing my mind related to this:
On the other hand, I'm quite curious (I think I have seen it somewhere else) about Gluetun taking the whole host network down...
Can you create a separate issue for this? Maybe it exhausts the TCP dialing somehow. Especially if the healthcheck timeout is at 3600s, it won't touch the VPN and will just keep on retrying to TCP dial cloudflare.com:443. Very odd indeed.
You can set LOG_LEVEL=debug to see details, but it's 99% chance just an
Yes it's probably due either to their VPN server or the udp connection being unreliable and cutting off for longer than 6s (or both). Maybe an openvpn configuration problem even in the official one, at least for udp. Try fiddling with the ping, mssfix and mtu options?
|
@qdm12 Thanks for the reply! I'm working on testing out some of the suggestions to see if I can further map out the issue.
No change I can perceive - it appears to happen still approx. every 1-4 hours, which seems consistent with the 'before' behavior.
I'm in progress on testing this suggestion now for the OpenVPN UDP connection type - so far I've actually been using the
The TCP configuration file is identical aside from
I'll be happy to test new values for these if you have suggestions - even in the working TCP configuration it does occasionally log a message about MTU mismatch that I haven't dug into, I'll collect those messages to post as well. Because those log entries only show on the OpenVPN connection type and not Wireguard (which appears to have an untouched 1400 MTU and no log entries for MTU), I think it won't relate to the UDP disconnects but I'm happy to test things out.
Absolutely! Let me confirm I can re-create that specific issue however after checking over some of the other tests - after the single instance of it happening, because changing healthcheck setting caused the network loss I'd reverted it immediately; digging into this could be interesting. |
Looks like the issue I posted is not related at all. I was using Privado, with Country: Switzerland. I changed Country to Canada, and it started working. |
I am having a similar issue that seems to be related to bandwidth (not sure). When downloading a torrent at 80-100 Mbps I get unhealthy checks on the container even though the download works fine. I am also not getting any information in the logs, the container just switches to unhealthy, then healthy and back and forth. I wish I could check/provide the logs for the error but the logs just stop at some point before the error, even in debug mode. Using Windscribe on the latest built on 2024-05-04T16:22:29.394Z (commit ef6874f). Below are the debug logs. After the healthy check it just flips between healthy and unhealthy even though the torrent in the background is downloading just fine. When I'm not downloading anything it stays healthy.
|
Healthcheck logic was changed a bit in 6042a9e this could help a bit (see the commit message for details). |
@r3ps4J I recently ran into this constant restart issue with Gluetun, which led me to this thread. First, I want to thank you, @kainzilla, and @qdm12 for a very thoughtful discussion on this tricky issue. I wanted to post the diff that stabilized my setup along with the details of my various fixes. My configuration:
The issue for me, I suspect, was one/all of the following:
The error in the logs:
I'd also see the following error on my Homepage Gluetun "widget" about the IP address being empty:
Here's the diff that appears to have resolved the issue:
diff --git a/playbooks/services/arr-clients.yaml b/playbooks/services/arr-clients.yaml
index 941b363..895eda7 100644
--- a/playbooks/services/arr-clients.yaml
+++ b/playbooks/services/arr-clients.yaml
@@ -28,7 +28,7 @@
group: Downloads
weight: 100
gluetun:
- image: qmcgaw/gluetun:v3
+ image: qmcgaw/gluetun:latest
enabled: true
ports:
- 8000:8000/tcp # Control server
@@ -124,11 +124,15 @@
- "{{ paths.local }}/gluetun:/gluetun"
environment:
- "VPN_SERVICE_PROVIDER=cyberghost"
- "OPENVPN_USER={{ vpn.openvpn.user }}"
- "OPENVPN_PASSWORD={{ vpn.openvpn.pass }}"
- "TZ={{ tz }}"
+ - "UPDATER_PERIOD=24h"
+ - "UPDATER_VPN_SERVICE_PROVIDERS=cyberghost"
- "SERVER_COUNTRIES=United States"
- "FIREWALL_OUTBOUND_SUBNETS=192.168.1.0/24"
+ - "PUBLICIP_API=ipinfo"
+ - "PUBLICIP_API_TOKEN={{ vpn.publicip_api_token }}"
- name: Create main qBittorrent directory
file:
@@ -176,6 +180,11 @@
- "{{ paths.local }}/qbittorrent/config:/config"
- "{{ paths.data }}/qbittorrent/downloads:/downloads"
- "{{ paths.data }}/qbittorrent/cross-seeds:/cross-seeds"
+ healthcheck:
+ test: ["CMD-SHELL", "ping -c 1 google.com && curl --fail http://localhost:{{ torrent.ports.http }}/"]
+ interval: 1m
+ timeout: 10s
+ retries: 3
- name: CNAME {{ torrent.dns }}.{{ dns.wtf }} to {{ ansible_host }}
amazon.aws.route53:
@@ -227,6 +236,11 @@
volumes:
- "{{ paths.local }}/sabnzbd/config:/config"
- "{{ paths.downloads }}/Downloads:/config/Downloads"
+ healthcheck:
+ test: ["CMD-SHELL", "ping -c 1 google.com && curl --fail http://localhost:{{ nzb.ports.http }}/"]
+ interval: 1m
+ timeout: 10s
+ retries: 3
The above health checks, along with
Once my container was healthy, you end up with this in the Gluetun logs:
There appears to be a back off of some sort as it's now polling IPInfo.io every 6 minutes or so. |
Thank you everyone who has put time and effort into diagnosing this issue. I've been experiencing this issue for a few days now after running the exact same setup without issues for the last few months. After starting the container I sometimes get the expected throughput for a few minutes (around 500 Mbit/s) but sometimes the following issue will start immediately: The fact that it doesn't die immediately every time seems weird. I've thought about ISP throttling, but from what I can gather that doesn't seem likely, especially because I don't create huge amounts of traffic. A few gig every now and then. I've tried:
If I can help narrow down the issue in any way, please let me know. |
How do you get around the fact that vpn providers have a certain number of servers each, so ultimately some of us using the same companies (mullvad, pia etc) are ultimately going to have the same IPs in terms of ip infos 30k a month checks? Surely it's a requirement to signup and get an api key for them, given the user base of gluetun is increasing daily? |
@matthenning this is very weird indeed. Please let us know if you resolve it, because I have no clue what it could be (especially in Germany).
I don't, and yes it is or will be problematic. Let's move this discussion to #2190 I could change the code to cycle/pick at random an ip data service, so it gives more room. I don't think there is an ultimate solution here though, but feel free to comment back on #2190 if you have an idea 😉 We could maybe use approximate free ip databases at some point, to have at least the country for an IP address. EDIT: also this is irrelevant to the healthcheck, the public ip is fetched independently. |
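To put rough numbers on the quota concern discussed above, here is a small back-of-the-envelope sketch. It assumes the figures mentioned in this thread (a public IP lookup roughly every 6 minutes, and a quota of about 30k requests per month per source IP); both numbers are taken from the comments and may not reflect the provider's actual limits, and the function names are hypothetical.

```python
def requests_per_month(interval_minutes, days=30):
    """Number of lookups one container makes in a month at a fixed interval."""
    minutes_per_month = days * 24 * 60
    return minutes_per_month // interval_minutes


def users_per_quota(monthly_quota, interval_minutes):
    """How many containers sharing one exit IP fit within the monthly quota."""
    return monthly_quota // requests_per_month(interval_minutes)
```

With those assumptions, `requests_per_month(6)` is 7200 and `users_per_quota(30000, 6)` is 4, so only a handful of users behind the same VPN exit IP would exhaust a shared unauthenticated quota.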
Hi, |
After a scheduled system upgrade & reboot on a RHEL server, I woke up to this unhealthy restart bug. Container restart didn't help. However adding this to my compose and recreating it brought back the VPN. Can try with simple recreate first if you don't want to add the dns entry.
Running version latest built on 2024-08-23T13:50:02.262Z (commit ff7cadb) Hope this helps someone. |
Speaking personally, I have had success just setting the protocol to TCP and the Credit to @kainzilla and miguelsousa46 on Reddit for these solutions (the latter using 120s instead). |
have you got a compose you can share? |
A small update to my prior testing scenarios that had been outlined in this post - the second, nearly-identical configuration at a second location switched from using an old consumer Google Wifi router to a dedicated OPNsense-based router, identical to the "problem" site, and the same issue started occurring. The same solution of using TCP instead of UDP worked around the issue. It wasn't possible to revert to the prior router to test if the issue resolved again, but it was hard-confirmed that nothing else had changed, and the failure was confirmed less than 48 hours after the change. OPNsense (and the very similar pfSense) uses open-source software that shows up in many other consumer products, and is based on the BSD OS. It's possible there's an obscure and rarely-triggered issue with UDP traffic handling that could cause this issue for some routers that either use the same open source network software or are outright based on OPNsense/pfSense. I've seen notes regarding UDP handling improvements in OPNsense release notes over the last year, but haven't yet tested if recent versions such as 24.7+ no longer experience the issue. If I test this, I'll update again in the future. Also, please keep in mind that because this "VPN restarts due to health" bug's symptoms can cover a lot of potential causes, many readers in here are experiencing VPN health restarts for other reasons than routers handling UDP poorly, so using TCP protocol with OpenVPN won't be the solution for everyone, but it's worth testing. |
I think this is an issue with v3.39.1. I reverted back to v3.39 (image: qmcgaw/gluetun:v3.39) and it's working so far EDIT by qdm12: v3.39 and v3.39.1 are the same 😄 v3.39 points to the last bugfix release so here it's v3.39.1 |
Just wanted to toss in that I too am having the same issues. I am running an OPNsense router, with gluetun + qBittorrent in a stack via Portainer, using paid ProtonVPN access. I just recently noticed the constant health check failures after watching downloads hit speed then drop out over and over again. I did a combination of what @kainzilla did, changing to TCP and also changing the DNS to ProtonVPN's, and got a more stable connection. I'm still getting health check failures, they are just much more spaced out at 40+ minutes versus 1-3 minutes. |
This is happening with me for the past few weeks at least, with Private Internet Access. I have paused all traffic to rule out any traffic issues and it's still restarting. |
I look forward to this issue being 'fixed' once and for all. In the meantime, I have set up a VM to only use Wireguard, running qBittorrent and a Python script that updates ProtonVPN port changes. You never know, this setup might become the new norm. |
Is this urgent?
None
Host OS
Debian 12 Bookworm
CPU arch
aarch64
VPN service provider
PureVPN
What are you using to run the container
docker-compose
What is the version of Gluetun
Running version latest built on 2024-03-07T12:32:25.391Z (commit 3254fc8)
What's the problem 🤔
My gluetun container started constantly restarting the vpn. I understand this is the "auto-healing" mechanism, but I can't figure out what causes it. Especially since I haven't changed anything in my gluetun configuration. Unsure if it's actually a bug or just a user error, but any help would be appreciated.
I checked the healthcheck page as well, so find my answers for each step below:
It should be correct, but just in case I also tried a manual configuration downloaded from PureVPN's dashboard with the latest IP which resulted in the same problem.
SERVER_REGIONS
I removed the countries filter altogether, but no luck.
I haven't changed my firewall or installed a new one, and it worked before.
It is definitely working outside the gluetun container.
v4.5.1?? Then downgrade back to v4.5.1. See @Miexil's comment.
Running on Debian 12 Bookworm so not relevant.
Here I am lol!
Share your logs (at least 10 lines)
Share your configuration