docker stats hangs #39523
The strace likely won't have much detail (as it would only trace the CLI, which is likely not where the issue is). I attached the files here for easier finding;
@cpuguy83 @kolyshkin ptal |
@jarmediagmbh is there anything useful in the daemon logs? Could you describe a bit what kind of load / containers you're running? (long-lived / short-lived containers? do you use health-checks? do the containers produce a lot of logs? do you use bind-mounts?)
All containers are running "interactively" (so a shell session attached to it?) |
Wow. I have found a suspicious thing: Besides our dev-containers we are using zalenium (https://github.com/zalando/zalenium) on the same docker host. It automatically generates containers with Google Chrome running in them, so you can remotely and programmatically control the Chrome instances, e.g. for test automation of websites. After I did a "docker-compose down" on the zalenium directories (zalenium consists of a control container and several Chrome instance containers), docker stats worked again. However, I am sure that the last time we had this issue, we removed all containers from the host and docker stats still did not work.
No, we ssh into the containers to work on them. There is no "docker exec -it [CONTAINER] /bin/bash", if you meant that.
Thanks; yes that's what I meant ( |
In the past we had a lot of problems using docker exec / docker run, because the commands sometimes did not terminate. Thus, even our control-scripts use ssh into the containers and not "docker exec". |
When starting and stopping a few tens (~30) of containers simultaneously and continuously, with periodic docker stats API invocations, the stats API hangs at a random time and often deadlocks my container termination routines, which call the stats API to sync the final stats to an internal database. FYI: I have an alternative stats implementation which reads and parses the sysfs directly, and using it instead of the docker stats API resolves the deadlock issue completely under heavy-load tests. So I'm pretty sure that the problem resides somewhere inside the docker daemon.
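For illustration, reading container counters straight from the cgroup files, bypassing the daemon, could look roughly like the sketch below. This is not the commenter's implementation. The cgroup v1 layout with the `cgroupfs` driver (`/sys/fs/cgroup/<controller>/docker/<id>/`) and the helper names `read_counter` and `read_container_stats` are assumptions; hosts using the systemd cgroup driver or cgroup v2 use different paths and file names.

```python
from pathlib import Path

# Assumption: cgroup v1 with the cgroupfs driver. Adjust for systemd or cgroup v2.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def read_counter(path: Path) -> int:
    """Read a single-integer cgroup file such as memory.usage_in_bytes."""
    return int(path.read_text().strip())


def read_container_stats(container_id: str, root: Path = CGROUP_ROOT) -> dict:
    """Collect a few counters for one container without calling the daemon."""
    mem_dir = root / "memory" / "docker" / container_id
    cpu_dir = root / "cpuacct" / "docker" / container_id
    return {
        "memory_usage_bytes": read_counter(mem_dir / "memory.usage_in_bytes"),
        "cpu_usage_ns": read_counter(cpu_dir / "cpuacct.usage"),
    }
```

Because this path never takes the daemon's container lock, it cannot be blocked by a wedged daemon, which matches the behaviour described above. It needs the full (untruncated) container ID.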
We have the same issue, where docker stats suddenly stops responding.
Issue description
Docker stats suddenly stops responding on all containers without an apparent cause.
This never resolves (has to be terminated): docker stats --no-stream dd7e99646619
This will error with "timeout waiting for stats": docker stats dd7e99646619
Docker info & version
All instances are the same: docker-info.txt
Logs
These are the docker daemon logs from around the time the stats endpoint stopped responding.
Instance A
Issue occurred around 10:00 and 13:00.
Instance B
Issue occurred around 22:00.
Instance C
Issue occurred around 10:00.
Metrics
I have all the Prometheus "node" metrics from the instances. I have observed:
What have you tried
I have tried sending the stats request directly to the docker socket in various combinations like so:
This request never resolves.
What fixes the problem
Restarting the docker daemon fixes the issue.
Important
I currently have three distinct instances with the problematic docker at hand and have not yet restarted the daemon.
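For reference, a direct one-shot request to the stats endpoint over the Docker socket, with a client-side timeout so that a wedged daemon is detected instead of blocking forever, could be sketched as follows. The original request from this comment is not shown, so this is only an assumed equivalent: the default socket path `/var/run/docker.sock`, the unversioned API path, and the names `UnixHTTPConnection`, `stats_path` and `fetch_stats` are all illustrative.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def stats_path(container_id: str) -> str:
    """Path for a single stats sample (stream=false) of one container."""
    return f"/containers/{container_id}/stats?stream=false"


def fetch_stats(container_id, socket_path="/var/run/docker.sock", timeout=10.0):
    """Return the parsed stats, or None if the daemon does not answer in time."""
    conn = UnixHTTPConnection(socket_path, timeout)
    try:
        conn.request("GET", stats_path(container_id))
        response = conn.getresponse()
        return json.loads(response.read())
    except socket.timeout:
        return None
    finally:
        conn.close()
```

A `None` result from `fetch_stats("dd7e99646619")` would correspond to the hang described here, while a healthy daemon returns a JSON document (or a JSON error message for an unknown container).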
On the stuck instances, please grab a goroutine stack dump. You can get this from the endpoint "/debug/pprof/goroutine?debug=2". Also please provide the output of "docker info" and "docker version", assuming those work (they shouldn't generally get wedged).
Thanks for the quick response, here it is. Instance A: Instance B: Instance C: |
@zatlodan Thanks, I can see there are a couple of places where this is getting wedged trying to close the (cloudwatch) log driver because the log buffer is full. Additionally the same thing is happening in the local log cache (used to allow This is causing the container state to remain locked indefinitely, which is causing other issues, namely preventing an image removal from completing because some container is using the image and it is trying to take the container lock to see if the container is running... I think this is also another bug, we shouldn't care if the container is running or not here... but there are likely historical reasons for why this is the way it is.
Thanks for the quick investigation 👍 I can see that some of the containers indeed can't send logs due to non-existent CloudWatch log groups. On the impacted instances there are currently only two containers out of the total ~24 that have this issue. However, I can see that the problematic image (with a non-existent log group) was deployed to each affected instance at the time when the issue arose, indicating a potential correlation. I will try to fix the logging issues and keep you updated.
@zatlodan In all your stack dumps I can see lots of stats requests that are not blocked and are only waiting on new stats data (normal), and then some that are blocked waiting for the container lock to be freed (due to log writing being blocked).
Just a quick update. We have fixed the logging misconfiguration issues so that all our docker containers are correctly sending logs to CloudWatch. There have been no issues with hanging stats calls or missing metrics since. So it's quite probable that the problem with the hanging stats API was caused by the misconfigured logging and the related issue with the log driver buffers that @cpuguy83 has mentioned.
Thanks for the update! |
Description
Docker works for a few weeks, then it somehow stops responding. Some commands work, others do not.
e.g. "docker stats" hangs, while "docker ps -a" works.
I searched through the various issues here that describe different hangs, but found nothing that solves my issue.
Restarting the docker daemon fixes the issue for a few days...
Steps to reproduce the issue:
Describe the results you received:
Docker hangs for whatever reason.
Describe the results you expected:
Docker should work as expected.
Additional information you deem important (e.g. issue happens only occasionally):
In other threads there was the hint of doing a SIGUSR1-Signal to dockerd and posting the result. Also, I have done an strace on the stuck "docker stats" command.
It can be found here: https://git.jar.media/snippets/8
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
No additional info.