docker stats hangs #39523

invokablegmbh · 2019-07-15T05:47:56Z

Description

Docker is working for a few weeks, then it is somehow not responding anymore. Some commands work, others not.
e.g. "docker stats" is hanging. "docker pa -a" works.

I searched through the various issues here that describe different hangs, but found nothing that solves my issue.

Restarting the docker daemon fixes the issue for a few days...

Steps to reproduce the issue:

We just use docker and are starting / stopping containers.

Describe the results you received:
Docker hangs for whatever reason.

Describe the results you expected:
Docker should work as expected.

Additional information you deem important (e.g. issue happens only occasionally):
In other threads there was the hint of doing a SIGUSR1-Signal to dockerd and posting the result. Also, I have done an strace on the stuck "docker stats" command.
It can be found here: https://git.jar.media/snippets/8

Output of docker version:

Docker version 18.09.7, build 2d0083d

Output of docker info:

Containers: 58
 Running: 58
 Paused: 0
 Stopped: 0
Images: 776
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-9-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 125.8GiB
Name: h19
ID: OJOK:YAOX:OTVB:ALUM:SZUE:ZKJD:T26K:OPEF:XP6O:33GQ:2GRT:KC36
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
No additional info.

The text was updated successfully, but these errors were encountered:

thaJeztah · 2019-07-15T10:24:28Z

The strace likely won't have much details (as it would only strike the CLI, which is likely not where the issue is)

I attached the files here for easier finding;
goroutine-stacks-2019-07-15T073325_0200.log.zip
strace_of_docker_stats.txt

thaJeztah · 2019-07-15T10:29:26Z

@cpuguy83 @kolyshkin ptal

thaJeztah · 2019-07-15T10:32:38Z

@jarmediagmbh is there anything useful in the daemon logs? Could you describe a bit what kind of load / containers you're running? (long-lived / short-lived containers? do you use health-checks? do container produce a lot of logs? do you use bind-mounts?)

invokablegmbh · 2019-07-15T10:52:06Z

is there anything useful in the daemon logs?

No. Before opening the issue here we tried so many different things. Of course we also analyzed several log files (syslog, daemonlog, dmesg etc.). We did not find suspicious things, just the normal "a new network interface appeared/disappeared etc.".

Could you describe a bit what kind of load / containers you're running? (long-lived / short-lived containers?
We have about 60 Containers running. Each container will be deleted and recreated each morning (we have a very special usecase.. ;) ). The containers offer development environments for web development, the load should not be sooo huge.

do you use health-checks?
no.

do container produce a lot of logs?
no.

do you use bind-mounts?
Yes. There are several mounts between a container and the host system.

thaJeztah · 2019-07-15T10:54:49Z

The containers offer development environments for web development, the load should not be sooo huge.

All containers are running "interactively" (so a shell session attached to it?)

invokablegmbh · 2019-07-15T10:56:18Z

Wow. I have found a suspicous thing: Besides our dev-containers we are using zalenium (https://github.com/zalando/zalenium) on the same docker host. It automatically generated containers with google chrome running in it, so you can remotely and programmatically control the chrome-instances e.g. for test automation for websites.

After I did a "docker-compose down" on the zalenium directories (zalenium consists of a control-container and several chrome-instance-containers), docker stats did work again. However, I am sure, that the last time we had this issue, we removed all containers from the host and docker stats did not work aswell.

invokablegmbh · 2019-07-15T10:57:22Z

The containers offer development environments for web development, the load should not be sooo huge.

All containers are running "interactively" (so a shell session attached to it?)

No, we ssh into the conatiners to work on them. There is no "docker exec -it [CONTAINER] /bin/bash", if you ment that.

thaJeztah · 2019-07-15T10:59:57Z

No, we ssh into the conatiners to work on them. There is no "docker exec -it [CONTAINER] /bin/bash", if you ment that.

Thanks; yes that's what I meant (docker exec or docker run -it) 👍

invokablegmbh · 2019-07-15T11:06:54Z

In the past we had a lot of problems using docker exec / docker run, because the commands sometimes did not terminate. Thus, even our control-scripts use ssh into the containers and not "docker exec".

achimnol · 2021-05-06T03:38:46Z

Starting and stopping a few tens (~30) containers simultaneously and continuously with periodic docker stats API invocation, the stats API gets hang at a random time and often makes deadlocks for my container termination routines which calls the stats API to sync the final stats to an internal database.
For testing, each container run just a simple "sleep" command for a few seconds, so there is no computational overheads of containers, though we are controlling the containers via our own "job scheduler" implementation.
Maybe there is a race condition issue inside the stats API implementation when the container terminates.

FYI: I have an alternative stats implementation which reads and parses the sysfs directly, and using it instead of the docker stats API resolves the deadlock issue completely, under heavy-load tests. So I'm pretty sure that the problem resides inside the docker daemon somewhere....

zatlodan · 2024-04-23T10:08:09Z

We have the same issue, where docker stats suddenly stops responding.

Issue description

3 virtual instances all have the same issue (occurred at different times).
Running ~30 containers.
Starting and stopping a few times a day.
The containers might start on different instance each time.
Running without issues for at least a week.

Docker stats suddenly stops responding on all containers without an apparent cause.

This never resolves (has to be terminated):

docker stats --no-stream dd7e99646619

This will error with "timeout waiting for stats":

docker stats dd7e99646619

Docker info & version

All instances are the same:

docker-info.txt
docker-version.txt

Logs

These are the docker daemon logs from around the time the stats endpoint stopped responding.
These are actually the only logs that are there +- a few hours

Instance A

Issue occurred around 10:00 and 13:00.

journalctl-logs.txt

Instance B

Issue occurred around 22:00.

journalctl-logs.txt

Instance C

Issue occurred around 10:00.

journalctl-logs.txt

Metrics

I have all the Prometheus "node" metrics from the instances.
There is no blatantly apparent cause or correlation with the metric data.

I have observed:

Drop in Disk IOps when the issue arises on two of the 3 instances.
Some network interfaces start receiving traffic and some stop receiving traffic.
Drop in memory usage.

What have you tried

I have tried sending the stats request directly to the docker socket in various combinations like so:

curl --unix-socket /var/run/docker.sock http:/v1.24/containers/452bb2e0564d/stats?stream=0
curl --unix-socket /var/run/docker.sock http:/v1.44/containers/452bb2e0564d/stats?stream=0
curl --unix-socket /var/run/docker.sock http:/v1.44/containers/452bb2e0564d/stats?stream=1

This request never resolves.

What fixes the problem

Restarting the docker daemon fixes the issue.

Important

I currently have three distinct instances with the problematic docker at hand, have not yet restarted the daemon.
And I can keep it that way for a few days.
So if we have a quick response, I can debug this further directly on the problematic instances.

cpuguy83 · 2024-04-23T13:49:45Z

On the stuck instances, please grab a goroutine stack dump.

You can get this from the endpoint t "/debug/pprof/goroutine?debug=2"

Also please provide the output of "docker info" and "docker version", assuming those work (they shouldn't generally get wedged).

zatlodan · 2024-04-23T15:58:56Z

Thanks for the quick response, here it is.

Instance A:
goroutinedump.txt
info.txt
version.txt
journalctl-logs.txt

Instance B:
goroutinedump.txt
info.txt
version.txt
journalctl-logs.txt

Instance C:
goroutinedump.txt
info.txt
version.txt
journalctl-logs.txt

cpuguy83 · 2024-04-23T17:00:19Z

@zatlodan Thanks, I can see there are a couple of places this is getting wedged trying to close the (cloudwatch) log driver because the log buffer is full. Additionally the same thing is happening in the local log cache (used to allow docker logs to work with cloudwatch). Its trying to send the log to cloudwatch but is blocked (because of the above problem) and it is preventing the log driver from being closed.

This is causing the container state to remain locked indefinitely which is causing other issues, namely preventing an image removal from completing because some container is using the image and its trying to take the container lock to see if the container is running... I think this is also another bug, we shouldn't care if the container is running or not here... but there are likely historical reasons for why this is the way it is.
I'll try and take a look at this as well.

zatlodan · 2024-04-24T10:00:57Z

Thanks for the quick investigation 👍

I can see that some of the containers indeed can't send logs due to non existing cloud watch log groups.
This is of course caused by a negligence on our part 😄

But on the impacted instances there are only two containers from the total ~24 that have this issue currently,
yet the stats endpoint is unresponsive on all containers on that instance.
That is a really weird behavior that one or two containers with full log buffers can cause such an issue.

Although I can see that the problematic image (with a non-existing log group) has been deployed to each affected instance at the time when the issue arose, indicating a potential correlation

I will try to fix the logging issues and keep you updated.

cpuguy83 · 2024-05-02T22:42:27Z

@zatlodan In all your stack dumps I can see lots of stats requests not blocked except on just new stats data (normal) and then some that are blocked waiting for the container lock to be freed (due to log writing being blocked).

zatlodan · 2024-05-18T23:19:39Z

Just a quick update.

We have fixed the logging misconfiguration issues so that all our docker containers are correctly sending logs to CloudWatch.
We have then restarted the docker service on all affected instances.

There were no issues with hanging stats calls or missing metrics since.

So its quite probable that the problem with the hanging stats API was caused by the misconfigured logging and the related issue with the log driver buffers that @cpuguy83 has mentioned.

cpuguy83 · 2024-05-18T23:55:50Z

Thanks for the update!

thaJeztah added area/daemon version/18.09 labels Jul 15, 2019

cpuguy83 linked a pull request Apr 23, 2024 that will close this issue

awslogs: Prevent close from being blocked on log #47748

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docker stats hangs #39523

docker stats hangs #39523

invokablegmbh commented Jul 15, 2019 •

edited

thaJeztah commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

achimnol commented May 6, 2021 •

edited

zatlodan commented Apr 23, 2024 •

edited

cpuguy83 commented Apr 23, 2024

zatlodan commented Apr 23, 2024

cpuguy83 commented Apr 23, 2024

zatlodan commented Apr 24, 2024 •

edited

cpuguy83 commented May 2, 2024

zatlodan commented May 18, 2024 •

edited

cpuguy83 commented May 18, 2024

docker stats hangs #39523

docker stats hangs #39523

Comments

invokablegmbh commented Jul 15, 2019 • edited

thaJeztah commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

thaJeztah commented Jul 15, 2019

invokablegmbh commented Jul 15, 2019

achimnol commented May 6, 2021 • edited

zatlodan commented Apr 23, 2024 • edited

Issue description

Docker info & version

Logs

Instance A

Instance B

Instance C

Metrics

What have you tried

What fixes the problem

Important

cpuguy83 commented Apr 23, 2024

zatlodan commented Apr 23, 2024

cpuguy83 commented Apr 23, 2024

zatlodan commented Apr 24, 2024 • edited

cpuguy83 commented May 2, 2024

zatlodan commented May 18, 2024 • edited

cpuguy83 commented May 18, 2024

invokablegmbh commented Jul 15, 2019 •

edited

achimnol commented May 6, 2021 •

edited

zatlodan commented Apr 23, 2024 •

edited

zatlodan commented Apr 24, 2024 •

edited

zatlodan commented May 18, 2024 •

edited