Commit 4f59d47

dockerfile: run buildkitd within a cgroup namespace for cgroup v2
Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates a new cgroup and mount namespace for buildkitd within a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This will ensure its `init` cgroup and all OCI worker managed cgroups are kept beneath the root cgroup of the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>
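The namespace check described in the message can be illustrated with a small sketch. This is not code from the commit: the helper names `cgroup_v2_path` and `in_own_cgroup_ns` are hypothetical, and the sketch assumes the cgroup v2 convention that the unified hierarchy appears in `/proc/<pid>/cgroup` as a single `0::<path>` entry.

```sh
#!/bin/sh
# Hypothetical helper (not part of the commit): print the cgroup v2 path
# recorded in a /proc/<pid>/cgroup-style file. The unified hierarchy entry
# has the form "0::<path>", so the path is the third colon-separated field.
cgroup_v2_path() {
  # $1: file to read; defaults to the calling shell's own cgroup file.
  awk -F: '$1 == "0" && $2 == "" { print $3; exit }' "${1:-/proc/self/cgroup}"
}

# Hypothetical helper: succeed when the recorded path is "/", which the
# entrypoint treats as "already running in our own cgroup namespace".
in_own_cgroup_ns() {
  [ "$(cgroup_v2_path "$1")" = "/" ]
}
```

Run inside the container, `in_own_cgroup_ns || echo "would unshare"` would be expected to mirror the decision the entrypoint makes before calling `unshare`, under the assumptions above.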
1 parent 975d8f0 commit 4f59d47

1 file changed: +36 -2 lines changed

Dockerfile

Lines changed: 36 additions & 2 deletions
@@ -205,7 +205,7 @@ FROM scratch AS release
 COPY --link --from=releaser /out/ /
 
 FROM alpine:${ALPINE_VERSION} AS buildkit-export-alpine
-RUN apk add --no-cache fuse3 git openssh openssl pigz xz iptables ip6tables \
+RUN apk add --no-cache fuse3 git openssh openssl pigz xz iptables ip6tables util-linux-misc \
   && ln -s fusermount3 /usr/bin/fusermount
 COPY --link examples/buildctl-daemonless/buildctl-daemonless.sh /usr/bin/
 VOLUME /var/lib/buildkit
@@ -380,8 +380,42 @@ EOT
 
 FROM buildkit-export AS buildkit-linux
 COPY --link --from=binaries / /usr/bin/
+COPY --link --chmod=755 <<EOF /usr/bin/buildkitd-entrypoint
+#!/bin/sh
+#
+# For cgroup v2, ensure buildkitd has a namespaced view of /sys/fs/cgroup by
+# running in a new cgroup and mount namespace and remounting /sys/fs/cgroup.
+# Assume we are already in our own cgroup ns if the current cgroup path is
+# "/".
+#
+# Note this is a workaround for the lack of cgroupns control in the Kubernetes
+# API. If KEP-5714 is adopted, this can eventually be removed.
+#
+# See https://github.com/kubernetes/enhancements/issues/5714
+
+set -e
+
+if [ -e /sys/fs/cgroup/cgroup.controllers ]; then
+  if [ "\$(cut -d: -f3 /proc/self/cgroup)" != "/" ]; then
+    echo creating cgroup namespace
+    exec /usr/bin/unshare --cgroup --mount /usr/bin/with-cgroupfs-remount /usr/bin/buildkitd "\$@"
+  fi
+fi
+
+exec /usr/bin/buildkitd "\$@"
+EOF
+COPY --link --chmod=755 <<EOF /usr/bin/with-cgroupfs-remount
+#!/bin/sh
+set -e
+
+options="\$(awk '\$2 == "/sys/fs/cgroup" { print \$4 }' /proc/self/mounts)"
+umount /sys/fs/cgroup
+mount -t cgroup2 -o "\$options" cgroup2 /sys/fs/cgroup
+
+exec "\$@"
+EOF
 ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1
-ENTRYPOINT ["buildkitd"]
+ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"]
 
 FROM buildkit-linux AS buildkit-linux-debug
 COPY --link --from=dlv /out/dlv /usr/bin/dlv
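
The `with-cgroupfs-remount` script in the diff reuses the existing mount options when it remounts `/sys/fs/cgroup`. A minimal sketch of that lookup follows, so it can be tried against a sample file without root. The function name `mount_options_for` is hypothetical and not part of the commit; the sketch assumes the usual `/proc/self/mounts` layout of `device mountpoint fstype options dump pass`.

```sh
#!/bin/sh
# Hypothetical helper (not part of the commit): print the mount options
# (4th field) of the entry mounted at the given mount point.
mount_options_for() {
  # $1: mount point to look up
  # $2: mounts file to read; defaults to /proc/self/mounts
  awk -v target="$1" '$2 == target { print $4 }' "${2:-/proc/self/mounts}"
}
```

For example, `mount_options_for /sys/fs/cgroup` prints the option string that the script passes to `mount -t cgroup2 -o`. If several mounts are stacked on the same mount point, this sketch prints one line per entry, which a caller would need to handle.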
