A story of Docker, QEMU, and memfd_create()

Last year I stumbled across a problem when running a Docker container in a CI environment. What made the case interesting is that the container was built for a foreign architecture. Docker supports this with the --platform option, and there are even official images on Docker Hub for it.

Initially, the problem presented itself like this:

$ docker run -it --rm --platform linux/arm64 [...] arm64v8/ubuntu:jammy
root@d6fb5c478cb6:/# ps
Error, do this: mount -t proc proc /proc

This means the ps(1) command could not run in this Docker container. At first I trusted the error message and thought that /proc might really not be mounted. However, Docker usually takes care of that, and the following check confirmed that it was in fact mounted:

root@d6fb5c478cb6:/# mount | grep proc | head -n1
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)

So what is actually going wrong when running a Docker container for a foreign architecture?

Docker containers for foreign architectures

In the docker run command above I hid an essential argument behind the ellipsis, because it would have given away the cause too early and I wanted you to keep guessing with me up to here. In reality there were a few more arguments, but the important one was -e TMPDIR=/path/does/not/exist. In the real CI environment this was one among many arguments, and it was purely accidental that the path given as TMPDIR did not exist inside the Docker container.

The first thing to note is that even if the path does not exist, everything works fine with a Docker container for the native host architecture amd64, while the same command fails in the arm64 container:

$ docker run -it --rm -e TMPDIR=/path/does/not/exist ubuntu:jammy
root@d2a623788b35:/# uname -m
x86_64
root@d2a623788b35:/# ps
    PID TTY          TIME CMD
      1 pts/0    00:00:00 bash
     10 pts/0    00:00:00 ps
$ docker run -it --rm --platform linux/arm64 -e TMPDIR=/path/does/not/exist arm64v8/ubuntu:jammy
root@df9268bce934:/# uname -m
aarch64
root@df9268bce934:/# ps
Error, do this: mount -t proc proc /proc

The important part is that a Docker container for a foreign architecture, such as arm64v8/ubuntu:jammy with --platform linux/arm64, uses QEMU user space emulation to execute the arm64 ELF binaries inside the container.

QEMU user space emulation

In the QEMU user space emulation mode, QEMU will run as an interpreter in user space to decode and emulate the instructions of the foreign machine code. Additionally, the system call interface of the Linux kernel depends on the architecture. Therefore when the program wants to use a system call, that will also be translated from the foreign architecture to the native host architecture.

Besides system calls, another interface the Linux kernel offers to user space is the /proc hierarchy. The procfs special filesystem offers information about the system and its current state as well as information about the current processes. For example, that filesystem is used by ps(1) to get the list of processes and the corresponding resources.

However, the information the kernel offers in these files differs between architectures. The most obvious example is the data in the /proc/cpuinfo file, as some programs might check for the processor model and other details. Other cases are files like /proc/self/cmdline or /proc/self/maps, which allow a process to inspect its own command line or memory mappings.

For these, QEMU wants to hide the fact that the program is being emulated. Therefore, as the open system call already has to be emulated to translate between the foreign and host architecture, QEMU will also take a look at the path that is being opened with the open(2) system call family and check for these special cases below /proc. Instead of opening the real file descriptor that would be handled by the Linux kernel, QEMU creates a temporary file, fills it with data as it should be read by the emulated user space program, and returns the file descriptor to this temporary file.

Misleading error messages

Now this is the point where the argument -e TMPDIR=/path/does/not/exist finally comes into play. In order to create the temporary file, QEMU generates a unique filename in the directory specified by the environment variable TMPDIR. Now you can already guess where the problem comes from. When QEMU cannot open the temporary file, it will return an error as the result of the system call.

This is why ps thinks that /proc is not mounted: it was actually not able to open("/proc/self/stat", …). The error message is not that specific, but of course this condition should not happen under normal circumstances.

$ docker run -it --rm --platform linux/arm64 -e TMPDIR=/path/does/not/exist arm64v8/ubuntu:jammy
root@d0f81bdca417:/# cat /proc/self/stat
cat: /proc/self/stat: No such file or directory

Another way to trigger this problem is with the docker run --read-only argument, as then QEMU will try to create a temporary file at /tmp in the Docker container’s root filesystem, which will not be writable:

$ docker run -it --rm --platform linux/arm64 --read-only arm64v8/ubuntu:jammy
root@746c79d77a70:/# cat /proc/self/stat
cat: /proc/self/stat: Read-only file system

The result is an even more confusing error message, because from the perspective of the user space program this was not even an attempt to write to the filesystem. And for the latter situation I also found another report of the problem.

memfd_create() for temporary files

A proper solution would be for QEMU not to need a temporary file at all just to obtain a file descriptor that it can return to the calling program. As that is a common need, the Linux kernel eventually gained the memfd_create(2) system call. It creates an anonymous, memory-backed file and can be used as an alternative to opening a temporary file with a real filename.

I submitted a patch to QEMU to use memfd_create(2) instead of opening a temporary file, if the system call is available (Linux kernel >= 3.17). After a bit of review feedback and another patch revision, the PATCH v2 was accepted. The new behavior was shipped with QEMU 7.1.0.

commit 5b63de6b54add51822db3c89325c6fc05534a54c
Author: Rainer Müller <raimue@codingfarm.de>
Date:   Fri Jul 29 17:49:51 2022 +0200
 
    linux-user: Use memfd for open syscall emulation
 
    For certain paths in /proc, the open syscall is intercepted and the
    returned file descriptor points to a temporary file with emulated
    contents.
 
    If TMPDIR is not accessible or writable for the current user (for
    example in a read-only mounted chroot or container) tools such as ps
    from procps may fail unexpectedly. Trying to read one of these paths
    such as /proc/self/stat would return an error such as ENOENT or EROFS.
 
    To relax the requirement on a writable TMPDIR, use memfd_create()
    instead to create an anonymous file and return its file descriptor.
 
    Signed-off-by: Rainer Müller <raimue@codingfarm.de>
    Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
    Message-Id: <20220729154951.76268-1-raimue@codingfarm.de>
    Signed-off-by: Laurent Vivier <laurent@vivier.eu>
 
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index b27a6552aa..ef53feb5ab 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -8260,16 +8260,22 @@ static int do_openat(CPUArchState *cpu_env, int dirfd, const char *pathname, int
         char filename[PATH_MAX];
         int fd, r;
 
-        /* create temporary file to map stat to */
-        tmpdir = getenv("TMPDIR");
-        if (!tmpdir)
-            tmpdir = "/tmp";
-        snprintf(filename, sizeof(filename), "%s/qemu-open.XXXXXX", tmpdir);
-        fd = mkstemp(filename);
+        fd = memfd_create("qemu-open", 0);
         if (fd < 0) {
-            return fd;
+            if (errno != ENOSYS) {
+                return fd;
+            }
+            /* create temporary file to map stat to */
+            tmpdir = getenv("TMPDIR");
+            if (!tmpdir)
+                tmpdir = "/tmp";
+            snprintf(filename, sizeof(filename), "%s/qemu-open.XXXXXX", tmpdir);
+            fd = mkstemp(filename);
+            if (fd < 0) {
+                return fd;
+            }
+            unlink(filename);
         }
-        unlink(filename);
 
         if ((r = fake_open->fill(cpu_env, fd))) {
             int e = errno;

Upstream: https://gitlab.com/qemu-project/qemu/-/commit/5b63de6b54add51822db3c89325c6fc05534a54c
