From 277b1675cac5aa87379ace28f598b32caa1144a1 Mon Sep 17 00:00:00 2001
From: Alex Pyrgiotis <alex.p@freedom.press>
Date: Wed, 22 May 2024 14:20:04 +0300
Subject: [PATCH] doc: Add design document for the gVisor integration

Add a design document for the gVisor integration, which is currently
under review. The associated pull request has lots of architectural
discussions about integrating gVisor, so in this document we collect
them all in one place.

Refs #590
---
 docs/developer/gvisor.md | 284 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 284 insertions(+)
 create mode 100644 docs/developer/gvisor.md

diff --git a/docs/developer/gvisor.md b/docs/developer/gvisor.md
new file mode 100644
index 000000000..3f41dda2c
--- /dev/null
+++ b/docs/developer/gvisor.md
@@ -0,0 +1,284 @@
+# gVisor integration
+
+Dangerzone has relied on the container runtime available in each supported
+operating system (Docker Desktop on Windows / macOS, Podman on Linux) to isolate
+the host from the sanitization process. The problem with this type of isolation
+is that it exposes a rather large attack surface; the Linux kernel.
+
+[gVisor](https://gvisor.dev/) is an application kernel, that emulates a
+substantial portion of the Linux Kernel API in Go. What's more interesting to
+Dangerzone is that it also offers an OCI runtime (`runsc`) that enables
+containers to transparently run this application kernel.
+
+As of writing this, Dangerzone uses two containers to sanitize a document:
+* The first container reads a document from stdin, converts each page to pixels,
+  and writes them to stdout.
+* The second container reads the pixels from a mounted volume (the host has
+  taken care of this), and saves the final PDF to another mounted volume.
+
+Our threat model considers the computation and output of the first container
+as **untrusted**, and the computation and output of the second container as
+trusted. For this reason, and because we are about to remove the need for the
+second container, our integration plan will focus on the first container.
+
+## Design overview
+
+Our integration goals are to:
+* Make gVisor available to all of our supported platforms.
+* Do not ask from users to run any commands on their system to do so.
+
+Because gVisor does not support Windows and macOS systems out of the box,
+Dangerzone will be responsible for "shipping" gVisor to those users. It will do
+so using nested containers:
+* The **outer** container is the Docker/Podman container that Dangerzone uses
+  already. This container acts as our **portability** layer. It's main purpose
+  is to bundle all the necessary configuration files and program to run gVisor
+  in all of our platforms.
+* The **inner** container is the gVisor container, created with `runsc`. This
+  container acts as our **isolation layer**. It is responsible for running the
+  Python code that rasterizes a document, in a way that will be fully isolated
+  from the host.
+
+### Building the container image
+
+This nested container approach directly affects the container image as well,
+which will also have two layers:
+* The **outer** container image will contain just Python3 and `runsc`, the
+  latter downloaded from the official gVisor website. It will also contain an
+  entrypoint that will launch `runsc`. Finally, it will contain the **inner**
+  container image (see below) as filesystem clone under
+  `/dangerzone-image/rootfs`.
+* The **inner** container image is practically the original Dangerzone image, as
+  we've always built it, which contains the necessary tooling to rasterize a
+  document.
+
+### Spawning the container
+
+Spawning the container now becomes a multi-stage process:
+
+The `Container` isolation provider spawns the container as before, with the
+following changes:
+
+* It adds two Linux capabilities to the **outer** container that didn't exist
+  before: `SETFCAP` and `SYS_CHROOT`. Those capabilities are necessary to run
+  `runsc` rootless, and are not inherited by the **inner** container.
+* It removes the `--userns keep-id` argument, which mapped the user outside the
+  container to the same UID (normally `1000`) within the container. This was
+  originally required when we were mounting host directories within the
+  container, but this no longer applies to the gVisor integration. By removing
+  this flag, the host user maps to the root user within the container (UID `0`).
+  - In distributions that offer Podman version 4 or greater, we use the
+    `--userns nomap` flag. This flag greatly minimizes the attack surface,
+    since the host user is not mapped within the container at all.
+* In distributions that offer Podman 3.x, we add a seccomp filter that adds the
+  `ptrace` syscall, which is required for running gVisor.
+
+Then, the following happens when Podman/Docker spawns the container:
+
+1. _(outer container)_ The entrypoint code finds from `sys.argv` the command
+   that Dangerzone passed to the `docker run` / `podman run` invocation.
+   Typically, this command is:
+
+   ```
+   /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
+   ```
+
+2. _(outer container)_ The entrypoint code then creates an OCI config for
+   `runsc` with the following properties:
+   * Use UID/GID 1000 in the **inner** container image.
+   * Run the command we detected on step 1.
+   * Drop all Linux capabilities.
+   * Limit the number of open files to 4096.
+   * Use the `/dangerzone-image/rootfs` directory as the root path for the
+     **inner** container.
+   * Mount a gVisor view of the `procfs` hierarchy under `/proc` , and then
+     mount `tmpfs` in the `/dev`, `/sys` and `/tmp` mount points. This way, no
+     host-specific info may leak to the **inner** container.
+     - Mount `tmpfs` on some more mountpoints where we want write access.
+3. _(outer container)_ If `RUNSC_DEBUG` has been specified, add some debug
+   arguments to `runsc` (applies to development environments only).
+4. _(outer container)_ If `RUNSC_FLAGS` has been specified, pass some
+   user-specified flags to `runsc` (applies to development environments only).
+5. _(outer container)_ Spawn `runsc` as a Python subprocess, and wait for it to
+   complete.
+6. _(inner container)_ Read the document from stdin and write pixels to stdout.
+   - In practice, nothing changes here, as far as the document conversion is
+     concerned. The Python process transparently uses the emulated Linux Kernel
+     API that gVisor provides.
+7. _(outer container)_ Exit the container with the same exit code as the inner
+   container.
+
+## Implementation details
+
+### Creating the outer container image
+
+In order to achieve the above, we add one more build stage in our Dockerfile
+(see [multi-stage builds](https://docs.docker.com/build/building/multi-stage/))
+that copies the result of the previous stages under `/dangerzone-image/rootfs`.
+Also, we install `runsc` and Python, and copy our entrypoint to that layer.
+
+Here's how it looks like:
+
+```dockerfile
+# NOTE: The following lines are appended to the end of our original Dockerfile.
+
+# Install some commands required by the entrypoint.
+FROM alpine:latest
+RUN apk --no-cache -U upgrade && \
+    apk --no-cache add \
+    python3 \
+    su-exec
+
+# Add the previous build stage (`dangerzone-image`) as a filesystem clone under
+# the /dangerzone-image/rootfs directory.
+RUN mkdir --mode=0755 -p /dangerzone-image/rootfs
+COPY --from=dangerzone-image / /dangerzone-image/rootfs
+
+# Download and install gVisor, based on the official instructions.
+RUN GVISOR_URL="https://storage.googleapis.com/gvisor/releases/release/latest/$(uname -m)"; \
+    wget "${GVISOR_URL}/runsc" "${GVISOR_URL}/runsc.sha512" && \
+    sha512sum -c runsc.sha512 && \
+    rm -f runsc.sha512 && \
+    chmod 555 runsc /entrypoint.py && \
+    mv runsc /usr/bin/
+
+COPY gvisor_wrapper/entrypoint.py /
+ENTRYPOINT ["/entrypoint.py"]
+```
+
+### OCI config
+
+The OCI config that gets produced is similar to this:
+
+```json
+{
+    "ociVersion": "1.0.0",
+    "process": {
+        "user": {
+            "uid": 1000,
+            "gid": 1000
+        },
+        "args": [
+            "/usr/bin/python3",
+            "-m",
+            "dangerzone.conversion.doc_to_pixels"
+        ],
+        "env": [
+            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
+            "PYTHONPATH=/opt/dangerzone",
+            "TERM=xterm"
+        ],
+        "cwd": "/",
+        "capabilities": {
+            "bounding": [],
+            "effective": [],
+            "inheritable": [],
+            "permitted": [],
+        },
+        "rlimits": [
+            {
+                "type": "RLIMIT_NOFILE",
+                "hard": 4096,
+                "soft": 4096
+            }
+        ]
+    },
+    "root": {
+        "path": "rootfs",
+        "readonly": true
+    },
+    "hostname": "dangerzone",
+    "mounts": [
+        {
+            "destination": "/proc",
+            "type": "proc",
+            "source": "proc"
+        },
+        {
+            "destination": "/dev",
+            "type": "tmpfs",
+            "source": "tmpfs",
+            "options": [
+                "nosuid",
+                "noexec",
+                "nodev"
+            ]
+        },
+        {
+            "destination": "/sys",
+            "type": "tmpfs",
+            "source": "tmpfs",
+            "options": [
+                "nosuid",
+                "noexec",
+                "nodev",
+                "ro"
+            ]
+        },
+        {
+            "destination": "/tmp",
+            "type": "tmpfs",
+            "source": "tmpfs",
+            "options": [
+                "nosuid",
+                "noexec",
+                "nodev"
+            ]
+        },
+        {
+            "destination": "/home/dangerzone",
+            "type": "tmpfs",
+            "source": "tmpfs",
+            "options": [
+                "nosuid",
+                "noexec",
+                "nodev"
+            ]
+        },
+        {
+            "destination": "/usr/lib/libreoffice/share/extensions/",
+            "type": "tmpfs",
+            "source": "tmpfs",
+            "options": [
+                "nosuid",
+                "noexec",
+                "nodev"
+            ]
+        }
+    ],
+    "linux": {
+        "namespaces": [
+            {
+                "type": "pid"
+            },
+            {
+                "type": "network"
+            },
+            {
+                "type": "ipc"
+            },
+            {
+                "type": "uts"
+            },
+            {
+                "type": "mount"
+            }
+        ]
+    }
+}
+
+```
+
+## Security considerations
+
+* gVisor does not have an official release on Alpine Linux. The developers
+  provide gVisor binaries from a GCS bucket. In order to verify the integrity of
+  these binaries, they also provide a SHA-512 hash of the files.
+  - If we choose to pin the hash, then we essentially pin gVisor, and we may
+    lose security updates.
+
+## Alternatives
+
+gVisor can be integrated with Podman/Docker, but this is the case only on Linux.
+Because we want gVisor on Windows and macOS as well, we decided to not move
+forward with this approach.