doc: Add design document for the gVisor integration
Add a design document for the gVisor integration, which is currently
under review. The associated pull request has lots of architectural
discussions about integrating gVisor, so in this document we collect
them all in one place.

Refs #590
apyrgio committed Jun 12, 2024
1 parent 5b00f56 commit 277b167
Showing 1 changed file with 284 additions and 0 deletions.
284 changes: 284 additions & 0 deletions docs/developer/gvisor.md
# gVisor integration

Dangerzone has relied on the container runtime available in each supported
operating system (Docker Desktop on Windows / macOS, Podman on Linux) to isolate
the host from the sanitization process. The problem with this type of isolation
is that it exposes a rather large attack surface: the Linux kernel.

[gVisor](https://gvisor.dev/) is an application kernel that emulates a
substantial portion of the Linux Kernel API in Go. What's more interesting to
Dangerzone is that it also offers an OCI runtime (`runsc`) that enables
containers to transparently run this application kernel.

At the time of writing, Dangerzone uses two containers to sanitize a document:
* The first container reads a document from stdin, converts each page to pixels,
and writes them to stdout.
* The second container reads the pixels from a mounted volume (the host has
taken care of this), and saves the final PDF to another mounted volume.

Our threat model considers the computation and output of the first container
as **untrusted**, and the computation and output of the second container as
trusted. For this reason, and because we are about to remove the need for the
second container, our integration plan will focus on the first container.

## Design overview

Our integration goals are to:
* Make gVisor available on all of our supported platforms.
* Avoid asking users to run any commands on their system to do so.

Because gVisor does not support Windows and macOS systems out of the box,
Dangerzone will be responsible for "shipping" gVisor to those users. It will do
so using nested containers:
* The **outer** container is the Docker/Podman container that Dangerzone uses
already. This container acts as our **portability** layer. Its main purpose
is to bundle all the necessary configuration files and programs to run gVisor
on all of our platforms.
* The **inner** container is the gVisor container, created with `runsc`. This
container acts as our **isolation layer**. It is responsible for running the
Python code that rasterizes a document, in a way that will be fully isolated
from the host.

### Building the container image

This nested container approach directly affects the container image as well,
which will also have two layers:
* The **outer** container image will contain just Python3 and `runsc`, the
latter downloaded from the official gVisor website. It will also contain an
entrypoint that will launch `runsc`. Finally, it will contain the **inner**
container image (see below) as a filesystem clone under
`/dangerzone-image/rootfs`.
* The **inner** container image is practically the original Dangerzone image, as
we've always built it, which contains the necessary tooling to rasterize a
document.

### Spawning the container

Spawning the container now becomes a multi-stage process:

The `Container` isolation provider spawns the container as before, with the
following changes:

* It adds two Linux capabilities to the **outer** container that didn't exist
before: `SETFCAP` and `SYS_CHROOT`. Those capabilities are necessary to run
`runsc` rootless, and are not inherited by the **inner** container.
* It removes the `--userns keep-id` argument, which mapped the user outside the
container to the same UID (normally `1000`) within the container. This was
originally required when we were mounting host directories within the
container, but this no longer applies to the gVisor integration. By removing
this flag, the host user maps to the root user within the container (UID `0`).
  - In distributions that offer Podman version 4 or greater, we use the
    `--userns nomap` flag. This flag greatly minimizes the attack surface,
    since the host user is not mapped within the container at all.
* In distributions that offer Podman 3.x, we add a seccomp filter that allows
  the `ptrace` syscall, which is required for running gVisor.
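
The flag changes above can be sketched as a small helper that assembles the extra arguments for the outer container. This is a minimal illustration, not Dangerzone's actual code: the function name `outer_container_flags`, its parameters, and the seccomp profile path are hypothetical. The flag names and the version threshold follow the text above.

```python
from typing import List, Optional, Tuple


def outer_container_flags(
    runtime: str,
    version: Tuple[int, int],
    seccomp_profile: Optional[str] = None,
) -> List[str]:
    """Return the extra `run` flags for the outer container (illustrative)."""
    # Capabilities that `runsc` needs in order to run rootless.
    flags = ["--cap-add", "SETFCAP", "--cap-add", "SYS_CHROOT"]

    if runtime == "podman":
        if version >= (4, 0):
            # Podman 4+: do not map the host user into the container at all.
            flags += ["--userns", "nomap"]
        elif seccomp_profile is not None:
            # Podman 3.x: use a seccomp profile that allows `ptrace`.
            flags += ["--security-opt", f"seccomp={seccomp_profile}"]
    return flags
```

For example, `outer_container_flags("podman", (4, 9))` adds the two capabilities plus `--userns nomap`, while a Docker runtime gets only the two capabilities.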

Then, the following happens when Podman/Docker spawns the container:

1. _(outer container)_ The entrypoint code reads from `sys.argv` the command
that Dangerzone passed to the `docker run` / `podman run` invocation.
Typically, this command is:

```
/usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
```

2. _(outer container)_ The entrypoint code then creates an OCI config for
`runsc` with the following properties:
* Use UID/GID 1000 in the **inner** container image.
* Run the command we detected in step 1.
* Drop all Linux capabilities.
* Limit the number of open files to 4096.
* Use the `/dangerzone-image/rootfs` directory as the root path for the
**inner** container.
* Mount a gVisor view of the `procfs` hierarchy under `/proc`, and then
mount `tmpfs` in the `/dev`, `/sys` and `/tmp` mount points. This way, no
host-specific info may leak to the **inner** container.
- Mount `tmpfs` on some more mountpoints where we want write access.
3. _(outer container)_ If `RUNSC_DEBUG` has been specified, add some debug
arguments to `runsc` (applies to development environments only).
4. _(outer container)_ If `RUNSC_FLAGS` has been specified, pass some
user-specified flags to `runsc` (applies to development environments only).
5. _(outer container)_ Spawn `runsc` as a Python subprocess, and wait for it to
complete.
6. _(inner container)_ Read the document from stdin and write pixels to stdout.
- In practice, nothing changes here, as far as the document conversion is
concerned. The Python process transparently uses the emulated Linux Kernel
API that gVisor provides.
7. _(outer container)_ Exit the container with the same exit code as the inner
container.
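
Steps 3 to 7 can be sketched as follows. This is a rough outline under stated assumptions, not the actual entrypoint: the helper names are hypothetical, the bundle directory `/dangerzone-image` is inferred from the `rootfs` location, and the specific `runsc` flags (`--rootless`, `--network=none`, `--debug`, `--alsologtostderr`) are one plausible choice rather than the confirmed set.

```python
import shlex
import subprocess
from typing import List, Mapping


def build_runsc_argv(
    env: Mapping[str, str], bundle_dir: str, container_id: str
) -> List[str]:
    """Assemble the `runsc` command line for the inner container."""
    argv = ["/usr/bin/runsc", "--rootless", "--network=none"]
    if env.get("RUNSC_DEBUG"):
        # Step 3: extra debug arguments (development environments only).
        argv += ["--debug", "--alsologtostderr"]
    if env.get("RUNSC_FLAGS"):
        # Step 4: user-specified flags, split like a shell would split them.
        argv += shlex.split(env["RUNSC_FLAGS"])
    # The bundle directory holds config.json and the rootfs directory.
    argv += ["run", "--bundle", bundle_dir, container_id]
    return argv


def run_inner_container(env: Mapping[str, str]) -> int:
    """Steps 5 and 7: spawn `runsc`, wait, and return its exit code."""
    argv = build_runsc_argv(env, "/dangerzone-image", "dangerzone")
    # stdin and stdout are inherited, so the document flows in and the
    # pixels flow out without the outer container touching them.
    return subprocess.run(argv, check=False).returncode
```

An entrypoint built this way would end with something like `sys.exit(run_inner_container(os.environ))`, so that the outer container exits with the inner container's exit code.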

## Implementation details

### Creating the outer container image

In order to achieve the above, we add one more build stage in our Dockerfile
(see [multi-stage builds](https://docs.docker.com/build/building/multi-stage/))
that copies the result of the previous stages under `/dangerzone-image/rootfs`.
Also, we install `runsc` and Python, and copy our entrypoint to that layer.

Here's what it looks like:

```dockerfile
# NOTE: The following lines are appended to the end of our original Dockerfile.

# Install some commands required by the entrypoint.
FROM alpine:latest
RUN apk --no-cache -U upgrade && \
    apk --no-cache add \
    python3 \
    su-exec

# Add the previous build stage (`dangerzone-image`) as a filesystem clone under
# the /dangerzone-image/rootfs directory.
RUN mkdir --mode=0755 -p /dangerzone-image/rootfs
COPY --from=dangerzone-image / /dangerzone-image/rootfs

# Download and install gVisor, based on the official instructions. The build
# fails if the SHA-512 checksum does not match.
RUN GVISOR_URL="https://storage.googleapis.com/gvisor/releases/release/latest/$(uname -m)"; \
    wget "${GVISOR_URL}/runsc" "${GVISOR_URL}/runsc.sha512" && \
    sha512sum -c runsc.sha512 && \
    rm -f runsc.sha512 && \
    chmod 555 runsc && \
    mv runsc /usr/bin/

# Copy the entrypoint into the image and make it executable.
COPY gvisor_wrapper/entrypoint.py /
RUN chmod 555 /entrypoint.py
ENTRYPOINT ["/entrypoint.py"]
```

### OCI config

The OCI config that gets produced is similar to this:

```json
{
  "ociVersion": "1.0.0",
  "process": {
    "user": {
      "uid": 1000,
      "gid": 1000
    },
    "args": [
      "/usr/bin/python3",
      "-m",
      "dangerzone.conversion.doc_to_pixels"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "PYTHONPATH=/opt/dangerzone",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [],
      "effective": [],
      "inheritable": [],
      "permitted": []
    },
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 4096,
        "soft": 4096
      }
    ]
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "dangerzone",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/tmp",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/home/dangerzone",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/usr/lib/libreoffice/share/extensions/",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    }
  ],
  "linux": {
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "network"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      }
    ]
  }
}
```
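
As an illustration of how an entrypoint might generate a config like the one above, here is a minimal Python sketch. The function names (`tmpfs_mount`, `build_oci_config`, `render_oci_config`) are hypothetical and the real entrypoint may structure this differently. All values are copied from the JSON above.

```python
import json
from typing import Any, Dict, List


def tmpfs_mount(destination: str, read_only: bool = False) -> Dict[str, Any]:
    """Describe a tmpfs mount that hides the underlying path."""
    options = ["nosuid", "noexec", "nodev"]
    if read_only:
        options.append("ro")
    return {
        "destination": destination,
        "type": "tmpfs",
        "source": "tmpfs",
        "options": options,
    }


def build_oci_config(command: List[str]) -> Dict[str, Any]:
    """Build the OCI config for the inner container as a Python dict."""
    cap_sets = ("bounding", "effective", "inheritable", "permitted")
    namespaces = ("pid", "network", "ipc", "uts", "mount")
    return {
        "ociVersion": "1.0.0",
        "process": {
            "user": {"uid": 1000, "gid": 1000},
            "args": command,
            "env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "PYTHONPATH=/opt/dangerzone",
                "TERM=xterm",
            ],
            "cwd": "/",
            # Drop all Linux capabilities.
            "capabilities": {name: [] for name in cap_sets},
            "rlimits": [{"type": "RLIMIT_NOFILE", "hard": 4096, "soft": 4096}],
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": "dangerzone",
        "mounts": [
            {"destination": "/proc", "type": "proc", "source": "proc"},
            tmpfs_mount("/dev"),
            tmpfs_mount("/sys", read_only=True),
            tmpfs_mount("/tmp"),
            tmpfs_mount("/home/dangerzone"),
            tmpfs_mount("/usr/lib/libreoffice/share/extensions/"),
        ],
        "linux": {"namespaces": [{"type": name} for name in namespaces]},
    }


def render_oci_config(command: List[str]) -> str:
    """Serialize the config; json.dumps cannot emit a trailing comma."""
    return json.dumps(build_oci_config(command), indent=2)
```

Generating the config from a dict and serializing it with `json.dumps` avoids hand-written JSON mistakes, and the command detected from `sys.argv` can be passed in directly as `command`.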

## Security considerations

* gVisor does not have an official release on Alpine Linux. The developers
provide gVisor binaries from a GCS bucket. In order to verify the integrity of
these binaries, they also provide a SHA-512 hash of the files.
- If we choose to pin the hash, then we essentially pin gVisor, and we may
lose security updates.
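
To make the trade-off concrete, this is roughly what a pinned-hash check would involve. It is a sketch only: the function names are hypothetical, and the pinned digest is a parameter because this document does not pin one. The Dockerfile above instead relies on `sha512sum -c` with the hash file published next to the binary.

```python
import hashlib
import hmac


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path: str, pinned_hex: str) -> bool:
    """Compare a file's digest against a digest pinned in the repository."""
    expected = pinned_hex.strip().lower()
    return hmac.compare_digest(sha512_of(path), expected)
```

With a pinned digest, every new gVisor release fails this check until the pinned value is updated, which is exactly how pinning can delay security updates.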

## Alternatives

gVisor can be integrated with Podman/Docker, but this is the case only on Linux.
Because we want gVisor on Windows and macOS as well, we decided to not move
forward with this approach.
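
For comparison, the rejected alternative would have looked roughly like the sketch below on a Linux host with Docker. It assumes that `runsc` has already been registered as a Docker runtime on the host, which is a manual setup step. The function name and the `image` parameter are hypothetical.

```python
from typing import List


def native_gvisor_argv(image: str, command: List[str]) -> List[str]:
    """Build a `docker run` invocation that uses gVisor as the OCI runtime."""
    # `--runtime=runsc` only works if the host's Docker daemon knows about
    # runsc, which is exactly the per-user setup we want to avoid.
    return ["docker", "run", "--rm", "-i", "--runtime=runsc", image, *command]
```

This keeps the image simple, since no nested container is needed, but it moves the gVisor installation to the host and is not available on Windows and macOS.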
