
Run Headscale on Fly.io with Litestream replication to the integrated Tigris object storage



NiklasRosenstein/headscale-fly-io


Headscale on Fly.io

This repository builds a Docker image that can be run as an app on Fly.io to create an easy, robust and affordable deployment of Headscale (an open source implementation of the Tailscale control plane, allowing you to create your self-hosted virtual private network using Tailscale clients). It uses Litestream to replicate and restore the SQLite database from an S3 bucket (such as the Tigris bucket integrated with your Fly.io app).

The default configuration uses the cheapest VM size available, shared-cpu-1x. This sizing should be sufficient to support tens, if not up to 100, nodes in your VPN while costing you approx. 2 USD/mo (depending on the region). Tigris object storage has a free allowance of 5GB/mo, which you will likely not exceed. (By default, we run Litestream with a longer sync interval so as not to exceed the free Tigris API request limit too easily.)

Note that, because Tailscale connected devices report back to the control plane on a regular, short interval, you won't be able to benefit from Fly.io technically being able to automatically scale your application down to 0, unless you have no nodes connected.

Contents

Prerequisites

Installation

Copy fly.example.toml to a fly.toml file and modify it. The minimum change you need to make is to update the app field. Unless you configure a custom domain, this will define the name of your Headscale server (i.e. https://<app>.fly.dev).

You then need to create the app, create object storage and initialize secret values that Headscale requires to run. These steps can be performed with the following commands. Note that the storage name can be anything, but if you don't have a better name, just give it the same name as the app.

$ fly apps create <app>
$ fly storage create -a <app> -n <name>
$ age-keygen -o age.privkey
$ fly secrets set NOISE_PRIVATE_KEY="privkey:$(openssl rand -hex 32)" AGE_SECRET_KEY="$(tail -n1 age.privkey)"
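
Before storing the Noise key as a secret, it can help to sanity-check its shape. The following is a minimal sketch, assuming the expected format is privkey: followed by 64 lowercase hex characters (which is what openssl rand -hex 32 produces for 32 bytes). The is_noise_key helper is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the argument looks like a Noise private key.
# Assumption: "privkey:" followed by exactly 64 lowercase hex characters.
is_noise_key() {
    printf '%s\n' "$1" | grep -Eq '^privkey:[0-9a-f]{64}$'
}

# Example with an obviously fake key (64 zeros). Never use this value for real.
example_key="privkey:$(printf '%064d' 0)"
if is_noise_key "$example_key"; then
    echo "key format looks fine"
else
    echo "unexpected key format" >&2
fi
```

A check like this only catches formatting mistakes (a missing prefix, a truncated value); it says nothing about whether the key is the one your nodes already know.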

All that's left now is to deploy the application. After the initial deployment, you should scale the application down to one machine (or pass --ha=false to the deploy command), as the initial deploy defaults to a machine count of two. Although the SQLite database is replicated, it does not support multiple writers that independently write data to the same database.

$ fly deploy
$ fly scale count 1

You could run the SQLite database with something like LiteFS to achieve a highly available installation of Headscale, but that is not currently supported in this project.

Usage

On a device, run

$ tailscale up --login-server https://<app>.fly.dev

Following the link that will be displayed in the console will give you the headscale command to run to register the device. You may need to create a user first with the headscale user create command. If you have not configured OIDC, you need to use the Headscale CLI to register the node in the control plane.

For this you can either shell into your Headscale deployment via fly ssh console and use the headscale command there, or use the Headscale CLI locally to remotely control it. For this, you must have first generated an API key by connecting via SSH and running headscale apikeys create.

Then, locally, make sure you have the same version of the Headscale CLI installed that is running on your Fly.io app, and configure it as described in the Headscale documentation. We use the usual gRPC port (50443).

$ export HEADSCALE_CLI_ADDRESS=${FLY_APP_NAME}.fly.dev:50443
$ export HEADSCALE_CLI_API_KEY=...
$ headscale node list
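
Note that FLY_APP_NAME is set automatically inside the Fly.io machine, but it is probably not set in your local shell. The following sketch derives the address from the app name instead. The headscale_cli_address helper and the my-headscale app name are hypothetical, and the custom-domain branch assumes that the gRPC port is reachable under your custom domain as well.

```shell
#!/bin/sh
# Hypothetical helper: print the remote CLI address for a Headscale on Fly.io app.
#   $1 = Fly.io app name
#   $2 = optional custom domain (assumption: gRPC is reachable there too)
headscale_cli_address() {
    app="$1"
    domain="${2:-}"
    if [ -n "$domain" ]; then
        printf '%s:50443\n' "$domain"
    else
        printf '%s.fly.dev:50443\n' "$app"
    fi
}

# "my-headscale" is a placeholder; use the app name from your fly.toml.
HEADSCALE_CLI_ADDRESS="$(headscale_cli_address my-headscale)"
export HEADSCALE_CLI_ADDRESS
echo "$HEADSCALE_CLI_ADDRESS"
```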

Updates

You should use an immutable tag in your fly.toml configuration file's [build.image] parameter. Using a mutable tag, such as :main (pointing to the latest version of the main branch of this repository), does not guarantee that your deployment comes up with the latest image version as a prior version may be cached.

Simply run fly deploy after updating the [build.image]. Note that there will be a brief downtime unless you configured a highly available deployment. Be sure to check the release notes to see if there are any breaking changes that require an update to your app's configuration!
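
If you want to guard against accidentally deploying a mutable tag, a small pre-deploy check can help. This is only a sketch: the is_pinned_image helper is hypothetical, the registry.example image name is a placeholder, and treating latest as mutable alongside main is our assumption rather than something this project defines.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an image reference looks pinned.
# Assumption: "main" and "latest" are mutable, every other tag is immutable.
is_pinned_image() {
    ref="$1"
    case "$ref" in
        *@sha256:*) return 0 ;;   # digest references are always immutable
    esac
    tag="${ref##*:}"              # text after the last colon
    if [ "$tag" = "$ref" ]; then
        return 1                  # no colon at all, so no tag was given
    fi
    case "$tag" in
        */*) return 1 ;;          # the colon belonged to a registry port
        main|latest) return 1 ;;  # mutable tags
        *) return 0 ;;
    esac
}

# Placeholder image names, for illustration only.
for ref in registry.example/headscale-fly-io:0.1.0-headscale-0.23.0 \
           registry.example/headscale-fly-io:main; do
    if is_pinned_image "$ref"; then
        echo "pinned:  $ref"
    else
        echo "mutable: $ref"
    fi
done
```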

Advanced configuration and usage

ACLs

We configure Headscale to store the ACL in the database instead of reading it from a file. This allows updating the ACLs without a fly deploy on every update. Follow the above steps to remote-control the Headscale server and then use the headscale policy get and headscale policy set commands.

Configuring OIDC

To enable OIDC, you must at the minimum provide the following environment variables:

  • HEADSCALE_OIDC_ISSUER
  • HEADSCALE_OIDC_CLIENT_ID
  • HEADSCALE_OIDC_CLIENT_SECRET

Please make sure that you pass the client secret using fly secrets set instead of via the [env] section of your fly.toml configuration file.

Using a custom domain

  1. Create a CNAME entry for your Fly.io application
  2. Run fly certs add <custom_domain>
  3. Set the HEADSCALE_DOMAIN_NAME=<custom_domain> in the fly.toml's [env] section and re-deploy

See also the related documentation on Fly.io: Custom domains.

Metrics

Metrics are automatically available through Fly.io's built-in managed Prometheus metrics collection and Grafana dashboard. Simply click on "Metrics" in your Fly.io account and explore headscale_* metrics.

Environment variables

Many Headscale configuration options can be set via the [env] section in your fly.toml configuration file. The following is a complete list of the environment variables that Headscale on Fly.io recognizes, including those that are expected to be set automatically.

System variables

| Variable | Default | Description |
| --- | --- | --- |
| AWS_ACCESS_KEY_ID | (automatic) | Access key for the object storage for Litestream SQLite replication. Usually set automatically by Fly.io when enabling the Tigris integration. |
| AWS_SECRET_ACCESS_KEY | (automatic) | Secret key for the object storage. |
| AWS_REGION | (automatic) | |
| AWS_ENDPOINT_URL_S3 | (automatic) | |
| BUCKET_NAME | (automatic) | |
| FLY_APP_NAME | (automatic) | Used to determine the Headscale server URL, if HEADSCALE_DOMAIN_NAME is not set. |

Security variables

| Variable | Default | Description |
| --- | --- | --- |
| AGE_SECRET_KEY | n/a, but required | age secret key for encrypting your Litestream SQLite replication. |
| NOISE_PRIVATE_KEY | n/a, but required | Noise private key for Headscale. Generate with echo privkey:$(openssl rand -hex 32). Important: Pass this value securely with fly secrets set. |

Headscale configuration variables

| Variable | Default | Description |
| --- | --- | --- |
| HEADSCALE_DOMAIN_NAME | ${FLY_APP_NAME}.fly.dev | URL of the Headscale server. |
| HEADSCALE_DNS_BASE_DOMAIN | tailnet | Base domain for members in the Tailnet. This must not be a part of the HEADSCALE_DOMAIN_NAME. |
| HEADSCALE_DNS_MAGIC_DNS | true | Whether to use MagicDNS. |
| HEADSCALE_DNS_NAMESERVERS_GLOBAL | 1.1.1.1, 1.0.0.1, 2606:4700:4700::1111, 2606:4700:4700::1001 | A comma-separated list of global DNS servers to use. Defaults to Cloudflare DNS servers. To use NextDNS, supply the URL like https://dns.nextdns.io/abc123. |
| HEADSCALE_DNS_SEARCH_DOMAINS | (empty) | A comma-separated list of search domains. Note that with MagicDNS enabled, your tailnet base domain is always the first search domain. |
| HEADSCALE_LOG_LEVEL | info | Log level for the Headscale server. |
| HEADSCALE_PREFIXES_V4 | 100.64.0.0/10 | Prefix for IPv4 addresses of nodes in the Tailnet. |
| HEADSCALE_PREFIXES_V6 | fd7a:115c:a1e0::/48 | Prefix for IPv6 addresses of nodes in the Tailnet. |
| HEADSCALE_PREFIXES_ALLOCATION | random | How IPs are allocated to nodes joining the Tailnet. Can be random or sequential. |
| HEADSCALE_EPHEMERAL_NODE_INACTIVITY_TIMEOUT | 30m | The time after which an inactive ephemeral node is deleted from the control plane. |
| HEADSCALE_OIDC_ISSUER | n/a | If set, enables OIDC configuration. Must be set to the URL of the OIDC issuer. For example, if you use Keycloak, it might look something like https://mykeycloak.com/realms/main |
| HEADSCALE_OIDC_CLIENT_ID | n/a, but required if OIDC is enabled | The OIDC client ID. |
| HEADSCALE_OIDC_CLIENT_SECRET | n/a, but required if OIDC is enabled | The OIDC client secret. Important: Configure this through fly secrets set. |
| HEADSCALE_OIDC_SCOPES | openid, profile, email | A comma-separated list of OpenID scopes. (The comma-separated list must be valid YAML if placed inside [ ... ].) |
| HEADSCALE_OIDC_ALLOWED_GROUPS | n/a | A comma-separated list of groups to permit. Note that this requires your OIDC client to be configured with a groups claim mapping. In some cases you may need to prefix the group name with a slash (e.g. /headscale). (The comma-separated list must be valid YAML if placed inside [ ... ].) |
| HEADSCALE_OIDC_ALLOWED_DOMAINS | n/a | A comma-separated list of email domains to permit. (The comma-separated list must be valid YAML if placed inside [ ... ].) |
| HEADSCALE_OIDC_ALLOWED_USERS | n/a | A comma-separated list of users to permit. (The comma-separated list must be valid YAML if placed inside [ ... ].) |
| HEADSCALE_OIDC_STRIP_EMAIL_DOMAIN | true | Whether to strip the email domain for the Headscale user names. |
| HEADSCALE_OIDC_EXPIRY | 180d | The amount of time from when a node is authenticated with OpenID until it expires and needs to reauthenticate. Setting the value to "0" means no expiry. |
| HEADSCALE_OIDC_USE_EXPIRY_FROM_TOKEN | false | Use the expiry from the token received from OpenID when the user logged in. This will typically lead to a frequent need to reauthenticate and should only be enabled if you know what you are doing. If enabled, HEADSCALE_OIDC_EXPIRY is ignored. |
| HEADSCALE_OIDC_ONLY_START_IF_OIDC_IS_AVAILABLE | true | Fail startup if the OIDC server cannot be reached. |

Litestream configuration variables

| Variable | Default | Description |
| --- | --- | --- |
| LITESTREAM_ENABLED | true | Whether to restore and replicate the SQLite database with Litestream. You likely never want to turn this option off, as you will lose your SQLite database on restarts. |
| LITESTREAM_RETENTION | 24h | Configure the Litestream retention period. Retention is enforced periodically and can be changed with LITESTREAM_RETENTION_CHECK_INTERVAL. |
| LITESTREAM_RETENTION_CHECK_INTERVAL | 1h | The interval at which retention should be applied. |
| LITESTREAM_VALIDATION_INTERVAL | 12h | The interval at which Litestream does a separate restore of the database and validates the result vs. the current database. |
| LITESTREAM_SYNC_INTERVAL | 10s | Frequency at which frames are pushed to the replica. Note that Litestream's typical default is 1s, and increasing this frequency can increase storage costs due to higher API request counts. |
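
To get a feeling for why the sync interval matters, the following back-of-the-envelope sketch computes an upper bound on the number of sync cycles in a 30-day month. The syncs_per_month helper is hypothetical and takes the interval as a plain number of seconds. We assume here that each sync cycle results in at least one API request when there are new frames to push, and that idle periods without writes cause fewer requests, so treat the numbers as a ceiling rather than a prediction.

```shell
#!/bin/sh
# Seconds in a 30-day month: 30 * 24 * 60 * 60 = 2592000.
seconds_per_month=$((30 * 24 * 60 * 60))

# Hypothetical helper: upper bound on sync cycles per month.
#   $1 = sync interval in whole seconds (e.g. 10 for "10s")
syncs_per_month() {
    echo $(( seconds_per_month / $1 ))
}

echo "1s interval:  at most $(syncs_per_month 1) sync cycles per month"
echo "10s interval: at most $(syncs_per_month 10) sync cycles per month"
```

With the default of 10s, the ceiling is 259200 cycles per month, a tenth of the 2592000 cycles possible at Litestream's typical 1s default.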

Maintenance variables

| Variable | Default | Description |
| --- | --- | --- |
| ENTRYPOINT_DEBUG | n/a | If set to true, enables logging of executed commands in the container entrypoint and prints out the Headscale configuration before startup. Use with caution, as it might reveal secret values to stdout (and thus into Fly.io's logging infrastructure). |
| ENTRYPOINT_IDLE | false | If set to true, go idle instead of starting the Headscale server. Will also go idle if an intermediate error occurs. Useful for recovering secrets when the deployment critically fails. Note that after a short time, Fly will turn off the machine since its health check won't be coming online. |
| IMPORT_DATABASE | false | If set to true, the entrypoint will check for an import-db.sqlite file in the S3 bucket to restore, and use that instead of litestream restore if it exists. Note that the file will not be removed, so you should disable this option and remove the file from the bucket once the import is complete. |

Migrating to Headscale on Fly.io

To migrate your existing Headscale instance that uses SQLite to Fly.io, you must upload the database to the S3 bucket under a file named import-db.sqlite and temporarily set the IMPORT_DATABASE=true environment variable. This will instruct the application to load this database file instead of attempting a Litestream restore on startup. Once this is done and Litestream has finished replicating this database state to S3, you must remove the IMPORT_DATABASE environment variable and re-deploy your application. You should also consider removing the import-db.sqlite file from the S3 bucket again.

You should also make sure that you set the NOISE_PRIVATE_KEY secret variable to the contents of your original Headscale instance's noise private key.

Migrating from Postgres

Warning: These steps have been tested on Headscale 0.23.0 only.

If your current Headscale deployment is using a Postgres database, you must convert it to an SQLite database before you can migrate your instance to Headscale on Fly.io. You can leverage the script provided by bigbozza/headscalebacktosqlite for this; it is made available more conveniently in this repository in ./headscale-back-to-sqlite.

First, you need to grab an empty SQLite database that was initialized by Headscale (so all the tables exist with the right schemas). You can do this by grabbing it from an initial Fly.io deployment. If your deployment already has some data in it because you did some prior testing, you can set the LITESTREAM_ENABLED=false environment variable to not use Litestream and have Headscale start from an empty database (remember to unset this variable again once you have retrieved the empty SQLite database).

Because Headscale is configured to use SQLite in WAL mode, we must first create a WAL checkpoint to ensure that the database initialization is committed to the database file.

$ fly deploy
$ fly ssh console
app> $ apk add sqlite
app> $ sqlite3 /var/lib/headscale/db.sqlite
app> sqlite3> PRAGMA wal_checkpoint(TRUNCATE);
app> sqlite3> [Ctrl+D]
app> $ exit
$ fly ssh sftp get /var/lib/headscale/db.sqlite
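
Before feeding the downloaded file to the conversion script, you may want to confirm that it really is an SQLite database. The sketch below checks for the 15-character magic string "SQLite format 3" that every SQLite database file starts with. The is_sqlite_file helper is hypothetical, and we assume the file was saved as db.sqlite in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file starts with the SQLite magic string.
is_sqlite_file() {
    [ -f "$1" ] && [ "$(head -c 15 "$1" | tr -d '\0')" = "SQLite format 3" ]
}

# Assumption: the sftp download above produced ./db.sqlite.
if is_sqlite_file db.sqlite; then
    echo "db.sqlite looks like an SQLite database"
else
    echo "db.sqlite is missing or not an SQLite database" >&2
fi
```

This does not verify that the schema is complete; it only catches an empty or truncated download early.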

Change into the ./headscale-back-to-sqlite directory and use uv to run the script.

$ uv run main.py \
    --pg-host db-host.example \
    --pg-port 5432 \
    --pg-db headscale \
    --pg-user headscale \
    --pg-password DBPASSWORD \
    --sqlite-out path/to/db.sqlite

This will perform read-only operations on the Postgres database so you do not need to worry about creating a separate backup of your Postgres database.

If all succeeded, upload the database to the S3 bucket that Headscale on Fly.io also uses for Litestream replication. If you're using the Tigris object storage extension in Fly.io, you will likely need to log into the Tigris console via the Fly.io dashboard and generate some temporary access credentials. The following example uses the mc CLI to upload the file.

$ mc alias set tigris https://fly.storage.tigris.dev <ACCESS_KEY_ID> <SECRET_ACCESS_KEY>
$ mc cp path/to/db.sqlite tigris/<YOUR_BUCKET_NAME>/import-db.sqlite

Set the IMPORT_DATABASE=true environment variable and re-deploy your application.

$ fly deploy --env IMPORT_DATABASE=true
$ fly logs

Wait for the application to start, the database to be imported from S3 and Litestream to have replicated it to the S3 bucket. Then re-deploy to remove the IMPORT_DATABASE variable.

$ fly deploy

You should be good to go!

litestream-entrypoint.sh

As part of this repository, the litestream-entrypoint.sh can be considered public API that can be consumed by other projects that want to use Litestream in the same fashion as this project. It can be retrieved with curl or copied from the container published by the project under the /var/lib/headscale/litestream-entrypoint.sh path. However, you must pin a tagged version to ensure reproducibility and compatibility (newer versions might change in a backwards-incompatible way).

Other projects that use this script include:

Development

Simply iterating via fly deploy works quite well!

To update the ToC in this file, run

$ uvx mksync -i README.md

Releases are tagged in the form of <version>-headscale-<headscale_version>. Creating a release requires the GitHub CLI.

$ ./scripts/release 0.1.0-headscale-0.23.0
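
Since the tag encodes both the version of this project and the bundled Headscale version, it can be split with plain shell parameter expansion, for example to find out which Headscale CLI version to install locally. This is a sketch only; the split_release_tag helper is hypothetical and not part of ./scripts/release.

```shell
#!/bin/sh
# Hypothetical helper: split "<version>-headscale-<headscale_version>"
# and print "<version> <headscale_version>".
split_release_tag() {
    tag="$1"
    case "$tag" in
        *-headscale-*) ;;
        *) echo "not a release tag: $tag" >&2; return 1 ;;
    esac
    project_version="${tag%%-headscale-*}"   # everything before "-headscale-"
    headscale_version="${tag#*-headscale-}"  # everything after "-headscale-"
    printf '%s %s\n' "$project_version" "$headscale_version"
}

split_release_tag 0.1.0-headscale-0.23.0   # prints "0.1.0 0.23.0"
```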

Integration testing

We perform a lightweight integration test by deploying the application to a Fly.io app after successful build on the main branch, which will fail if the application doesn't come up healthy.