From 3504fca9ae1cb6995a043debfb9b2ef12cd7c247 Mon Sep 17 00:00:00 2001
From: Mathis Kretz
Date: Wed, 19 Mar 2025 23:10:17 +0100
Subject: [PATCH] Add Vault migration post

---
 content/posts/vault-migration/index.md | 663 +++++++++++++++++++++++++
 1 file changed, 663 insertions(+)
 create mode 100644 content/posts/vault-migration/index.md

diff --git a/content/posts/vault-migration/index.md b/content/posts/vault-migration/index.md
new file mode 100644
index 0000000..9d77489
--- /dev/null
+++ b/content/posts/vault-migration/index.md
@@ -0,0 +1,663 @@
---
title: "Migrating HashiCorp Vault Between AKS Clusters"
date: 2025-03-18
author: "Max Leske (Xovis) and Mathis Kretz (bespinian)"
tags: ["Kubernetes", "Vault", "Azure", "Cloud Engineering"]
categories: ["Engineering", "Cloud Native"]
---

# Migrating HashiCorp Vault Between AKS Clusters

## Introduction

At [Xovis](https://xovis.com) and bespinian, we recently faced the challenge of
migrating a Kubernetes-based HashiCorp Vault instance from one Azure Kubernetes
Service (AKS) cluster to another. The goal was to consolidate infrastructure and
reduce the number of AKS clusters that required maintenance. The outcome of a
somewhat arduous journey is a Bash script that has enabled us to test the
migration repeatedly and then perform the actual migration with an outage of
less than 2 minutes.

This blog post walks through our approach and presents the migration steps in 14
structured chapters. Each chapter includes excerpts from the automation script,
highlighting the checks, operations, and configuration changes we performed
along the way.

## Why We Migrated

Operating multiple AKS clusters can introduce unnecessary complexity and
overhead. To optimize our infrastructure and simplify maintenance, we decided to
migrate Vault from one AKS cluster to another, reducing the number of clusters
we needed to manage. Our primary goals were:

- Minimal downtime to avoid disruptions to dependent services.
- Data integrity with a verifiable transition from the old to the new instance.
- Seamless switchover to ensure applications continued to function without
  configuration changes.
- A rollback strategy in case of unforeseen issues.

## Migration Strategy

Our Vault instances run in a high-availability configuration using the
integrated Raft storage backend and auto-unseal with Azure Key Vault. This setup
is needed to ensure resilience and security across client environments. To
migrate Vault between AKS clusters, we implemented the process as a Bash script
that automates all steps, including snapshot creation, data restoration, and
unseal key migration.

### Pre-Migration Setup

Before initiating the migration:

1. **Provision the new Vault cluster**: Set up a fresh Vault instance on the
   target AKS cluster with the same configuration as the existing one.
2. **Configure networking and authentication**: Ensure that the new instance has
   the necessary access permissions and that external clients will be able to
   connect post-migration.
3. **Test the migration in a staging environment**: Migrate a test Vault
   instance between two non-production clusters to verify the process.

## Migration Process

The migration script consists of the steps outlined below.
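Throughout the step excerpts that follow, a few small helper functions appear,
such as `print_message`, `switch_context`, and `wait_for_any_key`, whose
definitions are not part of the excerpts. The sketch below is only an
illustration of what such helpers could look like, not our actual
implementations; in particular, the assumption that each Vault instance runs in
a namespace named after the instance is ours.

```bash
# Illustrative helper sketches, not the actual implementations from our script

print_message() {
  # Prefix status output with a timestamp so the migration log is easy to follow
  echo "[$(date '+%H:%M:%S')] $*"
}

switch_context() {
  # Point kubectl at the given cluster and at the namespace of the given
  # Vault instance (assumed here to be named after the instance)
  local cluster_name="$1"
  local vault_name="$2"
  kubectl config use-context "${cluster_name}" > /dev/null
  kubectl config set-context --current --namespace "${vault_name}" > /dev/null
}

wait_for_any_key() {
  # Block until the operator confirms by pressing a key
  read -n 1 -s -r -p "Press any key to continue..."
  echo
}
```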
### 1. Prepare the Migration

Before performing the actual migration, it's crucial to validate your
environment and ensure all prerequisites are met. This step includes setting up
local backups, verifying Kubernetes contexts, ensuring proper access
permissions, and preparing the user for the maintenance window.

- Create a local backup directory:

  ```bash
  mkdir -p vault-backups
  ```

- Check that the required tools are available and that `kubectl wait` is
  supported

  ```bash
  REQUIRED_TOOLS=(az kubectl jq ed dig)

  for tool in "${REQUIRED_TOOLS[@]}"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      print_message "Required tool '$tool' is not installed or not in PATH."
      MISSING=1
    fi
  done

  if ! kubectl wait --help | grep -q 'Wait for a specific condition'; then
    print_message "'kubectl wait' is not supported."
    MISSING=1
  fi

  if [ -n "$MISSING" ]; then
    print_message "Please install the missing tools."
    exit 1
  fi
  ```

- Check that the `kubectl` contexts for both clusters are available

  ```bash
  if ! (kubectl config get-contexts "${OLD_K8S_NAME}" > /dev/null 2>&1 && \
        kubectl config get-contexts "${NEW_K8S_NAME}" > /dev/null 2>&1); then
    print_message "No contexts for ${OLD_K8S_NAME} and ${NEW_K8S_NAME}."
    print_message "Make sure you have credentials for both clusters"
    exit 1
  fi
  ```

- Check that both `kubectl` contexts are accessible

  ```bash
  switch_context "${OLD_K8S_NAME}" "${OLD_VAULT_NAME}"
  if ! kubectl get cm > /dev/null 2>&1; then
    print_message "Can't access resources on ${OLD_K8S_NAME}. Aborting"
    exit 1
  fi

  switch_context "${NEW_K8S_NAME}" "${NEW_VAULT_NAME}"
  if ! kubectl get cm > /dev/null 2>&1; then
    print_message "Can't access resources on ${NEW_K8S_NAME}. Aborting"
    exit 1
  fi
  ```

- Check that the new Vault cluster has three replicas

  ```bash
  availableReplicas=$(kubectl get statefulsets.apps "${NEW_VAULT_NAME}" \
    -o template --template="{{.status.availableReplicas}}")
  if [ "${availableReplicas}" -ge 3 ]; then
    print_message "Expected number of replicas found on ${NEW_K8S_NAME}"
  else
    print_message "Unexpected replica count on ${NEW_K8S_NAME}. Exiting."
    exit 1
  fi
  ```

- Our auto-unseal setup using Azure Key Vault is based on the AKS cluster's
  Managed Service Identity. We thus need to check that the Managed Service
  Identity of the new Vault cluster has access to the unseal keys of both Vault
  instances, since both sets of keys will be required during the migration. If
  either check fails, the missing role assignment can be granted as sketched
  after this list.

  ```bash
  az account set -s "${NEW_VAULT_SUBSCRIPTION_NAME}"

  msi_principal_id=$(az identity list \
    -g "${NEW_VAULT_RESOURCE_GROUP}" \
    --query "[?clientId == '${NEW_VAULT_MSI_CLIENT_ID}'].principalId" \
    | jq -r '.[]')

  key_vault_id=$(az keyvault show --name "${NEW_KEY_VAULT_NAME}" \
    --query "id" | jq -r '.')

  # The query returns an empty result (with exit code 0) when the role is
  # missing, so we test the output rather than the exit status
  if [ -z "$(az role assignment list \
        --scope "${key_vault_id}" \
        --role "Key Vault Crypto Service Encryption User" \
        --query "[?principalId == '${msi_principal_id}']" -o tsv)" ]; then
    print_message "Role assignment missing for new vault MSI"
    print_message "Assign role 'Key Vault Crypto Service Encryption User'"
    exit 1
  fi

  az account set -s "${OLD_VAULT_SUBSCRIPTION_NAME}"

  key_vault_id=$(az keyvault show --name "${OLD_KEY_VAULT_NAME}" \
    --query "id" | jq -r '.')

  if [ -z "$(az role assignment list \
        --scope "${key_vault_id}" \
        --role "Key Vault Crypto Service Encryption User" \
        --query "[?principalId == '${msi_principal_id}']" -o tsv)" ]; then
    print_message "Role assignment missing for new vault MSI"
    print_message "Assign role 'Key Vault Crypto Service Encryption User'"
    exit 1
  fi
  ```

- Remind the user to now announce the start of the maintenance window

  ```bash
  print_message "Please announce the start of the maintenance window."
  wait_for_any_key
  ```
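If one of the role assignment checks reports a missing assignment, it can be
granted with the Azure CLI and the check re-run. The following is a sketch
reusing the `msi_principal_id` and `key_vault_id` variables computed above;
creating the assignment requires sufficient permissions on the subscription
that holds the Key Vault.

```bash
# Grant the new cluster's managed identity access to the unseal keys
# (sketch; select the subscription of the Key Vault in question first)
az role assignment create \
  --assignee-object-id "${msi_principal_id}" \
  --assignee-principal-type ServicePrincipal \
  --role "Key Vault Crypto Service Encryption User" \
  --scope "${key_vault_id}"
```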
### 2. Create temporary migration tokens

To perform the snapshot and restore operations securely, we create short-lived
Vault tokens with limited permissions. These tokens allow access to specific
endpoints, such as snapshot creation, snapshot restoration, and leader
step-down, which the existing policies do not grant.

- Write a policy authorizing the migration operations

  ```hcl
  # Allow stepping down the Vault leader
  path "sys/step-down" {
    capabilities = ["update", "sudo"]
  }
  # Allow saving snapshots
  path "sys/storage/raft/snapshot" {
    capabilities = [ "create", "read", "update", "list" ]
  }
  # Allow restoring snapshots
  path "sys/storage/raft/snapshot-force" {
    capabilities = [ "create", "read", "update", "list" ]
  }
  ```

- Detect the index of the leader pod in each Vault instance

  ```bash
  kubectl get pod -l "vault-active=true" -o jsonpath \
    --template "{.items[0].metadata.labels.apps\.kubernetes\.io/pod-index}"
  ```

- Create a token in each Vault instance

  ```bash
  switch_context "${cluster_name}" "${vault_name}" > /dev/null
  leader_index="$(detect_vault_leader_index)"

  kubectl cp migration-policy.hcl "${vault_name}-${leader_index}":/tmp > /dev/null

  kubectl exec -it "${vault_name}-${leader_index}" -- \
    vault login "${admin_token}" > /dev/null

  kubectl exec -it "${vault_name}-${leader_index}" -- \
    sh -c "cat /tmp/migration-policy.hcl | vault policy write migration -" \
    > /dev/null

  kubectl exec -it "${vault_name}-${leader_index}" -- \
    vault token create -policy=migration -period=30m -format json \
    | jq -r '.auth.client_token'

  kubectl exec -it "${vault_name}-${leader_index}" -- \
    rm /tmp/migration-policy.hcl > /dev/null
  ```

### 3. Block access to the old Vault instance

Before creating the snapshot, it's important to prevent any writes to the old
Vault instance. This step ensures no data is changed after the snapshot and
avoids issues with lease revocation.

- Back up and delete the ingress for the old Vault instance

  ```bash
  kubectl get ing vault -o yaml > "${BACKUPS_DIR}/${OLD_VAULT_INGRESS_FILENAME}"

  kubectl delete ing vault --wait=true
  ```
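Deleting the ingress is also a natural rollback point: because the manifest was
saved first, external access to the old instance can be restored if the
migration has to be aborted. A minimal sketch; server-generated fields in the
saved manifest, such as `resourceVersion`, may need to be removed before
re-applying it.

```bash
# Restore external access to the old Vault instance from the saved manifest
kubectl apply -f "${BACKUPS_DIR}/${OLD_VAULT_INGRESS_FILENAME}"
```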
### 4. Create a snapshot of the old Vault instance

We now take a snapshot of the Vault data using the Raft backend's built-in
snapshot mechanism. After the snapshot is saved and downloaded, we immediately
disable the old Vault instance to avoid unintended behavior.

- Log in on the leader pod of the old Vault instance

  ```bash
  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
    vault login "${OLD_VAULT_ADMIN_TOKEN}"
  ```

- Create the snapshot in the leader pod

  ```bash
  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
    vault operator raft snapshot save "${snapshot_filepath}"
  ```

- Download the snapshot and remove it from the pod

  ```bash
  kubectl cp "${OLD_VAULT_NAME}-${leader_index}":"${snapshot_filepath}" \
    "${BACKUPS_DIR}/${OLD_VAULT_SNAPSHOT_FILENAME}"

  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
    rm "${snapshot_filepath}"
  ```

- Disable the old Vault instance by changing the command of the Vault
  StatefulSet to `sleep`

  ```bash
  kubectl get statefulset vault -o yaml > \
    "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}"
  cp "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}" .

  ed "${OLD_VAULT_STATEFULSET_FILENAME}" < "${vault_config_filename}"

  ed '+/seal "/' "${vault_config_filename}" < vault-config.yaml

  ed '+/seal "/' vault-config.yaml < vault-config.yaml

  ed '+/disabled \+= true/' vault-config.yaml <