docs: update key management docs for keyring-in-Raft #24026

Merged: 3 commits, Sep 25, 2024
79 changes: 53 additions & 26 deletions website/content/docs/operations/key-management.mdx
@@ -7,13 +7,20 @@ description: Learn about the key management in Nomad.
# Key Management

Nomad servers maintain an encryption keyring used to encrypt [Variables][] and
sign task [workload identities][]. The servers encrypt these data encryption
keys (DEK) and store the wrapped keys in Raft.

The key encryption key (KEK) used to encrypt the DEK is controlled by the
[`keyring`][] provider. When using an external KMS or Vault transit encryption
provider, the KEK is securely stored outside of Nomad. For the default AEAD
provider, the KEK is stored in cleartext in Raft.
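This KEK/DEK arrangement is standard envelope encryption: the DEK encrypts the data, and the KEK wraps the DEK so that only the wrapped blob needs to be stored. The following Go sketch illustrates the idea with AES-256-GCM from the standard library; it is an illustrative example only, not Nomad's actual implementation.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts plaintext with key using AES-256-GCM, prepending the nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal, splitting the nonce off the front of the sealed blob.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	kek := make([]byte, 32) // key encryption key, held by the keyring provider
	dek := make([]byte, 32) // data encryption key used for Variables
	if _, err := rand.Read(kek); err != nil {
		panic(err)
	}
	if _, err := rand.Read(dek); err != nil {
		panic(err)
	}

	// Wrap the DEK with the KEK; only the wrapped blob is persisted.
	wrapped, err := seal(kek, dek)
	if err != nil {
		panic(err)
	}

	// Unwrap the DEK before using it to serve requests.
	unwrapped, err := open(kek, wrapped)
	if err != nil {
		panic(err)
	}
	fmt.Println("roundtrip ok:", string(unwrapped) == string(dek))
}
```

With an external KMS or Vault transit provider, the `seal`/`open` calls for the KEK happen outside the process, so no cleartext KEK ever touches local storage.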

<Note>

Users of the default AEAD provider should be aware that storing the Key
Encryption Key (KEK) in cleartext in Raft may expose data in the event of a
breach.

</Note>

Under normal operations the keyring is entirely managed by Nomad, but this
section provides administrators additional context around key replication and
@@ -34,36 +41,56 @@
operator root keyring rotate -full`][]. A new "active" key will be created and
re-encrypt all variables with the new key. As each key's variables are
re-encrypted with the new key, the old key will be marked as "deprecated".
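For example, a full rotation is triggered from the CLI (this command is documented above and requires a reachable Nomad server with appropriate ACL permissions):

```shell
# Create a new "active" root key and re-encrypt all variables with it;
# each old key is marked "deprecated" once its variables are re-encrypted.
nomad operator root keyring rotate -full
```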

## Key Decryption

When a leader is elected, the leader creates the keyring if it does not already
exist. When a key is added, the new wrapped key material is replicated via
Raft. As each server replicates the new key, the server starts a task to decrypt
the key material. Until this task completes, the server is not able to serve
requests that require this key.
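Conceptually, each server gates requests on an asynchronous decryption task, roughly as in the Go sketch below. This is a simplified illustration of the pattern, not Nomad's actual code; the unwrap step is a stand-in.

```go
package main

import (
	"fmt"
	"sync"
)

// keyState tracks a replicated key whose material is decrypted in the background.
type keyState struct {
	once  sync.Once
	ready chan struct{} // closed once decryption completes
	dek   []byte
}

func newKeyState() *keyState {
	return &keyState{ready: make(chan struct{})}
}

// decryptAsync starts a background task that unwraps the replicated key
// material. Requests that need the key block until it finishes.
func (k *keyState) decryptAsync(wrapped []byte) {
	go k.once.Do(func() {
		// Stand-in for the real KEK unwrap of the wrapped DEK.
		k.dek = append([]byte(nil), wrapped...)
		close(k.ready)
	})
}

// get blocks until the key material is available, then returns it.
func (k *keyState) get() []byte {
	<-k.ready
	return k.dek
}

func main() {
	ks := newKeyState()
	ks.decryptAsync([]byte("wrapped-dek"))
	// A request that requires this key waits here until decryption is done.
	fmt.Printf("%s\n", ks.get())
}
```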

## Key Redaction in Raft Snapshots

The default AEAD `keyring` configuration stores the KEK in Raft. Raft snapshots
contain the cleartext KEK. The `nomad operator snapshot save` command has a
`-redact` option that removes the key material when creating a snapshot. The
`nomad operator snapshot redact` command removes key material from an
existing snapshot.

Redacting key material is not required when using an external KMS.
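For example (both subcommands are described above; the snapshot filename is illustrative, and the commands require a reachable Nomad server):

```shell
# Save a snapshot with the cleartext KEK removed
nomad operator snapshot save -redact backup.snap

# Remove key material from a snapshot that was taken without -redact
nomad operator snapshot redact backup.snap
```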

## Legacy Keystore

Versions of Nomad prior to 1.9.0 stored only key metadata in Raft, but the
encryption key material was stored in a separate file in the `keystore`
subdirectory of the Nomad [data directory][]. These files have the extension
`.nks.json`. The key material in each file is wrapped in a unique key encryption
key (KEK) that is not shared between servers.

Each server runs a key replication process that watches for changes to the state
store and fetches the key material from the leader asynchronously, falling
back to retrieving from other servers in the case where a key is written
immediately before a leader election. Nomad 1.9.0 and above can replicate keys
from older servers.

However, replicating keys from older servers means that to restore an older
cluster from a snapshot, you must also provide the keystore directory with the
`.nks.json` key files on at least one server. The `.nks.json` key files are
unique per server, but only one server's key files are needed to recover the
cluster. Operators should continue to include these files in their
organization's backup and recovery strategy until the cluster is fully upgraded
to Nomad 1.9.0 and at least one [`root_key_gc_interval`][] has passed.

If you are recovering an older Raft snapshot onto a new cluster without running
workloads, you can skip restoring the keyring and run [`nomad operator root
keyring rotate`][] once the servers have joined.



[variables]: /nomad/docs/concepts/variables
[workload identities]: /nomad/docs/concepts/workload-identity
[data directory]: /nomad/docs/configuration#data_dir
[`keyring`]: /nomad/docs/configuration/keyring
[`nomad operator root keyring rotate -full`]: /nomad/docs/commands/operator/root/keyring-rotate
[`nomad operator root keyring rotate`]: /nomad/docs/commands/operator/root/keyring-rotate
[`root_key_gc_interval`]: /nomad/docs/configuration/server#root_key_gc_interval
33 changes: 26 additions & 7 deletions website/content/docs/upgrade/upgrade-specific.mdx
@@ -15,26 +15,44 @@
used to document those details separately from the standard upgrade flow.

## Nomad 1.9.0

#### Dropped support for older clients

Nomad 1.9.0 removes support for Nomad client agents older than 1.6.0. Older
nodes fail heartbeats. Nomad servers mark the workloads on those nodes
as lost and reschedule them normally according to the job's [reschedule][]
block.

#### Keyring In Raft

Nomad 1.9.0 stores keys used for signing Workload Identity and encrypting
Variables in Raft, instead of storing key material in the external
keystore. When using external KMS or Vault transit encryption for the
[`keyring`][] provider, the key encryption key (KEK) is stored outside of Nomad
and no cleartext key material exists on disk. When using the default AEAD
provider, the key encryption key (KEK) is stored in Raft alongside the encrypted
data encryption keys (DEK).

Nomad automatically migrates the key storage for all key material on the
first [`root_key_gc_interval`][] after all servers are upgraded to 1.9.0. The
existing on-disk keystore is required to restore servers from older snapshots,
so you should continue to back up the on-disk keystore until you no longer need
those older snapshots.

#### Support for HCLv1 removed

Nomad 1.9.0 no longer supports the HCLv1 format for job specifications. Using
the `-hcl1` option for the `job run`, `job plan`, and `job validate` commands
will no longer work.

## Nomad 1.8.4
> tgross (Member Author): Fixed the order here: the default Docker image change already shipped.


#### Default Docker `infra_image` changed

Due to the deprecation of the third-party `gcr.io` registry, the default Docker
[`infra_image`][] is now `registry.k8s.io/pause-<arch>:3.3`. If you do not
override the default, clients using the `docker` driver will make outbound
requests to the new registry.
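To override the default instead, set [`infra_image`][] in the client's `docker` plugin configuration. A sketch, using the documented option name; the exact image tag and architecture suffix here are illustrative:

```hcl
plugin "docker" {
  config {
    # Pin the pause image explicitly; adjust the arch suffix and tag as needed.
    infra_image = "registry.k8s.io/pause-amd64:3.3"
  }
}
```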

## Nomad 1.8.3

#### Nomad keyring rotation
@@ -2174,6 +2192,7 @@
deleted and then Nomad 0.3.0 can be launched.
[preemption]: /nomad/docs/concepts/scheduling/preemption
[proxy_concurrency]: /nomad/docs/job-specification/sidecar_task#proxy_concurrency
[reserved]: /nomad/docs/configuration/client#reserved-parameters
[`root_key_gc_interval`]: /nomad/docs/configuration/server#root_key_gc_interval
[task-config]: /nomad/docs/job-specification/task#config
[template_gid]: /nomad/docs/job-specification/template#gid
[template_uid]: /nomad/docs/job-specification/template#uid