Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WI] Vault renew self fails #23859

Closed
ygersie opened this issue Aug 23, 2024 · 4 comments · Fixed by #24409
Closed

[WI] Vault renew self fails #23859

ygersie opened this issue Aug 23, 2024 · 4 comments · Fixed by #24409
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/vault type/bug
Milestone

Comments

@ygersie
Copy link
Contributor

ygersie commented Aug 23, 2024

With workload identity I found myself in a scenario where Vault token renewal fails indefinitely. It looks like determining if this is a fatal error doesn't work correctly as the error is different than what is in this list:

strings.Contains(renewalErr.Error(), "invalid lease ID") ||

    2024-08-22T18:45:35.294+0200 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default
  error=
  | failed to renew the vault token: Error making API request.
  |
  | URL: PUT http://localhost:8200/v1/auth/token/renew-self
  | Code: 400. Errors:
  |
  | * lease expired

In my setup this will never resolve I assume because of:

  template {
    vault_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "5m"
    }
  }
@Juanadelacuesta
Copy link
Member

Hi @ygersie, can you please give us a little more context on what your problem is? When the vault client runs into a fatal error, something that will keep on returning an error in time, like a lease expiration, it will stop renewing it. If you provide us with more information, we might be able to have a better insight into your particular problem, and find a better way to help.

@Juanadelacuesta Juanadelacuesta self-assigned this Sep 4, 2024
@Juanadelacuesta Juanadelacuesta moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Sep 4, 2024
@ygersie
Copy link
Contributor Author

ygersie commented Sep 9, 2024

Hey @Juanadelacuesta, thanks for your response.

When the vault client runs into a fatal error, something that will keep on returning an error in time, like a lease expiration, it will stop renewing it.

Exactly, in this case Nomad incorrectly determines that this is a non-fatal error and keeps retrying. There may be other errors that are boiling up now with the switch to WI that aren't caught in the error checking right now. I wonder if it wouldn't be better to also check the Vault response status code, if it's in the 4XX range it by definition should be fatal, right?

I've also seen something else that could be problematic. I'm running an extensive local lab setup of hashistack that includes multiple regions, mTLS and Vault + Consul integration using Workload Identity. If I leave my laptop to rest overnight in the morning a bunch of my jobs are dead and won't start anymore. Here's an example of such an alloc:

ID                     = 97415db2-cbb0-fb20-b907-54bb412376e3
Eval ID                = e9bf10f1
Name                   = teleport-proxy.teleport-proxy[0]
Node ID                = e7cb90f9
Node Name              = agent-1
Job ID                 = teleport-proxy
Job Version            = 0
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1h9m ago
Modified               = 28m2s ago
Reschedule Eligibility = 31m54s from now

Allocation Addresses (mode = "host"):
Label   Dynamic  Address
*proxy  yes      192.168.1.132:3080
*diag   yes      192.168.1.132:23153

Task "teleport-proxy" is "dead"
Task Resources:
CPU         Memory          Disk     Addresses
18/100 MHz  64 MiB/512 MiB  300 MiB

Task Events:
Started At     = 2024-09-09T07:43:24Z
Finished At    = 2024-09-09T08:24:39Z
Total Restarts = 1
Last Restart   = 2024-09-09T10:24:39+02:00

Recent Events:
Time                       Type              Description
2024-09-09T10:24:39+02:00  Not Restarting    Error was unrecoverable
2024-09-09T10:24:39+02:00  Task hook failed  consul_task: 1 error occurred:
	* failed to write Consul SI token: open /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token: permission denied
2024-09-09T10:24:39+02:00  Restarting        Task restarting in 0s
2024-09-09T10:24:39+02:00  Terminated        Exit Code: 0
2024-09-09T10:24:39+02:00  Restart Signaled  Vault: new Vault token acquired
2024-09-09T09:43:24+02:00  Started           Task started by client
2024-09-09T09:43:23+02:00  Task Setup        Building Task Directory
2024-09-09T09:43:21+02:00  Received          Task received by client

It looks like a new Vault token was acquired and a restart was executed triggering the Prestart hooks, however, the consul_token was created with permissions that does not allow writes:

-r--r-----@ 1 ygersie  staff  36 Sep  9 09:43 /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token

Which is set here: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/consul_hook.go#L28

@Juanadelacuesta Juanadelacuesta removed their assignment Oct 23, 2024
@tgross tgross self-assigned this Nov 8, 2024
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Nov 8, 2024
@tgross
Copy link
Member

tgross commented Nov 8, 2024

Exactly, in this case Nomad incorrectly determines that this is a non-fatal error and keeps retrying.

Seems pretty straightforward. I've open a PR #24409 to fix.

There may be other errors that are boiling up now with the switch to WI that aren't caught in the error checking right now. I wonder if it wouldn't be better to also check the Vault response status code, if it's in the 4XX range it by definition should be fatal, right?

Unfortunately the Vault Go API doesn't actually expose the HTTP status or wrap the error type in something we could typecheck here. So that's a bigger lift to fix. I'll audit the known errors at least and will update #24409 with any others I find.

@tgross tgross added this to the 1.9.x milestone Nov 8, 2024
@tgross
Copy link
Member

tgross commented Nov 8, 2024

however, the consul_token was created with permissions that does not allow writes:

That's because Nomad clients are supposed to be running as root, which can write to that just fine. But seems harmless to allow whatever user Nomad is running as to write to it. I'll get a quickie PR up for that as well.

tgross added a commit that referenced this issue Nov 8, 2024
When a task restarts, the Nomad client may need to rewrite the Consul token, but
it's created with permissions that prevent a non-root agent from writing to
it. While Nomad clients should be run as root (currently), it's harmless to
allow whatever user the Nomad agent is running as to be able to write to it, and
that's one less barrier to rootless Nomad.

Ref: #23859 (comment)
tgross added a commit that referenced this issue Nov 8, 2024
When a Vault lease expires, it's revoked on the server and cannot be removed, so
this error should be treated as fatal.

Fixes: #23859
tgross added a commit that referenced this issue Nov 15, 2024
When a Vault lease expires, it's revoked on the server and cannot be removed, so
this error should be treated as fatal.

The errors we get aren't wrapped by the Vault SDK, so unfortunately we have to
read the error messages and can't easily enumerate non-fatal error
messages (which might be bubbling up from the stdlib). I've audited the errors
currently used and have documented their source.

Ref https://github.com/hashicorp/vault/blob/52ba156d47da170bf40471fe57d72522030bdc7e/vault/expiration.go#L1327
Fixes: #23859
@tgross tgross closed this as completed in 6be9a50 Nov 18, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Nov 18, 2024
tgross added a commit that referenced this issue Nov 19, 2024
When a task restarts, the Nomad client may need to rewrite the Consul token, but
it's created with permissions that prevent a non-root agent from writing to
it. While Nomad clients should be run as root (currently), it's harmless to
allow whatever user the Nomad agent is running as to be able to write to it, and
that's one less barrier to rootless Nomad.

Ref: #23859 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/vault type/bug
Projects
Development

Successfully merging a pull request may close this issue.

3 participants