[WI] Vault renew self fails #23859
Comments
Hi @ygersie, can you please give us a little more context on what your problem is? When the Vault client runs into a fatal error, meaning one that will keep being returned no matter how often the call is retried, such as a lease expiration, it stops renewing. If you provide us with more information, we might get better insight into your particular problem and find a better way to help. |
Hey @Juanadelacuesta, thanks for your response.
Exactly, in this case Nomad incorrectly determines that this is a
I've also seen something else that could be problematic. I'm running an extensive local lab setup of the HashiStack that includes multiple regions, mTLS, and Vault + Consul integration using Workload Identity. If I leave my laptop to rest overnight, in the morning a bunch of my jobs are dead and won't start anymore. Here's an example of such an alloc:
It looks like a new Vault token was acquired and a restart was executed triggering the
Which is set here: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/consul_hook.go#L28 |
Seems pretty straightforward. I've opened PR #24409 to fix it.
Unfortunately the Vault Go API doesn't actually expose the HTTP status or wrap the error type in something we could typecheck here. So that's a bigger lift to fix. I'll audit the known errors at least and will update #24409 with any others I find. |
That's because Nomad clients are supposed to be running as root, which can write to that just fine. But seems harmless to allow whatever user Nomad is running as to write to it. I'll get a quickie PR up for that as well. |
When a task restarts, the Nomad client may need to rewrite the Consul token, but it's created with permissions that prevent a non-root agent from writing to it. While Nomad clients should be run as root (currently), it's harmless to allow whatever user the Nomad agent is running as to be able to write to it, and that's one less barrier to rootless Nomad. Ref: #23859 (comment)
When a Vault lease expires, it's revoked on the server and cannot be renewed, so this error should be treated as fatal. Fixes: #23859
When a Vault lease expires, it's revoked on the server and cannot be renewed, so this error should be treated as fatal. The errors we get aren't wrapped by the Vault SDK, so unfortunately we have to read the error messages and can't easily enumerate non-fatal error messages (which might be bubbling up from the stdlib). I've audited the errors currently used and have documented their source. Ref https://github.com/hashicorp/vault/blob/52ba156d47da170bf40471fe57d72522030bdc7e/vault/expiration.go#L1327 Fixes: #23859
With workload identity, I found myself in a scenario where Vault token renewal fails indefinitely. It looks like the check for whether this is a fatal error doesn't work correctly, because the error differs from the ones in this list:
nomad/client/vaultclient/vaultclient.go
Line 439 in 36522ec
In my setup this will never resolve, I assume because of: