-
Notifications
You must be signed in to change notification settings - Fork 14
feat: Allow aws-vault to safely be run in parallel #291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b4be5e4 to
96a6c03
Compare
|
Heads up I've seen some issues since running with this where in rare cases aws-vault will error with something along the lines of 'can't update keychain item it already exists'. I'm going to drop this back to draft and let it bake a bit more. |
|
Hey @timvisher-dd , no worries let me know when it's ready for a review. |
96a6c03 to
72c41a4
Compare
|
I haven't been able to reproduce any issues yet. I think it may have been a failure in my small build script that creates a throwaway keychain to start from scratch. |
No worries, we can mark this as Draft while you investigate, if you want to look more into it. |
|
Ah hah! Found it. It's a legit bug where if more than one aws-vault instance is trying to refresh creds for a single profile they fight over setting the keychain item. I can figure that out. Dropping this back to Draft. |
fa67d8c to
18e63b7
Compare
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Codex <codex@openai.com>
18e63b7 to
be4e082
Compare
|
I think I've found another area where parallelism needs to be coordinated: The keychain Get/Set paths. I've never noticed this before because I use the |
Could you talk more about your use case here and what is the problem you're trying to solve? Am I reading this correct you're getting locking issues on the keychain backend ro something else? |
Closes 99designs#1275 (abandoned)
I run aws-vault as a credentials_provider for SSO backed AWS profiles. This is often in parallel contexts of dozens to hundreds of invocations at once.
In that context when credentials needed to be refreshed, aws-vault would open an unconstrained amount of browser tabs in parallel, usually triggering HTTP 500 responses on the AWS side and failing to acquire creds.
To mitigate this I developed a small wrapper around aws-vault that would use a Bash dir lock (Wooledge BashFAQ 45) when there was a possibility that the credentials would need to be refreshed. This worked but it was also quite slow as it would lock the entire aws-vault execution rather than just the step that might launch a browser tab. The dir locking strategy was also sensitive to being killed by process managers like terraform and so had to do things like die voluntarily after magic amounts of seconds to avoid being SIGKILLed and leaving a stale lock around.
This changeset introduces a cross-platform process lock that is tolerant of all forms of process death (even SIGKILL), and wires it into the SSO/OIDC browser flow so only one process refreshes at a time. This keeps parallel invocations fast while avoiding the browser storm that triggers HTTP 500s.
While stress testing I also found that the session cache could race across processes, leading to "item already exists" errors from the keychain. This branch adds a session cache refresh lock so only one process writes back to the same cache entry at once, while others wait and re-check.
Because the parallelism is now safe and fast, I also hit SSO OIDC rate limits (429). This change adds Retry-After support with jittered backoff so the SSO token exchange is resilient under heavy load.
In a stress test across 646 AWS profiles in 9 SSO Directories I'm able to retrieve creds successfully in ~36 seconds on my box. This fails on upstream HEAD because the browser storm overwhelms the IAM/SSO APIs and the keychain cache update races.
Testing
Unit tests have been added for the SSO/OIDC lock, session cache lock, and OIDC retry logic. I also used some local integration test scripts that clear out the creds and run
aws-vault export --format=jsonacross different sets of my profiles and assert that it succeeds. Finally I've converted my local tooling to use this fork of aws-vault and have been exercising it there without issue.Colofon
I did not write any of this code. Codex did. That said I have read through it in some detail and it looks reasonable to me.
Co-authored-by: Codex noreply@openai.com