-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENG-2740 Use elapsed time for SSH retry timeout #124
Changes from 3 commits
8eadde8
b2bcf13
47767c7
7099906
46513f7
1b64506
d79cbba
9b74b7e
334cd1e
dbbc549
d95ef7c
9c4ab9f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -22,11 +22,10 @@ | |||||
AwsSshRoleRequest, | ||||||
} from "./types"; | ||||||
|
||||||
/** Maximum number of attempts to start an SSH session | ||||||
* | ||||||
* Each attempt consumes ~ 1 s. | ||||||
/** | ||||||
* It can take up to 1 minute for access to propagate on AWS, so set the time limit to 10 minutes. | ||||||
*/ | ||||||
const MAX_SSH_RETRIES = 30; | ||||||
const TIME_LIMIT_MS = 10 * 60 * 1000; | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Seems already long enough. When I analyzed this previously, 99% + of provisioning finished within 8 seconds. Has that changed? |
||||||
|
||||||
/** The name of the SessionManager port forwarding document. This document is managed by AWS. */ | ||||||
const START_SSH_SESSION_DOCUMENT_NAME = "AWS-StartSSHSession"; | ||||||
|
@@ -84,7 +83,7 @@ | |||||
|
||||||
friendlyName: "AWS", | ||||||
|
||||||
maxRetries: MAX_SSH_RETRIES, | ||||||
|
||||||
preTestAccessPropagationArgs: () => undefined, | ||||||
|
||||||
|
@@ -130,6 +129,8 @@ | |||||
access: "idc", | ||||||
}; | ||||||
}, | ||||||
|
||||||
timeLimit: TIME_LIMIT_MS, | ||||||
|
||||||
toCliRequest: async (request) => ({ ...request, cliLocalData: undefined }), | ||||||
|
||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,11 +14,10 @@ | |
import { importSshKey } from "./ssh-key"; | ||
import { GcpSshPermissionSpec, GcpSshRequest } from "./types"; | ||
|
||
/** Maximum number of attempts to start an SSH session | ||
* | ||
* The length of each attempt varies based on the type of error from a few seconds to < 1s | ||
/** | ||
* It typically takes < 1 minute for access to propagate on GCP, so set the time limit to 2 minutes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would be good to understand the CDF of successful attempts by time. |
||
*/ | ||
const MAX_SSH_RETRIES = 120; | ||
const TIME_LIMIT_MS = 2 * 60 * 1000; | ||
|
||
/** | ||
* There are 7 cases of unprovisioned access in Google Cloud. | ||
|
@@ -74,7 +73,7 @@ | |
|
||
loginRequiredPattern: /You do not currently have an active account selected/, | ||
|
||
maxRetries: MAX_SSH_RETRIES, | ||
|
||
preTestAccessPropagationArgs: (cmdArgs) => { | ||
if (isSudoCommand(cmdArgs)) { | ||
|
@@ -117,6 +116,8 @@ | |
type: "gcloud", | ||
}; | ||
}, | ||
|
||
timeLimit: TIME_LIMIT_MS, | ||
|
||
unprovisionedAccessPatterns, | ||
|
||
|
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -107,7 +107,7 @@ type SpawnSshNodeOptions = { | |||||||
credential?: AwsCredentials; | ||||||||
command: string; | ||||||||
args: string[]; | ||||||||
attemptsRemaining: number; | ||||||||
endTime: number; | ||||||||
abortController?: AbortController; | ||||||||
detached?: boolean; | ||||||||
stdio: [StdioNull, StdioNull, StdioPipe]; | ||||||||
|
@@ -127,14 +127,11 @@ async function spawnSshNode( | |||||||
return new Promise((resolve, reject) => { | ||||||||
const provider = SSH_PROVIDERS[options.provider]; | ||||||||
|
||||||||
const attemptsRemaining = options.attemptsRemaining; | ||||||||
if (options.debug) { | ||||||||
const gerund = options.isAccessPropagationPreTest | ||||||||
? "Pre-testing" | ||||||||
: "Trying"; | ||||||||
print2( | ||||||||
`Waiting for access to propagate. ${gerund} SSH session... (remaining attempts: ${attemptsRemaining})` | ||||||||
); | ||||||||
print2(`Waiting for access to propagate. ${gerund} SSH session...)`); | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems like we should display
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||||
} | ||||||||
|
||||||||
const child = spawnChildProcess( | ||||||||
|
@@ -156,23 +153,17 @@ async function spawnSshNode( | |||||||
// In the case of ephemeral AccessDenied exceptions due to unpropagated | ||||||||
// permissions, continually retry access until success | ||||||||
if (!isAccessPropagated()) { | ||||||||
if (attemptsRemaining <= 0) { | ||||||||
if (options.endTime < Date.now()) { | ||||||||
reject( | ||||||||
`Access did not propagate through ${provider.friendlyName} before max retry attempts were exceeded. Please contact support@p0.dev for assistance.` | ||||||||
`Access did not propagate through ${provider.friendlyName} in time. Please contact support@p0.dev for assistance.` | ||||||||
); | ||||||||
return; | ||||||||
} | ||||||||
|
||||||||
delay(RETRY_DELAY_MS) | ||||||||
.then(() => | ||||||||
spawnSshNode({ | ||||||||
...options, | ||||||||
attemptsRemaining: attemptsRemaining - 1, | ||||||||
}) | ||||||||
) | ||||||||
.then(() => spawnSshNode(options)) | ||||||||
.then((code) => resolve(code)) | ||||||||
.catch(reject); | ||||||||
|
||||||||
return; | ||||||||
} else if (isLoginException()) { | ||||||||
reject( | ||||||||
|
@@ -335,7 +326,7 @@ const preTestAccessPropagationIfNeeded = async < | |||||||
stdio: ["inherit", "inherit", "pipe"], | ||||||||
debug: cmdArgs.debug, | ||||||||
provider: request.type, | ||||||||
attemptsRemaining: sshProvider.maxRetries, | ||||||||
endTime: Date.now() + sshProvider.timeLimit, | ||||||||
isAccessPropagationPreTest: true, | ||||||||
}); | ||||||||
} | ||||||||
|
@@ -398,6 +389,6 @@ export const sshOrScp = async (args: { | |||||||
stdio: ["inherit", "inherit", "pipe"], | ||||||||
debug: cmdArgs.debug, | ||||||||
provider: request.type, | ||||||||
attemptsRemaining: sshProvider.maxRetries, | ||||||||
endTime: Date.now() + sshProvider.timeLimit, | ||||||||
}); | ||||||||
}; |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -80,6 +80,8 @@ export type SshProvider< | |||||||
|
||||||||
/** Unwraps this provider's types */ | ||||||||
requestToSsh: (request: CliPermissionSpec<PR, O>) => SR; | ||||||||
|
||||||||
timeLimit: number; | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should have some jsdoc describing what it does. I might also suggest giving it a more descriptive name:
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||||
|
||||||||
/** Regex matches for error strings indicating that the provider has not yet fully provisioned node acces */ | ||||||||
unprovisionedAccessPatterns: readonly { | ||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation to change from 30 seconds to 10 minutes?
I'd be surprised if any end user is willing to wait 10 minutes.
Also, this should just be a normal comment as it's explaining why the value is chosen, rather than what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is based on existing comments in ssh/index.ts:
AWS takes about 8 minutes, GCP takes under 1 minute to fully resolve access after it is granted.
During this time, calls to
aws ssm start-session
/gcloud compute start-iap-tunnel
will fail randomly with an various error messages.
See also this PR, which also mentions 10 minutes: #19
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gergas3 @nbrahms I can change the timeout if you want. Just let me know what to set it to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it a typo in #19 then? I didn't observe such a long propagation time for AWS. 30 seconds makes more sense.
GCP takes significantly longer, it's mostly under 1 minute. But my guess would be it's not >99th percentile. We can continue with the 2m in this PR imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to 30 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full info is that it takes AWS 30 minutes to get to 100% success rate. But we don't need to wait this long. We only need to wait until the chance that a single SSH attempt succeeds is high.
I'm not sure what we've changed in terms of delays, as those will affect this. But, when we were hammering it repeatedly in a loop, that took about 12 seconds.