ENG-2740 Use elapsed time for SSH retry timeout #124

fabgo · 2024-10-01T23:12:59Z

Use elapsed time to determine the SSH retry timeout instead of the number of retries.
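For readers skimming the thread, here is a minimal sketch of what a deadline-based retry loop can look like, as opposed to counting attempts. It is not the code from this PR: `retryUntilDeadline`, `AttemptResult`, and `Clock` are hypothetical names introduced for illustration, and the clock is injectable only so the loop can be exercised without real waiting.

```typescript
/** Outcome of one attempt; `retryable` marks errors expected while access propagates. */
type AttemptResult = { ok: true } | { ok: false; retryable: boolean };

/** Injectable time source (hypothetical), so the loop can run against a fake clock. */
type Clock = { now: () => number; sleep: (ms: number) => Promise<void> };

const systemClock: Clock = {
  now: () => Date.now(),
  sleep: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
};

/**
 * Retries `attempt` until it succeeds, fails with a non-retryable error,
 * or the time limit is used up. Resolves to true only on success.
 */
async function retryUntilDeadline(
  attempt: () => Promise<AttemptResult>,
  timeLimitMs: number,
  delayMs: number,
  clock: Clock = systemClock
): Promise<boolean> {
  // The deadline is fixed once, so slow attempts eat into the same budget as fast ones.
  const endTime = clock.now() + timeLimitMs;
  for (;;) {
    const result = await attempt();
    if (result.ok) return true;
    if (!result.retryable) return false;
    // Give up if the next attempt would start after the deadline.
    if (clock.now() + delayMs > endTime) return false;
    await clock.sleep(delayMs);
  }
}
```

With a retry count, the total wait depends on how long each attempt happens to take; with a deadline, the upper bound is roughly `timeLimitMs` plus the duration of the final attempt.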

nbrahms

My biggest concern is why we changed from a 30 second to 10 minute (!) timeout for AWS access.

nbrahms · 2024-10-01T23:48:34Z

src/plugins/aws/ssh.ts

- *
- * Each attempt consumes ~ 1 s.
+/**
+ * It can take up to 1 minute for access to propagate on AWS, so set the time limit to 10 minutes.


What's the motivation to change from 30 seconds to 10 minutes?

I'd be surprised if any end user is willing to wait 10 minutes.

Suggested change

- * It can take up to 1 minute for access to propagate on AWS, so set the time limit to 10 minutes.
+// It takes around 8 seconds for access to propagate on AWS, so allow 30 seconds as a safe ceiling.

Also, this should just be a normal comment as it's explaining why the value is chosen, rather than what it does.

The timeout is based on existing comments in ssh/index.ts:

AWS takes about 8 minutes, GCP takes under 1 minute to fully resolve access after it is granted.
During this time, calls to aws ssm start-session / gcloud compute start-iap-tunnel
will fail randomly with an various error messages.

See also this PR, which also mentions 10 minutes: #19

@gergas3 @nbrahms I can change the timeout if you want. Just let me know what to set it to.

Was it a typo in #19 then? I didn't observe such a long propagation time for AWS. 30 seconds makes more sense.

GCP takes significantly longer; it's mostly under 1 minute. But my guess would be that's not above the 99th percentile. We can continue with the 2m in this PR, imo.

Changed it to 30 seconds.

The full info is that it takes AWS 30 minutes to get to 100% success rate. But we don't need to wait this long. We only need to wait until the chance that a single SSH attempt succeeds is high.

I'm not sure what we've changed in terms of delays, as those will affect this. But, when we were hammering it repeatedly in a loop, that took about 12 seconds.

nbrahms · 2024-10-01T23:49:25Z

src/plugins/aws/ssh.ts

 */
-const MAX_SSH_RETRIES = 30;
+const TIME_LIMIT_MS = 10 * 60 * 1000;


Suggested change

-const TIME_LIMIT_MS = 10 * 60 * 1000;
+const TIME_LIMIT_MS = 60 * 1000;

Seems already long enough.

When I analyzed this previously, 99% + of provisioning finished within 8 seconds. Has that changed?

nbrahms · 2024-10-01T23:50:06Z

src/plugins/google/ssh.ts

- *
- * The length of each attempt varies based on the type of error from a few seconds to < 1s
+/**
+ * It typically takes < 1 minute for access to propagate on GCP, so set the time limit to 2 minutes.


Would be good to understand the CDF of successful attempts by time.
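One way to get at that CDF, sketched under assumptions: given a list of recorded times-to-first-success, the empirical CDF and a nearest-rank percentile are a few lines each. The helper names are hypothetical, and the sample values in the usage lines are placeholders chosen for illustration, not measurements from AWS or GCP.

```typescript
/** Fraction of samples that succeeded within `tMs` (empirical CDF evaluated at tMs). */
function empiricalCdf(samplesMs: number[], tMs: number): number {
  if (samplesMs.length === 0) return 0;
  const within = samplesMs.filter((s) => s <= tMs).length;
  return within / samplesMs.length;
}

/** Nearest-rank percentile: smallest sample with at least fraction `p` of samples at or below it. */
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Clamp the rank into [1, n] so p = 0 and p = 1 are both valid inputs.
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil(p * sorted.length)));
  return sorted[rank - 1] as number;
}

// Placeholder samples (ms until the first successful attempt); NOT real data.
const placeholderSamplesMs = [2_000, 4_000, 6_000, 8_000, 12_000];
const fractionWithin8s = empiricalCdf(placeholderSamplesMs, 8_000);
const p99Ms = percentile(placeholderSamplesMs, 0.99);
```

A timeout could then be chosen as a high percentile plus some margin, rather than from the time needed to reach a 100% success rate.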

nbrahms · 2024-10-01T23:51:40Z

src/plugins/ssh/index.ts

-      print2(
-        `Waiting for access to propagate. ${gerund} SSH session... (remaining attempts: ${attemptsRemaining})`
-      );
+      print2(`Waiting for access to propagate. ${gerund} SSH session...)`);


Seems like we should display the remaining time:

Suggested change

-print2(`Waiting for access to propagate. ${gerund} SSH session...)`);
+const remainingS = ((endTime - Date.now()) / 1e3).toFixed(1);
+print2(`Waiting for access to propagate. ${gerund} SSH session... (will wait up to ${remainingS} seconds)`);
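A small note on the remaining-time computation: if the message is printed after the deadline has already passed, `endTime - Date.now()` is negative and the user would see a negative number of seconds. A minimal sketch of a clamped variant follows; `formatRemainingSeconds` is a hypothetical helper, not something in the codebase.

```typescript
/**
 * Formats the time left until `endTime` (epoch ms) as seconds with one decimal,
 * clamped at zero so an expired deadline never prints a negative value.
 */
function formatRemainingSeconds(endTime: number, now: number = Date.now()): string {
  const remainingMs = Math.max(0, endTime - now);
  return (remainingMs / 1e3).toFixed(1);
}
```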

nbrahms · 2024-10-01T23:56:55Z

src/types/ssh.ts

@@ -80,6 +80,8 @@ export type SshProvider<

  /** Unwraps this provider's types */
  requestToSsh: (request: CliPermissionSpec<PR, O>) => SR;
+
+  timeLimit: number;


This should have some jsdoc describing what it does. I might also suggest giving it a more descriptive name:

Suggested change

-timeLimit: number;
+/** Amount of time, in ms, to wait between granting access and giving up on attempting an SSH connection */
+propagationTimeoutMs: number;


gergas3

With this change does this check in plugins/ssh/index.ts:

    if (
      match &&
      Date.now() <=
        beforeStart + (match.validationWindowMs || DEFAULT_VALIDATION_WINDOW_MS)
    ) {
      isEphemeralAccessDeniedException = true;
    }

still make sense? Do we want to have a timeout on individual subprocesses and consider the error non-ephemeral if we read it off the stderr too late?

I think it can be removed or set to a larger value because we have an overall cap on the time.

gergas3 · 2024-10-03T01:11:44Z

src/plugins/ssh/index.ts

@@ -398,6 +394,6 @@ export const sshOrScp = async (args: {
    stdio: ["inherit", "inherit", "pipe"],
    debug: cmdArgs.debug,
    provider: request.type,
-    attemptsRemaining: sshProvider.maxRetries,
+    endTime: Date.now() + sshProvider.propagationTimeoutMs,


To wait at most propagationTimeoutMs globally, we would have to compute the endTime once before calling preTestAccessPropagationIfNeeded (above), and pass the same endTime to spawnSshNode (here).

As it stands, for GCP sudo access the total propagationTimeoutMs is 2m * 2.
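To illustrate the shared-deadline idea, here is a minimal sketch in which the deadline is computed once and handed to every phase. The names `DeadlinePhase` and `runPhasesWithSharedDeadline` are hypothetical; the phases stand in for the pre-test and the SSH spawn, whose real signatures are not shown in this thread.

```typescript
/** A phase retries internally and is expected to stop once `endTime` (epoch ms) has passed. */
type DeadlinePhase = (endTime: number) => Promise<void>;

/**
 * Runs the phases in order against one shared deadline and returns that deadline.
 * `now` is injectable only so the behaviour can be checked without real waiting.
 */
async function runPhasesWithSharedDeadline(
  propagationTimeoutMs: number,
  phases: DeadlinePhase[],
  now: () => number = () => Date.now()
): Promise<number> {
  // Computed once: time spent in an early phase is deducted from the later ones.
  const endTime = now() + propagationTimeoutMs;
  for (const phase of phases) {
    await phase(endTime);
  }
  return endTime;
}
```

With one deadline per phase, the worst case is the timeout multiplied by the number of phases; with a shared deadline, it stays at roughly one timeout overall.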

ENG-2740 Use elapsed time for SSH retry timeout

8eadde8

Use elapsed time to determine the SSH retry timeout instead of the number of retries.

fabgo requested a review from gergas3 October 1, 2024 23:13

Revert commit of .vscode/launch.json

b2bcf13

nbrahms reviewed Oct 1, 2024


Merge branch 'main' into fabian/eng-2740-ssh-use-elapsed-time

47767c7

nbrahms reviewed Oct 1, 2024


fabgo added 3 commits October 1, 2024 19:59

Merge branch 'main' into fabian/eng-2740-ssh-use-elapsed-time

7099906

Merge branch 'fabian/eng-2740-ssh-use-elapsed-time' of github.com:p0-…

46513f7

…security/p0cli into fabian/eng-2740-ssh-use-elapsed-time

Address review comments.

1b64506

gergas3 reviewed Oct 3, 2024


fabgo added 6 commits October 3, 2024 11:17

Merge branch 'main' into fabian/eng-2740-ssh-use-elapsed-time

d79cbba

Update timeout for AWS.

9b74b7e

Use one timeout for both pre-test & trying.

334cd1e

Merge branch 'main' into fabian/eng-2740-ssh-use-elapsed-time

dbbc549

Address review comments.

d95ef7c

Merge branch 'main' into fabian/eng-2740-ssh-use-elapsed-time

9c4ab9f

gergas3 approved these changes Oct 4, 2024


fabgo merged commit d0e583c into main Oct 4, 2024
3 checks passed

fabgo deleted the fabian/eng-2740-ssh-use-elapsed-time branch October 4, 2024 15:45
