
Increase default max retries and expose environment variable to override #830

Merged — 2 commits merged into awslabs:main on Apr 2, 2024

Conversation

jamesbornholt
Member

@jamesbornholt jamesbornholt commented Apr 1, 2024

Description of change

We were using the SDK's default retry configuration (actually, slightly wrong -- it's supposed to be 3 total attempts, but we configured 3 *retries*, so 4 attempts). This isn't a good default for file systems: it works out to only about 2 seconds of retrying before giving up, and applications are rarely equipped to gracefully handle transient file I/O errors.

This change increases the default to 10 total attempts, which takes about a minute on average. This is in the same ballpark as NFS's defaults (3 attempts, 60 seconds linear backoff), though still a little more aggressive. There's probably scope to go even further (20?), but this is a reasonable step for now.
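To see where those rough timings come from, here is a small sketch estimating time spent retrying under full-jitter exponential backoff. The base delay (500 ms) and cap (20 s) are illustrative assumptions, not Mountpoint's or the SDK's actual parameters:

```rust
// Hypothetical illustration: expected total sleep time before giving up,
// assuming full-jitter exponential backoff (base 500 ms, capped at 20 s).
// These parameters are assumptions, not Mountpoint's actual configuration.
fn expected_backoff_ms(attempts: u32, base_ms: u64, cap_ms: u64) -> u64 {
    // There are (attempts - 1) sleeps between attempts; with full jitter,
    // the sleep before retry n averages half of min(cap, base * 2^(n-1)).
    (1..attempts)
        .map(|n| base_ms.saturating_mul(1u64 << (n - 1)).min(cap_ms) / 2)
        .sum()
}

fn main() {
    // 4 attempts (the old effective default) retries for only ~2 s...
    println!("4 attempts:  ~{} ms", expected_backoff_ms(4, 500, 20_000));
    // ...while 10 attempts retries for roughly a minute.
    println!("10 attempts: ~{} ms", expected_backoff_ms(10, 500, 20_000));
}
```

Under these assumed parameters, 4 attempts comes out to about 1.75 s of backoff and 10 attempts to about 46 s, which matches the "about 2 seconds" and "about a minute" figures above.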

To allow customers to further tweak this, the S3CrtClient (and therefore Mountpoint) now respects the AWS_MAX_ATTEMPTS environment variable, and its value overrides the defaults. This is only a partial solution, as SDKs are supposed to also respect the max_attempts config file setting, but we don't have any of the infrastructure for that today (similar issue as #389).
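The override logic described above can be sketched as follows. The function name and fallback handling are illustrative, not Mountpoint's exact code:

```rust
use std::env;

// Sketch of the AWS_MAX_ATTEMPTS override: read the environment variable
// and fall back to the new default of 10 total attempts when it is unset
// or unparseable. Illustrative only -- not Mountpoint's actual code.
fn max_attempts_from_env(default: u32) -> u32 {
    env::var("AWS_MAX_ATTEMPTS")
        .ok()
        .and_then(|v| v.parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

fn main() {
    // With the variable unset, the default applies.
    env::remove_var("AWS_MAX_ATTEMPTS");
    assert_eq!(max_attempts_from_env(10), 10);

    // A set value overrides the default.
    env::set_var("AWS_MAX_ATTEMPTS", "20");
    assert_eq!(max_attempts_from_env(10), 20);

    // Garbage values fall back to the default rather than failing.
    env::set_var("AWS_MAX_ATTEMPTS", "not-a-number");
    assert_eq!(max_attempts_from_env(10), 10);
}
```

A full solution would also consult the `max_attempts` setting in the shared AWS config file, which is what the SDK spec calls for; as noted above, that infrastructure doesn't exist here yet.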

We don't really have any way to test retries at the moment. It would be neat to have a mock S3 server we could control to test them, like the CRT does. But that's for another day.

Relevant issues: fixes #829, closes #743.

Does this change impact existing behavior?

Yes: Mountpoint now retries failed requests more times, which can manifest as higher latency for file operations that would previously have failed fast.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

Signed-off-by: James Bornholt &lt;bornholt@amazon.com&gt;
@jamesbornholt jamesbornholt marked this pull request as ready for review April 1, 2024 19:54
@dannycjones dannycjones added this pull request to the merge queue Apr 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
@jamesbornholt jamesbornholt added this pull request to the merge queue Apr 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
@jamesbornholt jamesbornholt added this pull request to the merge queue Apr 2, 2024
Merged via the queue into awslabs:main with commit a187420 Apr 2, 2024
23 checks passed
@jamesbornholt jamesbornholt deleted the retry-config branch April 2, 2024 19:18