Increase default max retries and expose environment variable to override #830
Merged
Conversation
jamesbornholt had a problem deploying to PR integration tests with GitHub Actions, April 1, 2024 05:09 (Failure, x7)
Signed-off-by: James Bornholt <bornholt@amazon.com>
jamesbornholt temporarily deployed to PR integration tests with GitHub Actions, April 1, 2024 05:10 (Inactive, x7)
arsh approved these changes on Apr 2, 2024
dannycjones approved these changes on Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Apr 2, 2024 (x2)
Description of change
We were using the SDK's default retry configuration (actually, slightly wrong -- it's supposed to be 3 total attempts, but we configured 3 retries, so 4 attempts). This isn't a good default for file systems, as it works out to only retrying for about 2 seconds before giving up, and applications are rarely equipped to gracefully handle transient file IO errors.
This change increases the default to 10 total attempts, which takes about a minute on average. This is in the same ballpark as NFS's defaults (3 attempts, 60 seconds linear backoff), though still a little more aggressive. There's probably scope to go even further (20?), but this is a reasonable step for now.
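As a rough sanity check on the "about a minute" figure, here is a back-of-envelope sketch in Rust. This is not Mountpoint or CRT code: the 500ms base delay, the 20s cap, and full-jitter backoff are illustrative assumptions, chosen only to show how attempt counts translate into wall-clock retry time.

```rust
use std::time::Duration;

/// Estimate how long `max_attempts` total attempts keep retrying under
/// exponential backoff with full jitter. The base delay and cap below are
/// assumed values for illustration, not the CRT's actual constants.
fn expected_retry_window(max_attempts: u32) -> Duration {
    let base = Duration::from_millis(500); // assumed base delay
    let cap = Duration::from_secs(20); // assumed per-retry backoff cap
    let mut total = Duration::ZERO;
    // One initial attempt plus (max_attempts - 1) retries; retry i waits
    // a random duration in [0, base * 2^i], so half that on average.
    for i in 0..max_attempts.saturating_sub(1) {
        let ceiling = base.checked_mul(1u32 << i.min(16)).unwrap_or(cap).min(cap);
        total += ceiling / 2;
    }
    total
}

fn main() {
    for attempts in [3u32, 10, 20] {
        println!(
            "{attempts:2} total attempts -> roughly {:?} spent backing off",
            expected_retry_window(attempts)
        );
    }
}
```

Under these assumptions, 3 attempts yields well under a second of backoff (so roughly 2 seconds once request time is included), while 10 attempts yields around 45 seconds, consistent with the "about a minute" ballpark above.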
To allow customers to further tweak this, the S3CrtClient (and therefore Mountpoint) now respects the `AWS_MAX_ATTEMPTS` environment variable, and its value overrides the defaults. This is only a partial solution, as SDKs are supposed to also respect the `max_attempts` config file setting, but we don't have any of the infrastructure for that today (similar issue as #389).

We don't really have any way to test retries at the moment. It would be neat to have a mock S3 server we could control to test them, like the CRT does. But that's for another day.
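As an illustration of the override semantics described above, here is a minimal sketch of the resolution logic. This is not the actual S3CrtClient code; in particular, falling back to the default on an unparsable or zero value is an assumption.

```rust
use std::env;

/// Resolve the total attempt count: start from the built-in default of 10,
/// and let AWS_MAX_ATTEMPTS override it when it parses as a positive integer.
/// (Sketch only; the fallback behavior on bad values is assumed.)
fn resolve_max_attempts() -> u32 {
    const DEFAULT_MAX_ATTEMPTS: u32 = 10;
    env::var("AWS_MAX_ATTEMPTS")
        .ok()
        .and_then(|v| v.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_ATTEMPTS)
}

fn main() {
    println!("using {} total attempts", resolve_max_attempts());
}
```

In practice a customer would set the variable when mounting, e.g. `AWS_MAX_ATTEMPTS=15 mount-s3 my-bucket /mnt/bucket`, to retry even harder than the new default.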
Relevant issues: fixes #829, closes #743.
Does this change impact existing behavior?
Yes, Mountpoint now retries failing requests more, which can manifest as higher latency for file operations that previously would have failed.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).