Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 timing out on CI (tests using LocalStack) #5045

Closed
daniel-frak opened this issue Mar 26, 2024 · 11 comments
Closed

S3 timing out on CI (tests using LocalStack) #5045

daniel-frak opened this issue Mar 26, 2024 · 11 comments
Labels
bug This issue is a bug. p3 This is a minor priority issue third-party This issue is related to third-party libraries or applications.

Comments

@daniel-frak
Copy link

daniel-frak commented Mar 26, 2024

Describe the bug

AWS SDK 2.21.12+ causes our Gitlab CI runner (an AWS t3.xlarge instance) to time out on S3 operations. While the tests usually take 6-10 min, here they take up to 43 minutes before failing the build.

Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5min). The issues only occur on CI. Furthermore, the issue does not occur on version 2.20.63. On 2.20.63, all S3 operations take milliseconds, as expected (even on CI).

Expected Behavior

S3 operations should not time out during CI builds.

Current Behavior

S3 operations time out during CI builds.

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
	at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
	at [...]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
	Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
		at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)

Reproduction Steps

We are running the tests using Maven and creating a LocalStack instance using TestContainers (1.19.7). The relevant code is:

s3Client = S3Client.builder()
  .endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
  .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
    LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
  .region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
  .build();

// ...

// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));

// ...And here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));

Below is a simplified version of our .gitlab-ci.yml:

variables:
  MAVEN_CLI_OPTS: >-
    -gs ./ci_settings.xml 
    -Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository 
    -Djavax.net.ssl.trustStore=./cacerts
    --batch-mode
    --errors

image: [...]/maven:3.9.4-amazoncorretto-21

code_quality_analysis:
  before_script:
    - >
      export DOCKER_AUTH_CONFIG=[...]
script:
    - > # Run tests
      mvn clean verify $MAVEN_CLI_OPTS
      -T 2C
# ...

I don't think our ci_settings.xml proxies are to blame, but just in case here's a simplified version of those:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  
  <proxies>
    <proxy>
      <id>[...]-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
    <proxy>
      <id>[...]-https</id>
      <active>true</active>
      <protocol>https</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
  </proxies>
</settings>

Possible Solution

it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12. I would assume some code introduced between those version is the reason for this issue.

Additional Information/Context

I noticed the bug when upgrading spring-cloud-aws-dependencies from 3.0.1 to 3.0.3+.
spring-cloud-aws-dependencies:3.0.1 depends on AWS SDK 2.20.63, while spring-cloud-aws-dependencies:3.0.3 depends on 2.21.12.

I noticed the bug appears on spring-cloud-aws-dependencies:3.0.3, but disappears when I manually downgrade AWS SDK to 2.21.12. I tried upgrading AWS SDK to 2.25.17, but the problem persists. Therefore, it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12.

The runner instance shows only mild memory and CPU consumption, so it doesn't seem like lack of resources is the problem here.

AWS Java SDK version used

2.21.12

JDK version used

maven:3.9.4-amazoncorretto-21 (docker image)

Operating System and version

Alpine Linux 3.16

@daniel-frak daniel-frak added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 26, 2024
@daniel-frak daniel-frak changed the title S3 timing out on CI (tests using LocalStack)(short issue description) S3 timing out on CI (tests using LocalStack) Mar 26, 2024
@daniel-frak
Copy link
Author

After some more research, I determined that the issue starts with version 2.21.0.

On version 2.20.162, everything works correctly, while as soon as I update to 2.21.0, the CI build (and only it) starts timing out.

Could the internal refactor of authentication be to blame here?

@daniel-frak
Copy link
Author

For even more information, it seems that the S3 requests are not reaching LocalStack:

localstack/localstack#10565 (comment)

@debora-ito
Copy link
Member

Yes, 2.21.0 included a major authentication refactor. Localstack, being a third-party solution, is not guaranteed to be supported by changes made in the AWS SDK.

You did the recommended action and reached out to Localstack team. Let's keep track of their investigation - localstack/localstack#10565

@debora-ito debora-ito added third-party This issue is related to third-party libraries or applications. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Apr 3, 2024
Copy link

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added the closing-soon This issue will close in 4 days unless further comments are made. label Apr 13, 2024
@daniel-frak
Copy link
Author

Unfortunately, the issue is still a problem. I'm planning to do some more digging soon and will hopefully be able to come up with some additional information.

@github-actions github-actions bot removed closing-soon This issue will close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Apr 16, 2024
@debora-ito debora-ito added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p3 This is a minor priority issue labels Apr 17, 2024
Copy link

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added the closing-soon This issue will close in 4 days unless further comments are made. label Apr 26, 2024
@daniel-frak
Copy link
Author

This is still a problem and the investigation is ongoing.

@github-actions github-actions bot removed closing-soon This issue will close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Apr 26, 2024
@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Apr 29, 2024
Copy link

github-actions bot commented May 6, 2024

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added the closing-soon This issue will close in 4 days unless further comments are made. label May 6, 2024
@daniel-frak
Copy link
Author

Investigation still ongoing.

@github-actions github-actions bot removed closing-soon This issue will close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels May 6, 2024
@daniel-frak
Copy link
Author

The issue turned out to be AWS SDK proxies not supporting IP ranges:
#5399

Copy link

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p3 This is a minor priority issue third-party This issue is related to third-party libraries or applications.
Projects
None yet
Development

No branches or pull requests

2 participants