PinotS3 : Connection Pool Shutdown after one hour on EKS on release-1.0.0 #11761
Comments
Thanks for reporting the issue.
Thanks @Jackie-Jiang. Will take a look.
@swaminathanmanish The issue seems to be somewhat related to the comment here
I am not a Java developer, but let me know if I can help in any way to fast track this. Thanks!
What do you think about downgrading the AWS SDK version? (Only in the environment with the problem, not in the official release, of course.)
Just trying to understand the scope of impact: it looks like this impacts only DefaultCredentialsProvider, and not when you provide the credentials explicitly (from S3PinotFS).
Thank you for your help!
We can't use access and secret keys on AWS as it is against our IT security policy; we have to use role-based access. Also, does this not impact other parts of the code where AWS resources are accessed besides minion tasks (like deep store)?
Tracking down when this update was made (from 2.14 to 2.20). I'm not sure if a downgrade is an option, since it was upgraded to address a memory leak - #10898
Any ETA on this? We are currently stuck on our offline tables due to this error.
I'm not sure about the root cause, but the awsCredentialsProvider is not closed when the else block is exited. In Java, an AutoCloseable is only closed automatically if it's put in a try-with-resources statement. You said you were using role-based access, so it's not the path that builds the provider from explicit access/secret keys.
However, to create StsAssumeRoleCredentialsProvider, an StsClient object is created first, and that client has its own HTTP connection pool. The issue you found (aws/aws-sdk-java-v2#4221) mentioned some potential causes along these lines.
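For anyone following along, here is a minimal sketch (not Pinot's actual code; the class name and region are illustrative) of the try-with-resources behavior referred to above: DefaultCredentialsProvider is an SdkAutoCloseable, and closing it also shuts down any STS client and connection pool it created internally, which breaks other clients still holding a reference to the same provider instance.

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class TryWithResourcesExample {
  public static void main(String[] args) {
    // Both the provider and the client implement SdkAutoCloseable, so
    // try-with-resources closes them when the block exits.
    try (DefaultCredentialsProvider provider = DefaultCredentialsProvider.builder().build();
         S3Client s3 = S3Client.builder()
             .region(Region.US_EAST_1)          // illustrative region
             .credentialsProvider(provider)
             .build()) {
      s3.listBuckets();
    }
    // Any other client still holding this provider instance would now fail
    // on its next credential refresh with a "Connection pool shut down" error.
  }
}
```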
I didn't find any issue with memory. I tried increasing the pod memory limits as well as modifying the jvmOptions to exit on OutOfMemoryError, as discussed on Slack here. The error happens exactly one hour after the pod starts. I suspect that, since the AWS session token generated by the IAM role expires after one hour, Pinot is unable to refresh the AWS session token, so anything related to memory is out of the picture here. We are currently restarting the pods to run minion tasks. The problem goes away for exactly one hour, during which our tasks run. After one hour, the problem comes back and we have to restart the pods again.
I came across this whilst googling around the same issue and just wanted to post my findings in case it helps you. For me, the cause of the issue was that I made a new CredentialsProvider each time I created a client.
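One way to read this finding is to build the credentials provider once and reuse it across clients, rather than constructing (and letting the SDK close) one per client. A minimal sketch under that assumption; the class and region names are illustrative, not the commenter's actual code:

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public final class SharedCredentialsExample {
  // One long-lived provider for the whole process. It is never closed by
  // individual clients, so its internal STS client and connection pool
  // stay usable across credential refreshes.
  private static final DefaultCredentialsProvider SHARED_PROVIDER =
      DefaultCredentialsProvider.builder().build();

  private SharedCredentialsExample() {
  }

  public static S3Client newS3Client() {
    return S3Client.builder()
        .region(Region.US_EAST_1) // illustrative region
        .credentialsProvider(SHARED_PROVIDER)
        .build();
  }
}
```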
You might want to look at org.apache.hadoop.fs.s3a.AWSCredentialProviderList, as we ref-count our closing; plus we don't use that default chain. We did, however, hit HADOOP-18945: the IAM profile timing out in resolve() when under load. Do make sure you do async credential refresh.
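On the async refresh suggestion: the AWS SDK v2 STS credential providers support background refresh via asyncCredentialUpdateEnabled. A minimal sketch, assuming an assume-role setup; the role ARN, session name, and region below are placeholders:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

public class AsyncRefreshExample {
  public static StsAssumeRoleCredentialsProvider buildProvider() {
    StsClient sts = StsClient.builder()
        .region(Region.US_EAST_1) // placeholder region
        .build();
    return StsAssumeRoleCredentialsProvider.builder()
        .stsClient(sts)
        .refreshRequest(AssumeRoleRequest.builder()
            .roleArn("arn:aws:iam::123456789012:role/example-role") // placeholder ARN
            .roleSessionName("pinot-minion")                        // placeholder name
            .build())
        // Refresh the session token on a background thread before it expires,
        // instead of blocking (or failing) callers at the expiry boundary.
        .asyncCredentialUpdateEnabled(true)
        .build();
  }
}
```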
Thanks @q-nathangrand for the pointer. I made a quick fix here: #12063. @piby180, wonder if you could help take this fix for a spin in your env and see how it goes. Thank you!
Any updates on when this might be fixed? We are seeing a similar error while reading from AWS Kinesis in a realtime table.
@abhijeetkushe Can you try out #12063 and see if it works? We are waiting for a confirmation that it works before merging it.
@Jackie-Jiang Ok, will let you know.
@Jackie-Jiang I was successfully able to test the fix. In my case I am also using Kinesis, so I had to make the same change at this location as well: https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-stream-ingestion/pinot-kinesis/src/main/java/org/apache/pinot/plugin/stream/kinesis/KinesisConnectionHandler.java#L99. Do I need to create a separate PR for this?
Thanks for testing it out, and good catch on the other places. Since you have an env to test those out, please help open a new PR that includes all the fixes for DefaultCredentialsProvider, including that one-line change in my previous PR. Thank you :)
@klsince Thanks, I will add that fix to all the plugins where I can find it and test it out.
@klsince @Jackie-Jiang I have opened this PR: #12221
Thanks for the fix and tests @abhijeetkushe. Closing the issue for now.
We are currently facing the following exception in the controller pod on EKS. The SegmentGenerationAndPush minion job for offline tables works for one hour after the controller pod starts; after that it stops working and throws the error below. Restarting the pod fixes the error again for an hour, after which the error appears again.
We use IRSA roles on the service account to provide S3 access to the Pinot pods.
The problem started happening after we upgraded the Helm chart from release-0.12.1 to release-1.0.0. It seems to be caused by the awssdk upgrade to 2.20.94 in Pinot release-1.0.0; it was not happening with awssdk 2.14.28 in Pinot release-0.12.1.
We also tried the latest tag for pinot image but we got the same error.
It looks like it is unable to refresh the AWS session token after it expires in one hour.
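For context on how the IRSA path resolves credentials in the Java SDK, here is a minimal sketch (not Pinot code; the region is a placeholder): with IRSA, the pod gets AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE injected, the default credential chain resolves them through WebIdentityTokenFileCredentialsProvider, and the resulting STS session token is what expires after roughly one hour and must be refreshed.

```java
import software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class IrsaExample {
  public static void main(String[] args) {
    // Reads AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE from the environment
    // (the variables IRSA injects) and exchanges the web identity token with STS.
    try (S3Client s3 = S3Client.builder()
        .region(Region.US_EAST_1) // placeholder region
        .credentialsProvider(WebIdentityTokenFileCredentialsProvider.create())
        .build()) {
      s3.listBuckets();
    }
  }
}
```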
A similar issue has been raised in the awssdk project:
aws/aws-sdk-java-v2#4386
aws/aws-sdk-java-v2#4221
This issue is currently blocking all our batch ingestion minion jobs on our offline tables, forcing us to downgrade back to release-0.12.1.
Our plugin versions are: