PinotS3 : Connection Pool Shutdown after one hour on EKS on release-1.0.0 #11761
Comments
Thanks for reporting the issue.
Thanks @Jackie-Jiang. Will take a look.
@swaminathanmanish The issue seems to be somewhat related to the comment here
I am not a Java developer, but let me know if I can help in any way to fast track this. Thanks!
What do you think about downgrading the AWS SDK version? (Only in the environment with the problem, not in the official release, of course.)
Just trying to understand the scope of impact: it looks like this impacts only DefaultCredentialsProvider, and not when you provide the credentials explicitly (from S3PinotFS).
Thank you for your help!
We can't use access and secret keys on AWS as it is against our IT security policy; we have to use role-based access. Also, does this not impact other parts of the code where AWS resources are accessed besides minion tasks (like deep store)?
Tracking down when this update was made (from 2.14 to 2.20). I'm not sure if a downgrade is an option, since it was upgraded to address a memory leak - #10898
Any ETA on this? We are currently stuck on our offline tables due to this error.
I'm not sure about the root cause, but the awsCredentialsProvider is not closed when the else block is exited. In Java, an AutoCloseable is only closed automatically if it's put in a try-with-resources statement. You said you were using role-based access, so it's not the path that builds the provider from explicit access/secret keys.
However, to create StsAssumeRoleCredentialsProvider, an StsClient object is created first, and that client has its own HTTP connection pool. The issue you found (aws/aws-sdk-java-v2#4221) mentioned some potential causes along these lines.
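For anyone following along, here is a minimal sketch (not Pinot's actual code; the class name and region are illustrative) of the try-with-resources behavior referred to above: DefaultCredentialsProvider is an SdkAutoCloseable, and closing it also shuts down any STS client and connection pool it created internally, which breaks other clients still holding a reference to the same provider instance.

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class TryWithResourcesExample {
  public static void main(String[] args) {
    // Both the provider and the client implement SdkAutoCloseable, so
    // try-with-resources closes them when the block exits.
    try (DefaultCredentialsProvider provider = DefaultCredentialsProvider.builder().build();
         S3Client s3 = S3Client.builder()
             .region(Region.US_EAST_1)          // illustrative region
             .credentialsProvider(provider)
             .build()) {
      s3.listBuckets();
    }
    // Any other client still holding this provider instance would now fail
    // on its next credential refresh with a "Connection pool shut down" error.
  }
}
```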
I didn't find any issue with memory. I tried increasing the pod memory limits as well as modifying the jvmOptions to exit on OutOfMemoryError, as discussed on Slack here. The error happens exactly one hour after the pod starts. I suspect that, since the AWS session token generated by the IAM role expires after one hour, Pinot is unable to refresh the AWS session token, so anything related to memory is out of the picture here. We are currently restarting the pods to run minion tasks. The problem goes away for exactly one hour, during which our tasks run. After one hour, the problem comes back and we have to restart the pods again.
I came across this whilst googling around the same issue and just wanted to post my findings in case it helps you. For me, the cause of the issue was that I made a new CredentialsProvider each time I created a client.
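One way to read this finding is to build the credentials provider once and reuse it across clients, rather than constructing (and letting the SDK close) one per client. A minimal sketch under that assumption; the class and region names are illustrative, not the commenter's actual code:

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public final class SharedCredentialsExample {
  // One long-lived provider for the whole process. It is never closed by
  // individual clients, so its internal STS client and connection pool
  // stay usable across credential refreshes.
  private static final DefaultCredentialsProvider SHARED_PROVIDER =
      DefaultCredentialsProvider.builder().build();

  private SharedCredentialsExample() {
  }

  public static S3Client newS3Client() {
    return S3Client.builder()
        .region(Region.US_EAST_1) // illustrative region
        .credentialsProvider(SHARED_PROVIDER)
        .build();
  }
}
```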
You might want to look at org.apache.hadoop.fs.s3a.AWSCredentialProviderList, as we ref-count our closing; plus we don't use that default chain. We did, however, hit HADOOP-18945: the IAM profile timing out in resolve() when under load. Do make sure you do async credential refresh.
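On the async refresh suggestion: the AWS SDK v2 STS credential providers support background refresh via asyncCredentialUpdateEnabled. A minimal sketch, assuming an assume-role setup; the role ARN, session name, and region below are placeholders:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

public class AsyncRefreshExample {
  public static StsAssumeRoleCredentialsProvider buildProvider() {
    StsClient sts = StsClient.builder()
        .region(Region.US_EAST_1) // placeholder region
        .build();
    return StsAssumeRoleCredentialsProvider.builder()
        .stsClient(sts)
        .refreshRequest(AssumeRoleRequest.builder()
            .roleArn("arn:aws:iam::123456789012:role/example-role") // placeholder ARN
            .roleSessionName("pinot-minion")                        // placeholder name
            .build())
        // Refresh the session token on a background thread before it expires,
        // instead of blocking (or failing) callers at the expiry boundary.
        .asyncCredentialUpdateEnabled(true)
        .build();
  }
}
```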
Thanks @q-nathangrand for the pointer. I made a quick fix here: #12063. @piby180, wonder if you could help take this fix for a spin in your env and see how it goes. Thank you!
Any updates on when this might be fixed? We are seeing a similar error while reading from AWS Kinesis in a realtime table.
@abhijeetkushe Can you try out #12063 and see if it works? We are waiting for a confirmation that it works before merging it.
@Jackie-Jiang Ok, will let you know.
@Jackie-Jiang I was successfully able to test the fix. In my case I am also using Kinesis, so I had to make the same change at this location as well: https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-stream-ingestion/pinot-kinesis/src/main/java/org/apache/pinot/plugin/stream/kinesis/KinesisConnectionHandler.java#L99. Do I need to create a separate PR for this?
Thanks for testing it out, and good catch on the other places. Since you have an env to test those out, please help open a new PR that includes all the fixes for DefaultCredentialsProvider, including that one-line change in my previous PR. Thank you :)
@klsince Thanks, I will add that fix to all the plugins where I can find it and test it out.
@klsince @Jackie-Jiang I have opened this PR: #12221
Thanks for the fix and tests @abhijeetkushe. Closing the issue for now.
We are currently facing the following exception in the controller pod on EKS. The SegmentGenerationAndPush minion job for offline tables works for one hour after the controller pod starts; after that it stops working and throws the error below. Restarting the pod fixes the error again for an hour, after which the error appears again.
We use IRSA roles on the service account to provide S3 access to the Pinot pods.
The problem started happening after we upgraded the Helm chart from release-0.12.1 to release-1.0.0. It seems to be caused by the awssdk upgrade to 2.20.94 in Pinot release-1.0.0; it was not happening with awssdk 2.14.28 in Pinot release-0.12.1.
We also tried the latest tag for pinot image but we got the same error.
It looks like it is unable to refresh the AWS session token after it expires in one hour.
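For context on how the IRSA path resolves credentials in the Java SDK, here is a minimal sketch (not Pinot code; the region is a placeholder): with IRSA, the pod gets AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE injected, the default credential chain resolves them through WebIdentityTokenFileCredentialsProvider, and the resulting STS session token is what expires after roughly one hour and must be refreshed.

```java
import software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class IrsaExample {
  public static void main(String[] args) {
    // Reads AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE from the environment
    // (the variables IRSA injects) and exchanges the web identity token with STS.
    try (S3Client s3 = S3Client.builder()
        .region(Region.US_EAST_1) // placeholder region
        .credentialsProvider(WebIdentityTokenFileCredentialsProvider.create())
        .build()) {
      s3.listBuckets();
    }
  }
}
```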
A similar issue has been raised in the awssdk project:
aws/aws-sdk-java-v2#4386
aws/aws-sdk-java-v2#4221
This issue is currently blocking all our batch ingestion minion jobs on our offline tables, forcing us to downgrade back to release-0.12.1.
Our plugin versions are: