[fix][txn] fix concurrent error cause txn stuck in TransactionBufferHandlerImpl#endTxn #23551
base: master
@codelipenghui @congbobo184 Can you help review this PR?
    }
} else {
    break;
    checkPendingRequests();
Just wondering whether deeply nested recursive calls could become a problem. Is there a specific reason to replace the while loop with recursion?
Because the while loop is somewhat confusing: it looks as if 1 response can trigger taking multiple requests from pendingRequest. In fact the loop body only runs while permit > 0, and permit is incremented by 1 each time 1 response returns.
Consider a scenario where 100 responses come back at the same time. All of them enter the while loop and compete to acquire permits. At that point there are 100 permits and 100 entries in pendingRequestQueue, so every response competes to take all of the requests from the queue. That is not necessary.
So I changed it so that 1 response triggers taking 1 request from pendingRequestQueue. There are only two cases in which we need to recurse to retry:
- If the request taken from the queue is null, we should retry checkPendingRequests() to take the next one.
- If permit.compareAndSet fails, a concurrent checkRequestCredits() or checkPendingRequests() is in progress, so we should retry checkPendingRequests() to make sure each response triggers taking 1 request from pendingRequestQueue.
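The "one response takes one request" idea, with the two retry cases expressed as a loop rather than recursion, might look roughly like the sketch below. It is not the code from this PR: `PermitDispatcher`, `enqueue`, `dispatchOne`, and the accessor methods are hypothetical names, and requests are modeled as plain `Runnable`s.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for illustration only; not TransactionBufferHandlerImpl.
class PermitDispatcher {
    private final AtomicInteger permits;
    private final Queue<Runnable> pendingRequests = new ConcurrentLinkedQueue<>();

    PermitDispatcher(int initialPermits) {
        this.permits = new AtomicInteger(initialPermits);
    }

    // Park a request that could not be sent immediately.
    void enqueue(Runnable request) {
        pendingRequests.add(request);
    }

    // Each response returns one permit and dispatches at most one parked request.
    void onResponse() {
        permits.incrementAndGet();
        dispatchOne();
    }

    private void dispatchOne() {
        while (true) {
            int current = permits.get();
            if (current <= 0 || pendingRequests.isEmpty()) {
                return; // nothing to dispatch, or no permit available
            }
            if (!permits.compareAndSet(current, current - 1)) {
                continue; // lost a race with a concurrent caller; retry
            }
            Runnable next = pendingRequests.poll();
            if (next == null) {
                permits.incrementAndGet(); // queue was drained meanwhile; give the permit back
                continue;
            }
            next.run();
            return; // exactly one request dispatched for this response
        }
    }

    int availablePermits() {
        return permits.get();
    }

    int pendingCount() {
        return pendingRequests.size();
    }
}
```

The loop bounds stack depth under contention, which addresses the concern about deeply nested recursive calls while keeping the same two retry conditions.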
The recursive calls may actually be a problem. I have updated the code; the new version is better.
b679478
to
e5428b9
Compare
LGTM
LGTM
Fixes #23550
Motivation
After digging into the code, I found a concurrency bug in TransactionBufferHandlerImpl#checkRequestCredits() and checkPendingRequests() that causes the issue above.
Currently, the config TransactionBufferClientMaxConcurrentRequests limits the number of concurrent requests. However, if a request and a response are executed in the following order, the request gets permanently stuck in the queue.
(To simplify the case, assume there is only 1 permit.)
At this point no response is left to trigger pendingRequest.remove, so every new request is just added to pendingRequest and never executed.
Modifications
The root cause is that currently only onResponse() can trigger pendingRequest.remove, but when onResponse() runs, the requestOp may not yet have been added to pendingRequest.
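One interleaving consistent with this description can be replayed deterministically in a single thread, as sketched below. This is an illustration under assumptions, not the Pulsar code: `LostWakeupReplay`, `tryAcquire`, `park`, and `submit` are hypothetical names, and the submit path is assumed to send only when a permit is free and the pending queue is empty (inferred from the statement that new requests keep being added to pendingRequest).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-threaded replay of a check-then-act race; names are illustrative.
class LostWakeupReplay {
    int permits = 1; // simplified case: a single permit
    final Queue<String> pendingRequests = new ArrayDeque<>();
    final List<String> sent = new ArrayList<>();

    // Step 1 of submit: decide whether the request may be sent right now.
    // Assumption: sending requires a free permit AND an empty pending queue.
    boolean tryAcquire() {
        if (permits > 0 && pendingRequests.isEmpty()) {
            permits--;
            return true;
        }
        return false;
    }

    // Step 2 of submit: park the request. In the race, this runs late.
    void park(String request) {
        pendingRequests.add(request);
    }

    void submit(String request) {
        if (tryAcquire()) {
            sent.add(request);
        } else {
            park(request);
        }
    }

    // Only a response drains the pending queue.
    void onResponse() {
        permits++;
        while (permits > 0 && !pendingRequests.isEmpty()) {
            permits--;
            sent.add(pendingRequests.poll());
        }
    }

    static LostWakeupReplay replay() {
        LostWakeupReplay h = new LostWakeupReplay();
        h.submit("req-1");                 // takes the only permit
        boolean acquired = h.tryAcquire(); // req-2 sees no permit and is about to be parked
        h.onResponse();                    // response for req-1 arrives first; queue is still empty
        if (!acquired) {
            h.park("req-2");               // parked after the only outstanding response has passed
        }
        h.submit("req-3");                 // a permit is free, but the non-empty queue forces parking
        return h;
    }
}
```

After `replay()`, a permit is available yet both later requests sit in the queue with no outstanding response left to drain them, which is the stuck state described in the Motivation section.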
It is hard to add a test for this concurrency case.
Verifying this change
Documentation
doc
doc-required
doc-not-needed
doc-complete