[fix][txn] fix concurrent error cause txn stuck in TransactionBufferHandlerImpl#endTxn #23551

Open
wants to merge 1 commit into
base: master

@TakaHiR07 TakaHiR07 commented Nov 4, 2024

Fixes #23550

Motivation

After digging into the code, I found a concurrency bug in TransactionBufferHandlerImpl#checkRequestCredits() and checkPendingRequests() that causes the issue above.

Currently, the config TransactionBufferClientMaxConcurrentRequests limits the number of concurrent requests. However, if requests and a response interleave as follows, the requests get permanently stuck in the queue.
(To simplify the case, assume the number of permits is 1.)

| step | actor | action |
|------|-------|--------|
| 1 | request-1 | start checkRequestCredits() |
| 2 | request-1 | compareAndSet requestCredits to 0 |
| 3 | request-1 | execute endTxn |
| 4 | request-2 | start checkRequestCredits() |
| 5 | request-2 | get currentPermits = 0 |
| 6 | response-1 | trigger onResponse(), set requestCredits to 1 |
| 7 | response-1 | trigger checkPendingRequests(); permits == 1 && pendingRequests is empty, so break out of the while loop |
| 8 | request-2 | currentPermits == 0 && pendingRequests is empty, so add op to pendingRequests |
| 9 | request-3 | start checkRequestCredits() |
| 10 | request-3 | currentPermits == 1 && pendingRequests is not empty, so also add op to pendingRequests |

At this point no outstanding response is left that could trigger a removal from pendingRequests, so every new request is just added to pendingRequests and never executed.
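The interleaving above can be replayed deterministically in a single thread by splitting the credit check into a "read" half and an "act" half. The following is a minimal sketch under stated assumptions: `CreditRaceSketch`, `readCredits()`, `decide()` and `replay()` are hypothetical names for a simplified model, not the Pulsar source, and "executing" a request is reduced to a counter.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified, hypothetical model of the credit check described above. */
class CreditRaceSketch {
    final AtomicInteger requestCredits = new AtomicInteger(1); // permit count = 1
    final ConcurrentLinkedQueue<String> pendingRequests = new ConcurrentLinkedQueue<>();
    int executed = 0; // stand-in for "endTxn was sent"

    /** First half of the check: take a snapshot of the credits. */
    int readCredits() {
        return requestCredits.get();
    }

    /** Second half: act on the (possibly stale) snapshot. */
    boolean decide(String op, int snapshot) {
        if (snapshot > 0 && pendingRequests.peek() == null) {
            if (requestCredits.compareAndSet(snapshot, snapshot - 1)) {
                executed++;
                return true;
            }
            return decide(op, readCredits()); // lost the CAS: re-read and retry
        }
        pendingRequests.add(op); // only a later response can take it out again
        return false;
    }

    /** Response path: return the credit, then drain while credits and ops exist. */
    void onResponse() {
        requestCredits.incrementAndGet();
        while (true) {
            int permits = requestCredits.get();
            if (permits > 0 && pendingRequests.peek() != null) {
                if (requestCredits.compareAndSet(permits, permits - 1)) {
                    String op = pendingRequests.poll();
                    if (op != null) {
                        executed++;
                    } else {
                        requestCredits.incrementAndGet();
                    }
                }
            } else {
                break;
            }
        }
    }

    /** Replays steps 1-10 of the table in order. */
    static CreditRaceSketch replay() {
        CreditRaceSketch s = new CreditRaceSketch();
        s.decide("request-1", s.readCredits()); // steps 1-3
        int stale = s.readCredits();            // steps 4-5: request-2 reads 0
        s.onResponse();                         // steps 6-7: queue still empty
        s.decide("request-2", stale);           // step 8: enqueued on stale snapshot
        s.decide("request-3", s.readCredits()); // steps 9-10: queue not empty
        return s;
    }
}
```

In this model, `replay()` ends with one free credit, two queued requests and no request in flight, so nothing is left that would ever call `onResponse()` again.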

Modifications

The root cause is that currently only onResponse() can remove entries from pendingRequests. But when onResponse() runs, the requestOp may not have been added to pendingRequests yet.

  • So one modification is to let checkRequestCredits() also check the pendingRequests queue.
  • The while(true) in checkPendingRequests() is not necessary: when one response comes back, taking one requestOp from pendingRequests is enough.
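As a rough illustration of the first modification, the sketch below lets the request path re-check the queue right after enqueueing, and takes at most one op per check. It is a simplified single-threaded model under assumptions, not the code in this PR: `CreditFixSketch`, `readCredits()`, `decide()` and `replay()` are hypothetical names, and a failed CAS on the request path simply falls through to the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model: the request path also checks the pending queue. */
class CreditFixSketch {
    final AtomicInteger requestCredits = new AtomicInteger(1);
    final ConcurrentLinkedQueue<String> pendingRequests = new ConcurrentLinkedQueue<>();
    // Plain list is enough here because replay() runs in one thread.
    final List<String> executed = new ArrayList<>();

    int readCredits() {
        return requestCredits.get();
    }

    boolean decide(String op, int snapshot) {
        if (snapshot > 0 && pendingRequests.peek() == null
                && requestCredits.compareAndSet(snapshot, snapshot - 1)) {
            executed.add(op);
            return true;
        }
        pendingRequests.add(op);
        checkPendingRequests(); // assumed change: do not rely on a response alone
        return false;
    }

    void onResponse() {
        requestCredits.incrementAndGet();
        checkPendingRequests();
    }

    /** Takes at most one queued op if a credit is available. */
    void checkPendingRequests() {
        int permits = requestCredits.get();
        while (permits > 0 && pendingRequests.peek() != null) {
            if (requestCredits.compareAndSet(permits, permits - 1)) {
                String op = pendingRequests.poll();
                if (op != null) {
                    executed.add(op);
                    return;
                }
                requestCredits.incrementAndGet(); // queue emptied meanwhile
            }
            permits = requestCredits.get();
        }
    }

    /** Same interleaving as in the Motivation table, plus the second response. */
    static CreditFixSketch replay() {
        CreditFixSketch s = new CreditFixSketch();
        s.decide("request-1", s.readCredits());
        int stale = s.readCredits();            // request-2 reads 0
        s.onResponse();                         // response-1: queue still empty
        s.decide("request-2", stale);           // enqueued, then taken right away
        s.decide("request-3", s.readCredits()); // no credit: waits in the queue
        s.onResponse();                         // response-2 releases request-3
        return s;
    }
}
```

With the re-check in place, request-2 is picked up by its own call, so a response for it will eventually arrive and release request-3; in this model no request is left stranded.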

It is hard to add a test for this concurrency case.

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 4, 2024

TakaHiR07 commented Nov 4, 2024

@codelipenghui @congbobo184 Can you help review this PR?

  }
  } else {
- break;
+ checkPendingRequests();
Member

Just wondering if deeply nested recursive calls could become a problem. Is there a specific reason to replace the while loop with recursion?

@TakaHiR07 TakaHiR07 Nov 6, 2024

Because the while loop looks somewhat confusing: it looks as if one response can trigger taking multiple requests from pendingRequests. But the loop body actually only runs when permits > 0, and the permit count increases by 1 when one response returns.

Let's assume a scenario where 100 responses come back at the same time: all of them would enter the while loop and compete to acquire the permits. Now there are 100 permits and pendingRequestQueue also holds 100 requests, so every response would compete to take all the requests from the queue. That is not necessary.

So I changed it so that one response triggers taking one request from pendingRequestQueue. There are only two cases where we should recurse to retry:

  • If the request taken from the queue is null, we should retry checkPendingRequests() to take the next one.
  • If permit.compareAndSet is not successful, there is a concurrent checkRequestCredits() or checkPendingRequests(), so we should retry checkPendingRequests() to make sure the response triggers taking one request from pendingRequestQueue.
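The "one response takes one request" rule and its two retry cases can also be written as a loop, which avoids growing the stack on repeated retries. The sketch below is a hypothetical model, not the updated code in this PR: `OneForOneSketch`, `takeOne()` and `burst()` are assumed names, queued ops are plain integers, and dispatching is reduced to a counter. `burst(n)` mimics the scenario of n responses arriving at once with n queued requests.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model: each response hands out at most one queued request. */
class OneForOneSketch {
    final AtomicInteger requestCredits = new AtomicInteger(0);
    final ConcurrentLinkedQueue<Integer> pendingRequests = new ConcurrentLinkedQueue<>();
    final AtomicInteger dispatched = new AtomicInteger(0);

    /** Called once per response: return the credit, then take at most one op. */
    void onResponse() {
        requestCredits.incrementAndGet();
        takeOne();
    }

    /** Loop form of the two retry cases; returns true if an op was dispatched. */
    boolean takeOne() {
        while (true) {
            int permits = requestCredits.get();
            if (permits <= 0 || pendingRequests.peek() == null) {
                return false; // no credit or nothing queued
            }
            if (!requestCredits.compareAndSet(permits, permits - 1)) {
                continue; // case 2: lost a race on the counter, retry
            }
            Integer op = pendingRequests.poll();
            if (op == null) {
                requestCredits.incrementAndGet(); // case 1: give the credit back
                continue;                         // and re-check the queue
            }
            dispatched.incrementAndGet(); // stand-in for sending the request
            return true;
        }
    }

    /** n queued requests, n responses arriving concurrently. */
    static OneForOneSketch burst(int n) {
        OneForOneSketch s = new OneForOneSketch();
        for (int i = 0; i < n; i++) {
            s.pendingRequests.add(i);
        }
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(s::onResponse);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return s;
    }
}
```

In this model each response thread adds one credit before it looks at the queue and consumes at most one, so `burst(100)` dispatches all 100 queued requests without any single response draining the whole queue.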

Contributor Author

The recursive calls may indeed be a problem. I have updated the code; the new version is better.

@TakaHiR07 TakaHiR07 force-pushed the fix_concurrent_error_in_TransactionBufferHandlerImpl branch from b679478 to e5428b9 Compare November 6, 2024 09:38
Member

@lhotari lhotari left a comment

LGTM

Contributor

@congbobo184 congbobo184 left a comment

LGTM

Successfully merging this pull request may close these issues.

[Bug][txn] txn committing stuck and never finish commit process
4 participants