-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve transaction relay logic #4985
base: develop
Are you sure you want to change the base?
Conversation
* Allows transactions, validator lists, proposals, and validations to be relayed more often, but only when triggered by another event, such as receiving it from a peer * Decrease from 5min. * Expected to help transaction throughput on poorly connected networks.
* Hold if the transaction gets a ter, tel, or tef result. * Use the new SF_HELD flag to ultimately prevent the transaction from being held and retried too many times.
* Ensures that successful transactions are broadcast to peers, appropriate failed transactions are held for later attempts, fee changes are sent to subscribers, etc.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #4985 +/- ##
=========================================
- Coverage 77.9% 77.9% -0.1%
=========================================
Files 782 782
Lines 66621 66701 +80
Branches 8161 8169 +8
=========================================
+ Hits 51902 51930 +28
- Misses 14719 14771 +52
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! I really appreciate the way the commits were divided up. It made the code review much easier.
I mostly left complementary comments. But there's one bool
that I suspect needs to be changed to a std::atomic<bool>
. See what you think...
(!itrNext->second->getSeqProxy().isSeq() || | ||
itrNext->second->getSeqProxy().value() == seqProxy.value() + 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This takes full advantage of the "unusual" sort order of SeqProxy
, that all sequence numbers sort in front of all tickets.
src/ripple/app/misc/NetworkOPs.cpp
Outdated
auto const txNext = m_ledgerMaster.popAcctTransaction(txCur); | ||
if (txNext) | ||
auto txNext = m_ledgerMaster.popAcctTransaction(txCur); | ||
while (txNext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially worried that this while
loop might submit_held()
a boatload of transactions. But the TxQ
defaults maximumTxnPerAccount
to 10. So the largest number of times this loop could run (ordinarily) would be 10. That seems reasonable, and a good way to clear out an account that has a lot of transactions queued.
This new characteristic may be worth pointing out to the performance folks, in case they want to stress it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, this loop doesn't do anything with the TxQ
. This is reading the list of held transactions in LedgerMaster
. So, AFAIK, there is no limit. However, the only way for a tx to get into that list is in this same function (see calls to addHeldTransaction
), and only transactions received directly from RPC or from peers get to this function. That said, a malicious client or peer could probably get a bunch of transactions held (this is true without these changes). Thus, it might make sense to add a limit to mitigate some kinds of attacks.
Also, submit_held
is a std::vector
. Anything added to it will be processed in the next batch of transactions.
// VFALCO NOTE The hash for an open ledger is undefined so we use | ||
// something that is a reasonable substitute. | ||
CanonicalTXSet set(app_.openLedger().current()->info().parentHash); | ||
std::swap(mHeldTransactions, set); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great way to minimize processing while the lock
is held. Good spotting!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/ripple/app/misc/NetworkOPs.cpp
Outdated
bool bLocal, | ||
FailHard failType) | ||
bool | ||
NetworkOPsImp::preProcessTransaction(std::shared_ptr<Transaction>& transaction) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting the validation and canonicalization part of processTransaction
into this other method was a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/ripple/app/misc/NetworkOPs.cpp
Outdated
@@ -1276,6 +1300,17 @@ NetworkOPsImp::doTransactionSync( | |||
transaction->setApplying(); | |||
} | |||
|
|||
doTransactionSyncBatch( | |||
lock, [&transaction](std::unique_lock<std::mutex>& lock) { | |||
return transaction->getApplying(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this call to Transaction::getApplying()
. The bool
being read is neither protected by a lock nor atomic
. I think we need to change the Transaction::mApplying
member variable into a std::atomic<bool>
, since the bool
is being accessed across threads.
This problem was present before your change. But please fix it while we're thinking about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was done intentionally because it avoids the overhead of taking a lock, and the consequences of a race condition are nearly-zero.
- Most significantly, almost all accesses of the
*Applying
functions are underNetworkOPsImp
's lock. The exceptions are- The new
processTransactionSet
function. - The
popAcctTransaction
loop we were just talking about.
- The new
- If there is a race condition, and
getApplying
returns false when it should be true, the transaction will be processed again. Not that big a deal if it's a rare one-off. Most of the time, it'll gettefALREADY
ortefPAST_SEQ
. - On the flip side, if it returns true, when it should be false, then the transaction must have been attempted recently, so no big deal if it doesn't immediately get tried right away.
- If there's a race between
setApplying
andclearApplying
, and the flag ends up set, then a batch is about to try to process the transaction and will callclearApplying
later. If it ends up cleared, then it might get attempted again later as is the case with item 2.
Alllll that said, I don't think it would hurt to rewrite those two lock exceptions I pointed out to ensure that the lock is taken for the function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this optimistic locking approach doesn't create any harmful side effects, as you suggest @ximinez , it makes sense to keep the current approach. But let's put your explanation in a comment in the code to preserve it.
src/ripple/app/misc/NetworkOPs.cpp
Outdated
@@ -1224,12 +1232,28 @@ NetworkOPsImp::processTransaction( | |||
transaction->setStatus(INVALID); | |||
transaction->setResult(temBAD_SIGNATURE); | |||
app_.getHashRouter().setFlags(transaction->getID(), SF_BAD); | |||
return; | |||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, these 5 lines are not hit by the unit tests. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be impossible to get to this line. Note the block above explains that I don't think the check is necessary, and has an assert
on the same validity
at the end.
// NOTE eahennis - I think this check is redundant,
// but I'm not 100% sure yet.
// If so, only cost is looking up HashRouter flags.
auto const view = m_ledgerMaster.getCurrentLedger();
auto const [validity, reason] = checkValidity(
app_.getHashRouter(),
*transaction->getSTransaction(),
view->rules(),
app_.config());
assert(validity == Validity::Valid);
src/ripple/app/misc/NetworkOPs.cpp
Outdated
JLOG(m_journal.warn()) << transaction->getID() << ": cached bad!\n"; | ||
transaction->setStatus(INVALID); | ||
transaction->setResult(temBAD_SIGNATURE); | ||
return; | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. According to my local code coverage these four lines are never hit. I know there are a few places in the unit tests that produce corrupted signatures. They must be handled elsewhere. Just noticing, no need to address this in this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this one isn't documented as explicitly, but the idea is that it should also be impossible to get to this point with a bad signature.
src/ripple/app/misc/NetworkOPs.cpp
Outdated
mTransactions.swap(transactions); | ||
else | ||
{ | ||
for (auto& t : transactions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to reserve space in mTransactions
? Consider
mTransactions.reserve(mTransactions.size() + transactions.size());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does!
* upstream/develop: fix: Remove redundant STAmount conversion in test (4996) fix: resolve database deadlock: (4989) test: verify the rounding behavior of equal-asset AMM deposits (4982) test: Add tests to raise coverage of AMM (4971) chore: Improve codecov coverage reporting (4977) test: Unit test for AMM offer overflow (4986) fix amendment to add `PreviousTxnID`/`PreviousTxnLgrSequence` (4751)
* upstream/develop: Set version to 2.2.0-b3
* upstream/develop: Ignore more commits Address compiler warnings Add markers around source lists Fix source lists Rewrite includes Format formerly .hpp files Rename .hpp to .h Simplify protobuf generation Consolidate external libraries Remove packaging scripts Remove unused files
Process held transactions through existing NetworkOPs batching: * Ensures that successful transactions are broadcast to peers, appropriate failed transactions are held for later attempts, fee changes are sent to subscribers, etc. Pop all transactions with sequential sequences, or tickets Give a transaction more chances to be retried: * Hold if the transaction gets a ter, tel, or tef result. * Use the new SF_HELD flag to ultimately prevent the transaction from being held and retried too many times. Decrease `shouldRelay` limit to 30s: * Allows transactions, validator lists, proposals, and validations to be relayed more often, but only when triggered by another event, such as receiving it from a peer * Decrease from 5min. * Expected to help transaction throughput on poorly connected networks.
* upstream/develop: Set version to 2.2.0-rc1
* upstream/develop: Remove flow assert: (5009) Update list of maintainers: (4984)
* upstream/develop: Add external directory to Conan recipe's exports (5006) Add missing includes (5011)
* commit 'c706926': (23 commits) Change order of checks in amm_info: (4924) Add the fixEnforceNFTokenTrustline amendment: (4946) Replaces the usage of boost::string_view with std::string_view (4509) docs: explain how to find a clang-format patch generated by CI (4521) XLS-52d: NFTokenMintOffer (4845) chore: remove repeat words (5041) Expose all amendments known by libxrpl (5026) fixReducedOffersV2: prevent offers from blocking order books: (5032) Additional unit tests for testing deletion of trust lines (4886) Fix conan typo: (5044) Add new command line option to make replaying transactions easier: (5027) Fix compatibility with Conan 2.x: (5001) Set version to 2.2.0 Set version to 2.2.0-rc3 Add xrpl.libpp as an exported lib in conan (5022) Fix Oracle's token pair deterministic order: (5021) Set version to 2.2.0-rc2 Fix last Liquidity Provider withdrawal: Fix offer crossing via single path AMM with transfer fee: Fix adjustAmountsByLPTokens(): ...
* commit 'f6879da': Add bin/physical.sh (4997) Prepare to rearrange sources: (4997)
* upstream/develop: fixInnerObjTemplate2 amendment (5047) Set version to 2.3.0-b1 Ignore restructuring commits (4997) Recompute loops (4997) Rewrite includes (4997) Rearrange sources (4997) Move CMake directory (4997)
* upstream/develop: fix CTID in tx command returns invalidParams on lowercase hex (5049) Invariant: prevent a deleted account from leaving (most) artifacts on the ledger. (4663) Bump codecov plugin version to version 4.5.0 (5055) fix "account_nfts" with unassociated marker returning issue (5045)
* Limit number of held transactions popped for retry after a successful transaction. * Check and set the Transaction::mApplying flag under NetworkOPsImp's lock. * Reserve space when appending to transaction vector(s)
* upstream/develop: chore: remove repeat words (5053)
@ximinez I'm reading the changes in this PR and trying to think if our release payment load testing case would come across the processing path modified by this PR. Based on the highlevel overview:
|
* upstream/develop: Add xrpld build option and Conan package test (5052)
* upstream/develop: Update BUILD.md after PR 5052 (5067)
* upstream/develop: chore: Add comments to SignerEntries.h (5059) chore: Rename two files from Directory* to Dir*: (5058)
* upstream/develop: Ensure levelization sorting is ASCII-order across platforms (5072) fix: Fix NuDB build error via Conan patch (5061) Disallow filtering account_objects by unsupported types (5056)
* upstream/develop: Update gcovr EXCLUDE (5084) Fix crash inside `OverlayImpl` loops over `ids_` (5071) Set version to 2.3.0-b2 docs: Document the process for merging pull requests (5010) Remove unused constants from resource/Fees.h (4856) fix: change error for invalid `feature` param in `feature` RPC (5063) Set version to 2.2.1 Use error codes throughout fast Base58 implementation Improve error handling in some RPC commands
* upstream/develop: Factor out Transactor::trapTransaction (5087) Remove shards (5066)
* upstream/develop: Address rare corruption of NFTokenPage linked list (4945)
* upstream/develop: chore: Fix documentation generation job: (5091) chore: libxrpl verification on CI (5028)
* upstream/develop: docs: Update options documentation (5083) refactor: Remove dead headers (5081) refactor: Remove reporting mode (5092)
* upstream/develop: Update Release Notes for 2.2.1 and 2.2.2 Set version to 2.2.2 Allow only 1 job queue slot for each validation ledger check Allow only 1 job queue slot for acquiring inbound ledger. Track latencies of certain code blocks, and log if they take too long
* upstream/develop: test: Retry RPC commands to try to fix MacOS CI jobs (5120)
* upstream/develop: Set version to 2.3.0-b4 feat(SQLite): allow configurable database pragma values (5135) refactor: re-order PRAGMA statements (5140) fix(book_changes): add "validated" field and reduce RPC latency (5096) chore: fix typos in comments (5094) Set version to 2.2.3 Update SQLite3 max_page_count to match current defaults (5114)
* upstream/develop: Expand Error Message for rpcInternal (4959) docs: clean up API-CHANGELOG.md (5064)
* upstream/develop: Consolidate definitions of fields, objects, transactions, and features (5122) Ignore reformat when blaming Reformat code with clang-format-18 Update pre-commit hook Update clang-format settings Update clang-format workflow
* upstream/develop: docs: Add protobuf dependencies to linux setup instructions (5156) fix: reject invalid markers in account_objects RPC calls (5046) Update RELEASENOTES.md (5154) Introduce MPT support (XLS-33d): (5143)
* upstream/develop: Add hubs.xrpkuwait.com to bootstrap (5169)
* upstream/develop: Add AMMClawback Transaction (XLS-0073d) (5142)
* upstream/develop: Fix unity build (5179)
* upstream/develop: Set version to 2.3.0-rc1 Replace Uint192 with Hash192 in server_definitions response (5177) Fix potential deadlock (5124) Introduce Credentials support (XLS-70d): (5103) Fix token comparison in Payment (5172) Add fixAMMv1_2 amendment (5176)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change was thoroughly reviewed by former colleagues. I added a couple of questions. Once all the questions are answered, I will be happy to approve it.
src/xrpld/app/misc/HashRouter.h
Outdated
@@ -221,6 +235,7 @@ class HashRouter | |||
suppressionMap_; | |||
|
|||
std::chrono::seconds const holdTime_; | |||
std::chrono::seconds const relayTime_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make sense to have RelayTime
defined in the config. The 30s is not a formally proven optimal value. Once used in the liveness environment, we might want to adjust it.
The same can be said about the HoldTime
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HoldTime
has been 300s for over 11 years, so it's a pretty safe bet that's a good value. However, I see your point about being able to tweak RelayTime
.
I added a commit to add configuration options for these two values. I'm leaving them undocumented for now, because I don't want operators just randomly changing them for no particular reason.
src/xrpld/app/misc/NetworkOPs.cpp
Outdated
@@ -1271,6 +1296,17 @@ NetworkOPsImp::doTransactionSync( | |||
transaction->setApplying(); | |||
} | |||
|
|||
doTransactionSyncBatch( | |||
lock, [&transaction](std::unique_lock<std::mutex>& lock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]: It would drop lock
parameter name in lambda. This will clarify the intent of the lock being required by the function signature, and we didn't forget to acquire the lock inside the lambda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea. I also changed the function signature to require the lock to be const
in the callback.
src/ripple/app/misc/NetworkOPs.cpp
Outdated
@@ -1276,6 +1300,17 @@ NetworkOPsImp::doTransactionSync( | |||
transaction->setApplying(); | |||
} | |||
|
|||
doTransactionSyncBatch( | |||
lock, [&transaction](std::unique_lock<std::mutex>& lock) { | |||
return transaction->getApplying(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this optimistic locking approach doesn't create any harmful side effects, as you suggest @ximinez , it makes sense to keep the current approach. But let's put your explanation in a comment in the code to preserve it.
|
||
std::size_t count = 0; | ||
for (auto txNext = m_ledgerMaster.popAcctTransaction(txCur); | ||
txNext && count < maxPoppedTransactions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only pop a maximum of ten transactions, does this mean that over time, we will have a slow memory leak here of transactions that will never be used but will stay in the CanonicalTXSet indefinitely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because the mHeldTransactions
list that this function pulls from is also fully cleared out and retried all at once in LedgerMaster::applyHeldTransactions()
. applyHeldTransactions
is called during consensus when a new closed ledger is created from the open ledger to start the process.
There are restrictions on how many times a transaction can be put into mHeldTransactions
so that it can't cycle in and out indefinitely. Those are described at one of the call sites of addHeldTransaction()
* upstream/develop: fix: include `index` in `server_definitions` RPC (5190) Fix ledger_entry crash on invalid credentials request (5189)
* upstream/develop: Set version to 2.3.0-rc2
High Level Overview of Change
This PR, if merged, will improve transaction relay logic around a few edge cases.
(I'll write a single commit message later, but before this PR is squashed and merged.)
Context of Change
A few months ago, while examining some of the issues around the 2.0.0 release, and auditing transaction relay code, I identified a few areas with potential for improvement.
Type of Change
Before / After
This PR is divided into four mostly independent changes.
shouldRelay
limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point theHashRouter
entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).LedgerMaster
's held transactions if the transaction gets ater
,tel
, ortef
result. Old behavior was justter
.NetworkOPs::apply
which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.