
KAFKA-15370: Support Participation in 2PC (KIP-939) (K/N) #19751


Open
wants to merge 1 commit into trunk

Conversation

artemlivshits
Contributor

Support keepPreparedTxn=true on transaction coordinator. The ongoing transaction needs to be preserved (and its producerId / epoch needs to be preserved as well so that we can send transaction markers), but the client-facing producerId / epoch needs to move forward, so now we have 2 id / epoch pairs that need to be maintained during the Ongoing and Prepare states.

To simplify the state machine implementation, the transaction coordinator now eagerly rejects requests during the prepareCommit / prepareAbort states with the CONCURRENT_TRANSACTIONS error - the client would get this error anyway, and we avoid handling a complex combination of conditions that try to distinguish between retries and invalid requests.
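As a rough illustration of the dual bookkeeping described above, here is a minimal sketch assuming a simplified shape for the metadata (the field names nextProducerId / nextProducerEpoch appear in the diff excerpts below; everything else, including the class names, is illustrative rather than the actual TransactionMetadata code):

// Simplified sketch, not the coordinator implementation: shows the two
// producerId / epoch pairs a kept ("prepared") transaction has to track.
final case class TxnIdAndEpoch(producerId: Long, producerEpoch: Short)

final case class PreparedTxnStateSketch(
  // Pair of the ongoing transaction itself; preserved so the coordinator
  // can still write transaction markers with it.
  ongoing: TxnIdAndEpoch,
  // Client-facing pair returned by InitProducerId(keepPreparedTxn=true);
  // it moves forward so stale producers can be fenced.
  clientFacing: TxnIdAndEpoch
)

object PreparedTxnStateSketch {
  // keepPreparedTxn=true: keep the ongoing pair, bump only the client-facing epoch.
  def keepPrepared(current: TxnIdAndEpoch): PreparedTxnStateSketch =
    PreparedTxnStateSketch(
      ongoing = current,
      clientFacing = current.copy(producerEpoch = (current.producerEpoch + 1).toShort)
    )
}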

github-actions bot added the triage (PRs from the community), core (Kafka Broker), and transactions (Transactions and EOS) labels on May 17, 2025
// (including PrepareCommit and PrepareAbort) in already complex state machine.
if (txnMetadata.pendingTransitionInProgress && !txnMetadata.pendingState.contains(TransactionState.PREPARE_EPOCH_FENCE)) {
Left(Errors.CONCURRENT_TRANSACTIONS)
} else if (txnMetadata.state == TransactionState.PREPARE_COMMIT || txnMetadata.state == TransactionState.PREPARE_ABORT) {
Contributor Author

Here we eagerly reject requests if the transaction is in the PrepareCommit / PrepareAbort state. Previously the code tried to validate the state and return a fenced error on mismatch, but in the end, if the checks were successful, the request still got rejected with the concurrent transactions error. The checks grew extremely complex with the new nextProducerEpoch state, so we avoid this complexity altogether by just returning the concurrent transactions error for this transient state.

@@ -109,7 +125,7 @@ private[transaction] class TransactionMetadata(val transactionalId: String,
def prepareIncrementProducerEpoch(newTxnTimeoutMs: Int,
expectedProducerEpoch: Option[Short],
updateTimestamp: Long): Either[Errors, TxnTransitMetadata] = {
- if (isProducerEpochExhausted)
+ if (TransactionMetadata.isEpochExhausted(producerEpoch))
Contributor Author

This changed because isProducerEpochExhausted used to check producerEpoch, but it now checks clientProducerEpoch (which can sometimes be nextProducerEpoch). In this place we still want to check producerEpoch, hence the explicit call.
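A minimal sketch of the distinction, assuming clientProducerEpoch resolves to nextProducerEpoch once one has been assigned (only the identifiers quoted from the diff are real; the class shape and the exhaustion threshold are assumptions):

// Illustrative sketch only, not the actual TransactionMetadata implementation.
object TransactionMetadataSketch {
  // Companion-style helper: checks a specific epoch value.
  def isEpochExhausted(epoch: Short): Boolean = epoch >= Short.MaxValue - 1
}

class TransactionMetadataSketch(var producerEpoch: Short, var nextProducerEpoch: Short) {
  // Client-facing epoch: nextProducerEpoch once keepPreparedTxn has set it.
  def clientProducerEpoch: Short =
    if (nextProducerEpoch != -1) nextProducerEpoch else producerEpoch

  // The instance method now checks the client-facing epoch ...
  def isProducerEpochExhausted: Boolean =
    TransactionMetadataSketch.isEpochExhausted(clientProducerEpoch)

  // ... so a call site that still cares about the ongoing transaction's epoch
  // (as prepareIncrementProducerEpoch does here) calls the helper explicitly.
  def ongoingEpochExhausted: Boolean =
    TransactionMetadataSketch.isEpochExhausted(producerEpoch)
}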

Member

To clarify for myself:
This method would not be called from keepPreparedTxn=true InitProducerId calls, and thus we should update the ongoing transaction's producer epoch and not the client epoch (which in some cases is nextProducerEpoch).

If we call this method, how does that affect nextProducerEpoch/Id? I could imagine we bump the epoch here such that it matches the nextProducerEpoch.

github-actions bot removed the triage (PRs from the community) label on May 18, 2025
responseCallback(initTransactionError(Errors.UNSUPPORTED_VERSION))
} else if (keepPreparedTxn && expectedProducerIdAndEpoch.nonEmpty) {
// keepPreparedTxn can only be specified in the initial call of InitProducerId, so
// no exposed producerId / epoch should be set
Contributor

nit: did we mean to say expected or exposed?

Contributor Author

Should be "expected", thanks

// When InitProducerId keeps an Ongoing transaction (which is "prepared" from
// the 2PC protocol perspective) it needs to preserve the producerId / epoch of
// the transaction (that will be used for sending markers), but update the
// producerId / epoch that's going ot be used by client (so that we could fence
Contributor

nit: ot -> to

@@ -174,6 +190,29 @@ private[transaction] class TransactionMetadata(val transactionalId: String,
)
}

def prepareIncrementProducerEpochForOngoing(updateTimestamp: Long): TxnTransitMetadata = {
Contributor
rreddy-22 May 21, 2025

nit: could we maybe name this to specify that we're updating the "next" producer epoch, and only in case of prepared txns? Perhaps even add a Javadoc. Something like prepareNextProducerEpochForOngoingTxn?
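For what that could look like, a sketch under the assumption that the renamed method keeps the body introduced in this PR (the name prepareNextProducerEpochForOngoingTxn is taken from the suggestion above; the Javadoc wording is an assumption, not the PR's):

/**
 * Prepare a transition that sets the next (client-facing) producer epoch for an
 * ongoing transaction kept across InitProducerId with keepPreparedTxn=true,
 * i.e. a transaction that is "prepared" from the 2PC protocol perspective.
 * The ongoing transaction's producerId / producerEpoch are preserved so the
 * coordinator can still write transaction markers with them.
 */
def prepareNextProducerEpochForOngoingTxn(updateTimestamp: Long): TxnTransitMetadata =
  prepareIncrementProducerEpochForOngoing(updateTimestamp) // delegates to / renames the method above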

@@ -250,7 +300,7 @@ private[transaction] class TransactionMetadata(val transactionalId: String,
* Check if the epochs have been exhausted for the current producerId. We do not allow the client to use an
Contributor

Can we update the javadoc to specify that this method checks whether the client producer epoch is exhausted, and what that entails with respect to nextProducerEpoch and producerEpoch?

// then we also need to bump this epoch -- we want to fence requests that were issued with that
// epoch after the transaction is complete. Consider this example:
// 1. (initial state): Ongoing, producerId = 42, producerEpoch = 100, nextProducerId = -1, nextProducerEpoch = -1
// 2. InitProducerId(keepPreparedTxn): Ongoing, producerId = 42, producerEpoch = 100, nextProducerId = 42, nextProducerEpoch = 101
Member

At step 2, are we returning the epoch 100 or 101 to the client? In other words, what is the epoch we send at step 3?

Labels
ci-approved, core (Kafka Broker), transactions (Transactions and EOS)

4 participants