Fixed the problem of using the same executor to process schema change requests and flush event requests, resulting in blocking timeout. #3858

linjianchang · 2025-01-14T10:17:37Z

When the source generates metadata events in parallel for each table, waiting for the flush to complete times out. The exception looks like:

yuxiqian

Thanks for @linjianchang's quick fix! Just left some trivial comments.

yuxiqian · 2025-01-14T11:16:28Z

...ntime/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaRegistry.java

@@ -96,6 +97,9 @@ public abstract class SchemaRegistry implements OperatorCoordinator, Coordinatio
    protected transient SchemaManager schemaManager;
    protected transient TableIdRouter router;

+    /** Executor service to execute handle event from operator. */
+    private final ExecutorService runInEventFromOperatorExecutor;


Keep its name consistent with the regular schema coordinator.

already modified

yuxiqian · 2025-01-14T11:19:21Z

...ntime/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaRegistry.java

@@ -253,6 +260,7 @@ public final CompletableFuture<CoordinationResponse> handleCoordinationRequest(
    public final void handleEventFromOperator(
            int subTaskId, int attemptNumber, OperatorEvent event) {
        runInEventLoop(
+                runInEventFromOperatorExecutor,


This seems incorrect. handleEventFromOperator should be submitted to the same single-threaded executor as the other methods, so that it is never scheduled simultaneously with other critical methods. Only SchemaCoordinator#startSchemaChange needs to be wrapped in another executor.

already modified

yuxiqian · 2025-01-14T11:19:58Z

...ntime/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaRegistry.java

IIUC this problem will only occur if:

Distributed schema evolution topology is created

Flush success event takes a while to finish

The last schema operator initiates request before any flush succeeds

Thus, the handler of FlushSuccessEvent waits for schema evolution to finish, while the busy-loop is still waiting to collect all flush success events.

Could you please add a test case to verify this change?

The test case org.apache.flink.cdc.runtime.operators.schema.distributed.SchemaEvolveTest#testLenientSchemaEvolution() already covers this case.

I'm afraid that's not sufficient, because the runtime unit tests use ValuesDataSink only. It takes no time to flush, so it might not be able to verify this.
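The blocking described in this thread can be reproduced outside Flink with plain java.util.concurrent. The following is only a minimal sketch under stated assumptions: the class and method names are hypothetical, a CountDownLatch stands in for "all flush success events collected", and none of this is the actual SchemaRegistry code. When the waiting task and the flush handler share one single-threaded executor, the handler stays queued behind the waiter until the wait times out; with a separate executor for the waiter, the flush is observed.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the coordinator; not Flink CDC code.
public class ExecutorStarvationSketch {

    /**
     * Submits a blocking "schema change" task first, then a "flush success" handler.
     * Returns true if the schema change task observed the flush before its timeout.
     */
    static boolean flushObservedInTime(
            ExecutorService eventLoop, ExecutorService schemaChangeExecutor) {
        CountDownLatch flushed = new CountDownLatch(1);
        // The schema change request waits (with a timeout) for the flush to be reported.
        Callable<Boolean> schemaChange = () -> flushed.await(500, TimeUnit.MILLISECONDS);
        // The flush success handler reports the flush.
        Runnable flushSuccessHandler = flushed::countDown;
        try {
            Future<Boolean> result = schemaChangeExecutor.submit(schemaChange);
            eventLoop.submit(flushSuccessHandler);
            return result.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Both tasks share one single-threaded executor: the handler is stuck behind the waiter. */
    static boolean withSharedExecutor() {
        ExecutorService shared = Executors.newSingleThreadExecutor();
        try {
            return flushObservedInTime(shared, shared);
        } finally {
            shared.shutdownNow();
        }
    }

    /** The waiter runs on its own executor, so the event loop stays free for the handler. */
    static boolean withSeparateExecutors() {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        ExecutorService schemaChangePool = Executors.newSingleThreadExecutor();
        try {
            return flushObservedInTime(eventLoop, schemaChangePool);
        } finally {
            eventLoop.shutdownNow();
            schemaChangePool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared executor sees flush: " + withSharedExecutor());
        System.out.println("separate executors see flush: " + withSeparateExecutors());
    }
}
```

In this sketch the ordering is forced (the waiter is always submitted first), which corresponds to the case where the request arrives before any flush succeeds. A sink that flushes instantly would usually report the flush before the waiter is submitted, which is why a zero-latency test sink may not trigger the timeout.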

yuxiqian · 2025-01-14T11:21:25Z

...ntime/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaRegistry.java

@@ -325,6 +337,7 @@ public final void resetToCheckpoint(long checkpointId, @Nullable byte[] checkpoi
     * directly, make sure you're running heavy logics inside, or the entire job might hang!
     */
    protected void runInEventLoop(
+            final ExecutorService coordinatorExecutor,


It's nice to allow specifying an ExecutorService when calling runInEventLoop. Maybe regular/SchemaCoordinator could also invoke this instead of the following:

    schemaChangeThreadPool.submit(
            () -> {
                try {
                    applySchemaChange(originalEvent, deducedSchemaChangeEvents);
                } catch (Throwable t) {
                    failJob(
                            "Schema change applying task",
                            new FlinkRuntimeException(
                                    "Failed to apply schema change event.", t));
                    throw t;
                }
            });
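As a rough illustration of this suggestion, a helper that takes the executor as a parameter can centralize the try/catch and failJob handling, so call sites only pass the action and a description. This is a sketch under assumptions, not the actual runInEventLoop signature (only its first parameter is visible in the diff above): the names runIn, ThrowingAction, and jobFailure are hypothetical, and unlike the snippet above it records the failure without rethrowing.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names and signatures are assumptions, not Flink CDC APIs.
public class RunInExecutorSketch {

    /** An action that may throw anything, similar in spirit to a throwing runnable. */
    @FunctionalInterface
    interface ThrowingAction {
        void run() throws Throwable;
    }

    /** Stand-in for failing the job: remembers the first failure only. */
    private final AtomicReference<Throwable> jobFailure = new AtomicReference<>();

    void failJob(String taskDescription, Throwable cause) {
        jobFailure.compareAndSet(null, new RuntimeException(taskDescription + " failed", cause));
    }

    /** Runs the action on the given executor and routes any failure to failJob. */
    Future<?> runIn(ExecutorService executor, ThrowingAction action, String taskDescription) {
        Runnable guarded =
                () -> {
                    try {
                        action.run();
                    } catch (Throwable t) {
                        failJob(taskDescription, t);
                    }
                };
        return executor.submit(guarded);
    }

    /** Submits a failing action to a dedicated pool and returns the recorded failure message. */
    static String demo() {
        RunInExecutorSketch coordinator = new RunInExecutorSketch();
        ExecutorService schemaChangeThreadPool = Executors.newSingleThreadExecutor();
        try {
            coordinator
                    .runIn(
                            schemaChangeThreadPool,
                            () -> {
                                throw new IllegalStateException("boom");
                            },
                            "Schema change applying task")
                    .get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            schemaChangeThreadPool.shutdownNow();
        }
        Throwable failure = coordinator.jobFailure.get();
        return failure == null ? null : failure.getMessage();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a helper of this shape, the same wrapper could serve both the single-threaded event loop and a separate schema change pool, depending on which executor the caller passes in.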


linjianchang · 2025-01-17T02:54:44Z

Already modified according to comment @yuxiqian

yuxiqian · 2025-01-21T01:44:04Z

It seems there's something wrong with the internal state switching logic. Would @linjianchang like to take a further look?

github-actions bot added the runtime label Jan 14, 2025

yuxiqian suggested changes Jan 14, 2025


[FLINK-37110][cdc-runtime] Fixed the problem of using the same executor to process schema change requests and flush event requests, resulting in blocking timeout.

0eaf9ef

linjianchang force-pushed the master-37110 branch from 032e560 to 0eaf9ef Compare January 17, 2025 02:41
