[SPARK-54165][CONNECT] Add BatchExecutePlan RPC and reattach support #52852
Conversation
### What changes were proposed in this pull request?

This PR introduces a new BatchExecutePlan RPC for Spark Connect that allows clients to submit multiple execution plans in a single batch operation, minimizing RPC overhead. It also adds reattach support for both Scala and Python clients to consume results from batch-submitted operations.

Key changes:

1. New BatchExecutePlan RPC with a unary request/response model
2. Support for client-provided operation IDs
3. Rollback mechanism for submission failures
4. Reattach methods in the Scala and Python clients
5. Operations marked as reattachable for later result consumption

### Why are the changes needed?

- **Performance**: Reduces RPC overhead when submitting multiple operations
- **Flexibility**: Allows a fire-and-forget execution pattern with later result retrieval
- **Control**: Provides rollback capability for submission failures

### Does this PR introduce any user-facing change?

Yes. New APIs added:

- Scala: `SparkConnectClient.batchExecute()` and `reattach()`
- Python: `SparkConnectClient.batch_execute()` and `reattach_execute()`

### How was this patch tested?

- 11 server-side tests in SparkConnectBatchExecuteSuite
- 11 client-side tests in SparkConnectClientBatchExecuteSuite
- All tests passing with proper formatting and linting

### Key Implementation Details

- `rollbackOnFailure` only applies to submission failures (invalid UUID, duplicate operation ID), not execution failures
- Operations are submitted sequentially and execute independently
- Operations are marked as reattachable, allowing clients to consume results via reattach
- Comprehensive documentation clarifying the behavior

Closes #XXXXX
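To make the new APIs concrete, here is a minimal, hypothetical usage sketch in Scala. The method names follow the description above, but the exact signatures (argument lists, return types) are assumptions, as are the helpers `buildPlans` and `handleResponse`:

```scala
import org.apache.spark.connect.proto
import org.apache.spark.sql.connect.client.SparkConnectClient
import java.util.UUID

// Hypothetical helpers, not part of this PR: construct the plans to submit
// and consume each streamed response.
def buildPlans(): Seq[proto.Plan] = ???
def handleResponse(r: proto.ExecutePlanResponse): Unit = ???

val client: SparkConnectClient = ???  // an existing, connected client

// Client-provided operation IDs must be valid UUIDs.
val plans = buildPlans()
val operationIds = plans.map(_ => UUID.randomUUID().toString)

// Submit all plans in one BatchExecutePlan RPC (assumed signature). With
// rollbackOnFailure set, a submission failure (e.g. an invalid UUID or a
// duplicate operation ID) rolls back the batch; execution failures do not.
client.batchExecute(plans, operationIds, rollbackOnFailure = true)

// Fire-and-forget: because batch-submitted operations are reattachable,
// results can be consumed later by reattaching with the operation ID.
for (opId <- operationIds) {
  client.reattach(opId).foreach(handleResponse)
}
```

The same flow would apply to the Python client via `batch_execute()` and `reattach_execute()`.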
**dongjoon-hyun** left a comment:
Hi, @grundprinzip.
If you don't mind, please file a JIRA issue in the ASF community repository. It will help prevent potential accidents like the following commits.
**grundprinzip** replied:

Absolutely, I'll do that as soon as it's ready to leave draft mode.
**dongjoon-hyun** left a comment:
> Absolutely, I'll do that as soon as it's ready to leave draft mode.

Got it. I'll block this PR until then, because the previous mistake, SPARK-52762, followed the same pattern: the PR stayed in Draft status until it mistakenly turned into incorrect commits.
**grundprinzip** replied:

I created a JIRA for it.

**dongjoon-hyun** replied:

Thank you!