-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-1376] Push data failed should always release request body #2449
Conversation
} | ||
|
||
@Override | ||
public void operationComplete(Future<? super Void> future) throws Exception { | ||
super.operationComplete(future); | ||
if (rpcSendOutCallback != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this will affect buffer release from flink, IMO we can retain the current setup and remove the rpcFailureCallback call from handleFailure, since operationComplete will be invoked regardless of whether the channel fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about current?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about current?
Make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. But there are two nits I think it would be better to change.
common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java
Outdated
Show resolved
Hide resolved
Thanks. Merged into main(v0.5.0) and branch-0.4(v0.4.1) |
### What changes were proposed in this pull request? Worker netty not release <img width="1729" alt="截屏2024-04-07 17 26 40" src="https://github.com/apache/celeborn/assets/46485123/5774f735-570b-448e-ab94-4c78661717f5"> Many push failed <img width="767" alt="截屏2024-04-07 17 27 46" src="https://github.com/apache/celeborn/assets/46485123/41866bd0-d634-4dbf-8518-b474c8d1faad"> 1. For spark shuffle client, enable it release push data body when rpc failure 2. For flink client, since it use wrapped bytbuf, we need release push data body when rpc failure and release origin body when rpc completed. 3. For worker replicate, we should enable it release push data body when rpc failure. ### Why are the changes needed? Avoid worker netty memory leak ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #2449 from AngersZhuuuu/CELEBORN-1376. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com> (cherry picked from commit b65b543) Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"When an outbound (a.k.a. downstream) message reaches at the beginning of the pipeline, Netty will release it after writing it out." So why is pushdata not released when rpc failure occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because in this case error/Exception occurs before data is sent to pipeline. future.isSuccess()
returns false in StdChannelListener#operationComplete
What changes were proposed in this pull request?
Worker netty not release
Many push failed
Why are the changes needed?
Avoid worker netty memory leak
Does this PR introduce any user-facing change?
How was this patch tested?