[CELEBORN-1376] Push data failed should always release request body #2449

Closed
wants to merge 19 commits into from

Contributor

@AngersZhuuuu AngersZhuuuu commented Apr 7, 2024

What changes were proposed in this pull request?

Worker Netty memory is not released:
[Screenshot 2024-04-07 17 26 40]

Many pushes fail:
[Screenshot 2024-04-07 17 27 46]

  1. For the Spark shuffle client, release the push data body when the RPC fails.
  2. For the Flink client, since it uses a wrapped ByteBuf, release the push data body when the RPC fails and release the origin body when the RPC completes.
  3. For worker replication, release the push data body when the RPC fails.
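
The Flink case in item 2 involves two buffers, which can be sketched roughly as below. This is a simplified model, not Celeborn or Netty code: `CountedBuf`, `Transport`, and `FlinkPushSketch` are hypothetical names, and the sketch assumes the wrapped body and the origin body carry separate reference counts and that a successful write releases the body it was handed.

```java
// Hypothetical stand-in for a reference-counted buffer; not Netty's ByteBuf.
final class CountedBuf {
    private int refCnt = 1;

    int refCnt() {
        return refCnt;
    }

    // Drops one reference and fails loudly on a double release.
    void release() {
        if (refCnt == 0) {
            throw new IllegalStateException("buffer already released");
        }
        refCnt--;
    }
}

final class FlinkPushSketch {
    // Hypothetical transport. Assumption: it releases the body and returns true
    // when the write succeeds, and returns false without touching the body
    // when the write fails.
    interface Transport {
        boolean write(CountedBuf body);
    }

    // Pushes the wrapped body and makes sure neither buffer stays referenced.
    static boolean push(Transport transport, CountedBuf origin, CountedBuf wrapped) {
        boolean ok = transport.write(wrapped);
        if (!ok) {
            // RPC failure: the transport never took ownership, so release here.
            wrapped.release();
        }
        // RPC completed, successfully or not: the origin body is always released.
        origin.release();
        return ok;
    }
}
```

In this model both reference counts reach zero on either outcome; without the failure branch, the wrapped body would keep a count of 1 after a failed push.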

Why are the changes needed?

To avoid a Netty memory leak on the worker.

Does this PR introduce any user-facing change?

How was this patch tested?

  }

  @Override
  public void operationComplete(Future<? super Void> future) throws Exception {
    super.operationComplete(future);
    if (rpcSendOutCallback != null) {
Contributor

This changes the behavior. Is that OK? cc @RexXiong @FMX

Contributor

IIRC this will affect buffer release in Flink. IMO we can keep the current setup and remove the rpcFailureCallback call from handleFailure, since operationComplete is invoked regardless of whether the channel fails.
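
A rough illustration of that suggestion, using `java.util.concurrent.CompletableFuture` as a stand-in for the Netty write future. `CompletionSketch`, `track`, and `releaseCount` are hypothetical names, and the counter only models "release was called"; the real hook in this PR is `operationComplete`.

```java
import java.util.concurrent.CompletableFuture;

final class CompletionSketch {
    // Counts how many times the release logic ran (stand-in for a real release call).
    int releaseCount = 0;

    // Registers a single completion hook. It runs whether the write future
    // completes normally or exceptionally, so no separate failure path is needed.
    CompletableFuture<Void> track(CompletableFuture<Void> writeFuture) {
        return writeFuture.whenComplete((ignored, error) -> releaseCount++);
    }
}
```

Since the hook fires exactly once per future on either outcome, also releasing from a dedicated failure handler would release the same buffer twice in this model.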

Contributor Author

How about the current version?

Contributor

> How about the current version?

Makes sense.

Contributor

@RexXiong RexXiong left a comment

LGTM, thanks!

Contributor

@FMX FMX left a comment

LGTM overall, but there are two nits that I think would be better to change.

Contributor

FMX commented Apr 10, 2024

Thanks. Merged into main (v0.5.0) and branch-0.4 (v0.4.1).

@FMX FMX closed this in b65b543 Apr 10, 2024
FMX pushed a commit that referenced this pull request Apr 10, 2024
### What changes were proposed in this pull request?
Worker Netty memory is not released:
<img width="1729" alt="Screenshot 2024-04-07 17 26 40" src="https://github.com/apache/celeborn/assets/46485123/5774f735-570b-448e-ab94-4c78661717f5">

Many pushes fail:
<img width="767" alt="Screenshot 2024-04-07 17 27 46" src="https://github.com/apache/celeborn/assets/46485123/41866bd0-d634-4dbf-8518-b474c8d1faad">

1. For the Spark shuffle client, release the push data body when the RPC fails.
2. For the Flink client, since it uses a wrapped ByteBuf, release the push data body when the RPC fails and release the origin body when the RPC completes.
3. For worker replication, release the push data body when the RPC fails.

### Why are the changes needed?
To avoid a Netty memory leak on the worker.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2449 from AngersZhuuuu/CELEBORN-1376.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
(cherry picked from commit b65b543)
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

"When an outbound (a.k.a. downstream) message reaches at the beginning of the pipeline, Netty will release it after writing it out." So why is the push data not released when an RPC failure occurs?

Contributor

That's because in this case the error or exception occurs before the data is sent to the pipeline. In that case future.isSuccess() returns false in StdChannelListener#operationComplete.
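
That ownership rule can be modeled in a few lines. The sketch below is a simplified model rather than Netty code: `Msg`, `Stage`, and `send` are hypothetical names. The last step plays the role of the head of the pipeline, which releases the message after writing it out. If an earlier stage throws, the message never gets that far, so the sender has to release it.

```java
import java.util.List;

final class PipelineSketch {
    // Hypothetical reference-counted message; not a Netty type.
    static final class Msg {
        int refCnt = 1;

        void release() {
            refCnt--;
        }
    }

    // Hypothetical outbound stage that may throw before the message is written.
    interface Stage {
        void process(Msg msg);
    }

    // Returns true when the message reached the head of the pipeline and was written.
    static boolean send(List<Stage> stages, Msg msg) {
        try {
            for (Stage stage : stages) {
                stage.process(msg);
            }
        } catch (RuntimeException e) {
            // Failure before the head: nothing downstream released the message,
            // so the sender releases it to avoid a leak.
            msg.release();
            return false;
        }
        // Head of the pipeline: the message is written out, then released.
        msg.release();
        return true;
    }
}
```

Dropping the `release()` call in the `catch` branch reproduces the leak in this model: the failed message keeps a reference count of 1 and is never freed.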
