[CELEBORN-1376] Push data failed should always release request body #2449

AngersZhuuuu · 2024-04-07T09:28:13Z

What changes were proposed in this pull request?

Worker Netty memory is not released.

Many push requests fail.

1. For the Spark shuffle client, release the push data body when the RPC fails.
2. For the Flink client, which uses a wrapped ByteBuf, release the push data body when the RPC fails and release the original body when the RPC completes.
3. For worker replication, release the push data body when the RPC fails.
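
The three cases above can be sketched with a toy model. This is a minimal illustration in plain Java, not the code in this PR: `CountedBuffer`, `PushCleanup`, and `onRpcDone` are hypothetical names, and the sketch assumes that "completed" covers both success and failure and that nothing else frees the push data body after a failed send.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted buffer; a hypothetical stand-in, not Netty's ByteBuf. */
class CountedBuffer {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  int refCnt() {
    return refCnt.get();
  }

  /** Drops one reference and rejects a release of an already-freed buffer. */
  void release() {
    if (refCnt.getAndDecrement() <= 0) {
      refCnt.incrementAndGet(); // undo the decrement before failing
      throw new IllegalStateException("buffer released twice");
    }
  }
}

/** Hypothetical cleanup hook covering the three cases listed above. */
class PushCleanup {
  private final CountedBuffer pushBody;   // body attached to the push request
  private final CountedBuffer originBody; // buffer wrapped by pushBody, or null if none

  PushCleanup(CountedBuffer pushBody, CountedBuffer originBody) {
    this.pushBody = pushBody;
    this.originBody = originBody;
  }

  /** Assumed to be called exactly once when the RPC finishes. */
  void onRpcDone(boolean success) {
    if (!success) {
      pushBody.release(); // failed send: this sketch assumes nobody else frees the body
    }
    if (originBody != null) {
      originBody.release(); // wrapped case: the origin buffer is freed on completion
    }
  }
}

class PushCleanupDemo {
  public static void main(String[] args) {
    CountedBuffer body = new CountedBuffer();
    CountedBuffer origin = new CountedBuffer();
    new PushCleanup(body, origin).onRpcDone(false);
    System.out.println(body.refCnt() + " " + origin.refCnt()); // prints "0 0"
  }
}
```

Passing `null` as the origin body models the Spark and worker-replication cases, where only the push data body needs attention.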

Why are the changes needed?

Avoids a Netty memory leak on the worker.

Does this PR introduce any user-facing change?

How was this patch tested?

waitinfuture · 2024-04-07T15:34:54Z

common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java

    }

    @Override
    public void operationComplete(Future<? super Void> future) throws Exception {
      super.operationComplete(future);
-      if (rpcSendOutCallback != null) {


This changes the behavior. Is that OK? cc @RexXiong @FMX

IIRC, this will affect buffer release in the Flink client. IMO we can keep the current setup and remove the rpcFailureCallback call from handleFailure, since operationComplete is invoked regardless of whether the channel fails.
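
To illustrate the suggestion, here is a minimal plain-Java sketch, not Celeborn's actual listener. `SketchListener` and its `cleanup` hook are hypothetical; only the method names `handleFailure` and `operationComplete` come from the comment above. What the hook should do on success versus failure is deliberately left out. The point is only that cleanup placed in `operationComplete` runs once per request, whereas also calling it from `handleFailure` would run it twice on the failure path.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener; method names echo the thread, the body is illustrative only. */
class SketchListener {
  final List<String> events = new ArrayList<>();
  private final Runnable cleanup; // e.g. a hook that releases the request body

  SketchListener(Runnable cleanup) {
    this.cleanup = cleanup;
  }

  /** Failure path: records the error and deliberately does no cleanup. */
  void handleFailure(Throwable cause) {
    events.add("failure: " + cause.getMessage());
  }

  /** Invoked for every finished send, so cleanup here runs once per request. */
  void operationComplete(boolean success, Throwable cause) {
    if (!success) {
      handleFailure(cause);
    }
    cleanup.run();
    events.add("cleanup");
  }
}

class SketchListenerDemo {
  public static void main(String[] args) {
    int[] releases = {0};
    SketchListener listener = new SketchListener(() -> releases[0]++);
    listener.operationComplete(false, new RuntimeException("channel closed"));
    // prints "1 [failure: channel closed, cleanup]"
    System.out.println(releases[0] + " " + listener.events);
  }
}
```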

How about the current version?

Makes sense.

RexXiong

LGTM, thanks!

FMX

LGTM overall, but there are two nits that I think would be better to change.

FMX · 2024-04-10T11:41:23Z

Thanks. Merged into main (v0.5.0) and branch-0.4 (v0.4.1).

### What changes were proposed in this pull request?

Worker Netty memory is not released.

<img width="1729" alt="Screenshot 2024-04-07 17 26 40" src="https://github.com/apache/celeborn/assets/46485123/5774f735-570b-448e-ab94-4c78661717f5">

Many push requests fail.

<img width="767" alt="Screenshot 2024-04-07 17 27 46" src="https://github.com/apache/celeborn/assets/46485123/41866bd0-d634-4dbf-8518-b474c8d1faad">

1. For the Spark shuffle client, release the push data body when the RPC fails.
2. For the Flink client, which uses a wrapped ByteBuf, release the push data body when the RPC fails and release the original body when the RPC completes.
3. For worker replication, release the push data body when the RPC fails.

### Why are the changes needed?

Avoids a Netty memory leak on the worker.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2449 from AngersZhuuuu/CELEBORN-1376.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
(cherry picked from commit b65b543)
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

littlexyw · 2024-04-30T08:19:03Z

common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java

"When an outbound (a.k.a. downstream) message reaches at the beginning of the pipeline, Netty will release it after writing it out." So why is the push data not released when an RPC failure occurs?

That's because, in this case, the error or exception occurs before the data is sent to the pipeline. future.isSuccess() returns false in StdChannelListener#operationComplete.
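
The ownership argument in this answer can be pictured with a toy model. The sketch below is plain Java with hypothetical names (`ToyMessage`, `ToySender`), not Netty or Celeborn code. It assumes only what the exchange states: a message that reaches the pipeline is released there after being written out, while a message that fails earlier is still owned by the caller, who has to release it once the send reports failure.

```java
/** Toy message with a manual reference count (hypothetical, not a Netty type). */
class ToyMessage {
  private int refCnt = 1;

  int refCnt() {
    return refCnt;
  }

  void release() {
    if (refCnt == 0) {
      throw new IllegalStateException("already released");
    }
    refCnt--;
  }
}

class ToySender {
  /** Returns true when the message reached the pipeline, which then releases it. */
  static boolean write(ToyMessage msg, boolean failBeforePipeline) {
    if (failBeforePipeline) {
      return false; // never handed over, so the caller still owns msg
    }
    msg.release(); // models the pipeline releasing the message after writing it out
    return true;
  }

  /** Sends msg and releases it on the failure path, like a check of isSuccess(). */
  static void push(ToyMessage msg, boolean failBeforePipeline) {
    boolean success = write(msg, failBeforePipeline);
    if (!success) {
      msg.release(); // without this branch the failed message would leak
    }
  }
}

class ToySenderDemo {
  public static void main(String[] args) {
    ToyMessage sent = new ToyMessage();
    ToyMessage failed = new ToyMessage();
    ToySender.push(sent, false);
    ToySender.push(failed, true);
    System.out.println(sent.refCnt() + " " + failed.refCnt()); // prints "0 0"
  }
}
```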

AngersZhuuuu added 13 commits April 7, 2024 17:25

[CELEBORN-1376] Push data failed should always release request body

51c7cc7

Update TransportClient.java

e7fdfba

Revert "Update TransportClient.java"

7d85518

This reverts commit e7fdfba.

Revert "[CELEBORN-1376] Push data failed should always release reques…

ab29738

…t body" This reverts commit 51c7cc7.

Update TransportClient.java

4846963

Update TransportClient.java

f8484c8

Revert "Update TransportClient.java"

e2702fd

This reverts commit f8484c8.

Revert "Update TransportClient.java"

2bc8360

This reverts commit 4846963.

Revert "Revert "[CELEBORN-1376] Push data failed should always releas…

371af05

…e request body"" This reverts commit ab29738.

Update FlinkShuffleClientImpl.java

7750a6b

Update TransportClient.java

f448c4e

update

4d8f416

Revert "update"

6e32149

This reverts commit 4d8f416.

waitinfuture reviewed Apr 7, 2024

AngersZhuuuu added 2 commits April 8, 2024 10:43

update

c2613fc

Update FlinkShuffleClientImplSuiteJ.java

f687735

RexXiong approved these changes Apr 8, 2024

FMX reviewed Apr 8, 2024

AngersZhuuuu added 2 commits April 8, 2024 14:35

update

f1986ec

Update TransportClient.java

b81e8f1

FMX approved these changes Apr 10, 2024

AngersZhuuuu added 2 commits April 10, 2024 15:26

Update TransportClient.java

c5f067c

Update TransportClient.java

942946f

FMX closed this in b65b543 Apr 10, 2024

littlexyw reviewed Apr 30, 2024

