Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1256] Added internal port and auth support to Celeborn worker. #2292

Closed
wants to merge 10 commits into from

Conversation

otterc
Copy link
Contributor

@otterc otterc commented Feb 9, 2024

What changes were proposed in this pull request?

This adds an internal port and auth support to Celeborn Wokers.

  1. Internal port is used by a worker to receive messages from Celeborn Master.
  2. Authentication support for secure communication with clients. This change doesn't add the support in clients to communicate to the Workers securely. That will be in a future change.

This change targets just adding the port and auth support to Worker. The following items from the proposal are still pending:

  • Persisting the app secrets in Ratis.
  • Forwarding secrets to Workers and having ability for the workers to pull registration info from the Master.
  • Secured communication between workers and clients.

Why are the changes needed?

It is needed for adding authentication support to Celeborn (CELEBORN-1011)

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Part of a bigger change. For this change, only modified existing UTs.

@otterc
Copy link
Contributor Author

otterc commented Feb 9, 2024

@RexXiong @FMX @pan3793 @mridulm @akpatnam25 @rmcyang @waitinfuture
Could you please help review?

Copy link

codecov bot commented Feb 9, 2024

Codecov Report

Attention: Patch coverage is 82.75862% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 48.69%. Comparing base (7b0211e) to head (78923c4).
Report is 9 commits behind head on main.

Files Patch % Lines
...ache/celeborn/common/network/TransportContext.java 0.00% 1 Missing ⚠️
...cala/org/apache/celeborn/common/CelebornConf.scala 85.72% 1 Missing ⚠️
...a/org/apache/celeborn/common/meta/WorkerInfo.scala 93.34% 1 Missing ⚠️
...born/common/protocol/message/ControlMessages.scala 0.00% 1 Missing ⚠️
...apache/celeborn/common/rpc/netty/NettyRpcEnv.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2292      +/-   ##
==========================================
- Coverage   48.83%   48.69%   -0.14%     
==========================================
  Files         207      207              
  Lines       12939    12965      +26     
  Branches     1111     1113       +2     
==========================================
- Hits         6318     6312       -6     
- Misses       6218     6249      +31     
- Partials      403      404       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@otterc otterc marked this pull request as draft February 10, 2024 01:11
@otterc
Copy link
Contributor Author

otterc commented Feb 10, 2024

We will need separate secured ports for push and fetch for allowing both authenticated and un-authenticated clients to be able to communicate.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick pass through it.

Given the number of ports, I am wondering if we should encapsulate all port configs in a class in itself - and localize all port related validation and config there (including RpcEnv creation, etc)

@@ -59,6 +59,8 @@ public static String getFileName(String uniqueId, Mode mode) {
private int pushPort;
private int fetchPort;
private int replicatePort;
private int internalPort;
private int securedPort;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a note on when which port is to be used/is valid ?
Either here, or as javadoc to the get*Port methods below ?

@@ -86,19 +90,40 @@ private[celeborn] class Worker(
conf,
Math.min(64, Math.max(4, Runtime.getRuntime.availableProcessors())))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to support for disabling the existing behavior via a config in a future pr ? (workerArgs.port - which is not secure and is used for both s2s and c2s connections)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will be making that change in a future pr. I haven't created a jira yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@otterc
Copy link
Contributor Author

otterc commented Feb 13, 2024

Given the number of ports, I am wondering if we should encapsulate all port configs in a class in itself - and localize all port related validation and config there (including RpcEnv creation, etc)

Can this be done in a separate change, as it will require refactoring of existing code?

@otterc otterc marked this pull request as ready for review February 13, 2024 07:31
@otterc
Copy link
Contributor Author

otterc commented Feb 13, 2024

@RexXiong @FMX @pan3793 @mridulm @akpatnam25 @rmcyang @waitinfuture
This PR is ready for review now.

"Internal server port on the Worker where the master nodes connect.")
.version("0.5.0")
.intConf
.createWithDefault(0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these new ports, is 0 the correct default value? I noticed that other ports in this conf have rather specific default values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the corresponding un-secured ports, the default is 0 as well. When it is 0, then the system picks an ephemeral port. For these ports, that is desired.

@@ -36,6 +36,10 @@ class WorkerInfo(
val pushPort: Int,
val fetchPort: Int,
val replicatePort: Int,
val internalPort: Int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to add these secure ports(It looks a little complicated beside insecure ports), IMO the celeborn service is either open securely or not. It seems strange to keep secure connection beside with insecure connection, if so it's still insecure for celeborn service?

For users who want to use security features, they can create a new Celeborn service to instead and finally migrate other tasks to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RexXiong Interesting, so you mean that we don't need to support seamless rolling upgrade from previously unsecured cluster to secured cluster, and one cluster can be either secured or unsecured, not both. Supporting mixed mode for rolling upgrade does introduces complexity. WDYT @otterc @mridulm @pan3793 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our last sync about authentication on September 23rd, we concluded, based on community feedback, that we would add separate ports for secured communication to allow for seamless rolling upgrades. In that meeting, we decided that it would be akin to supporting both HTTP and HTTPS by a web server. Here is the link to the summary: https://www.mail-archive.com/dev@celeborn.apache.org/msg00618.html.

However, IMO, allowing both secured and unsecured communication adds complexity and creates the possibility of issues. There is a way to achieve seamless rolling upgrades without adding new secured ports to Celeborn: by deploying another cluster of Celeborn that is secured (with different configured ports for Master/Worker) and leaving the existing unsecured cluster as it is. However, this creates more toil for the deployment/monitoring infrastructure.

We are not currently running Celeborn in production, and we will only run it in production with authentication enabled. Therefore, seamless rolling upgrades are not a problem for us, but we were addressing the community's concern.

If the community agrees that adding dedicated secured ports is not something we want to proceed with, I am willing to make those changes.

cc. @waitinfuture @mridulm @pan3793 @RexXiong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @otterc for the explanation. I agree with you that

allowing both secured and unsecured communication adds complexity and creates the possibility of issues

, and I believe @RexXiong have the same opinion. IMO whether to support rolling upgrade should depend on whether auth is enabled, as the following form describes:

Auth enabled before Auth enabled after Should support rolling upgrade
No No Yes
No Yes No
Yes Yes Yes
Yes No No

In summary, we should support rolling upgrade only if auth configuration stays the same before and after, and don't support otherwise.

WDYT? @otterc @RexXiong @mridulm @pan3793

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waitinfuture @otterc, sorry I may have lost track of some aspects of the original design, but this really adds a lot of complexity, I agree with summary about we should support rolling upgrade only if auth configuration stays the same

Copy link
Member

@pan3793 pan3793 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems there is more complexity than I first thought.

given the above discussion plus feedback from Celeborn users in other channels, most community members tend to downgrade the backward compatibility to "support rolling upgrade only if auth configuration stays the same", so I'm fine with this decision

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, as @otterc mentioned, this will significantly reduce the complexity !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Will remove the additional secure ports. We have already committed a secure RPC port for Master for which I will create a separate PR to remove. Please note that we do need internal ports for the servers to communicate with each other. The internal ports allow internal communication to follow a different flow than the external. However, the internal port number doesn't need to be part of the WorkerInfos equality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that we do need internal ports for the servers to communicate with each other. The internal ports allow internal communication to follow a different flow than the external. However, the internal port number doesn't need to be part of the WorkerInfos equality.

Understand, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the necessary changes. Please take a look.

waitinfuture pushed a commit that referenced this pull request Feb 24, 2024
…leborn Master

### What changes were proposed in this pull request?
#2292 (comment)
Based on the above discussion, removing the additional secured port. The existing port will be used for secured communication when auth is enabled.

### Why are the changes needed?
These changes are for enabling authentication

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
This removed additional secured port.

Closes #2327 from otterc/CELEBORN-1257.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
@otterc otterc changed the title [CELEBORN-1256] Added internal and secured ports to Celeborn worker. [CELEBORN-1256] Added internal port and auth support to Celeborn worker. Feb 27, 2024
@otterc
Copy link
Contributor Author

otterc commented Feb 27, 2024

@waitinfuture @mridulm @pan3793 @RexXiong I have updated the PR. Please take a look.

Copy link
Contributor

@waitinfuture waitinfuture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! cc @RexXiong PTAL

@waitinfuture
Copy link
Contributor

Thanks! Merged to main(v0.5.0)

gotikkoxq added a commit to gotikkoxq/celeborn that referenced this pull request Aug 26, 2024
…leborn Master

### What changes were proposed in this pull request?
apache/celeborn#2292 (comment)
Based on the above discussion, removing the additional secured port. The existing port will be used for secured communication when auth is enabled.

### Why are the changes needed?
These changes are for enabling authentication

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
This removed additional secured port.

Closes #2327 from otterc/CELEBORN-1257.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants