-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-1256] Added internal port and auth support to Celeborn worker. #2292
Conversation
@RexXiong @FMX @pan3793 @mridulm @akpatnam25 @rmcyang @waitinfuture |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2292 +/- ##
==========================================
- Coverage 48.83% 48.69% -0.14%
==========================================
Files 207 207
Lines 12939 12965 +26
Branches 1111 1113 +2
==========================================
- Hits 6318 6312 -6
- Misses 6218 6249 +31
- Partials 403 404 +1 ☔ View full report in Codecov by Sentry. |
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Show resolved
Hide resolved
We will need separate secured ports for push and fetch for allowing both authenticated and un-authenticated clients to be able to communicate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick pass through it.
Given the number of ports, I am wondering if we should encapsulate all port configs in a class in itself - and localize all port related validation and config there (including RpcEnv creation, etc)
@@ -59,6 +59,8 @@ public static String getFileName(String uniqueId, Mode mode) { | |||
private int pushPort; | |||
private int fetchPort; | |||
private int replicatePort; | |||
private int internalPort; | |||
private int securedPort; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a note on when which port is to be used/is valid ?
Either here, or as javadoc to the get*Port
methods below ?
@@ -86,19 +90,40 @@ private[celeborn] class Worker( | |||
conf, | |||
Math.min(64, Math.max(4, Runtime.getRuntime.availableProcessors()))) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to support for disabling the existing behavior via a config in a future pr ? (workerArgs.port
- which is not secure and is used for both s2s and c2s connections)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will be making that change in a future pr. I haven't created a jira yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created https://issues.apache.org/jira/browse/CELEBORN-1268 to address this.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Show resolved
Hide resolved
Can this be done in a separate change, as it will require refactoring of existing code? |
@RexXiong @FMX @pan3793 @mridulm @akpatnam25 @rmcyang @waitinfuture |
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
"Internal server port on the Worker where the master nodes connect.") | ||
.version("0.5.0") | ||
.intConf | ||
.createWithDefault(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these new ports, is 0 the correct default value? I noticed that other ports in this conf have rather specific default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the corresponding un-secured ports, the default is 0 as well. When it is 0, then the system picks an ephemeral port. For these ports, that is desired.
common/src/main/scala/org/apache/celeborn/common/meta/WorkerInfo.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/celeborn/common/protocol/PartitionLocation.java
Outdated
Show resolved
Hide resolved
d4ecffb
to
9a4b4bf
Compare
@@ -36,6 +36,10 @@ class WorkerInfo( | |||
val pushPort: Int, | |||
val fetchPort: Int, | |||
val replicatePort: Int, | |||
val internalPort: Int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to add these secure ports(It looks a little complicated beside insecure ports), IMO the celeborn service is either open securely or not. It seems strange to keep secure connection beside with insecure connection, if so it's still insecure for celeborn service?
For users who want to use security features, they can create a new Celeborn service to instead and finally migrate other tasks to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RexXiong Interesting, so you mean that we don't need to support seamless rolling upgrade from previously unsecured cluster to secured cluster, and one cluster can be either secured or unsecured, not both. Supporting mixed mode for rolling upgrade does introduces complexity. WDYT @otterc @mridulm @pan3793 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our last sync about authentication on September 23rd, we concluded, based on community feedback, that we would add separate ports for secured communication to allow for seamless rolling upgrades. In that meeting, we decided that it would be akin to supporting both HTTP and HTTPS by a web server. Here is the link to the summary: https://www.mail-archive.com/dev@celeborn.apache.org/msg00618.html.
However, IMO, allowing both secured and unsecured communication adds complexity and creates the possibility of issues. There is a way to achieve seamless rolling upgrades without adding new secured ports to Celeborn: by deploying another cluster of Celeborn that is secured (with different configured ports for Master/Worker) and leaving the existing unsecured cluster as it is. However, this creates more toil for the deployment/monitoring infrastructure.
We are not currently running Celeborn in production, and we will only run it in production with authentication enabled. Therefore, seamless rolling upgrades are not a problem for us, but we were addressing the community's concern.
If the community agrees that adding dedicated secured ports is not something we want to proceed with, I am willing to make those changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @otterc for the explanation. I agree with you that
allowing both secured and unsecured communication adds complexity and creates the possibility of issues
, and I believe @RexXiong have the same opinion. IMO whether to support rolling upgrade should depend on whether auth is enabled, as the following form describes:
Auth enabled before | Auth enabled after | Should support rolling upgrade |
---|---|---|
No | No | Yes |
No | Yes | No |
Yes | Yes | Yes |
Yes | No | No |
In summary, we should support rolling upgrade only if auth configuration stays the same before and after, and don't support otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waitinfuture @otterc, sorry I may have lost track of some aspects of the original design, but this really adds a lot of complexity, I agree with summary about we should support rolling upgrade only if auth configuration stays the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems there is more complexity than I first thought.
given the above discussion plus feedback from Celeborn users in other channels, most community members tend to downgrade the backward compatibility to "support rolling upgrade only if auth configuration stays the same", so I'm fine with this decision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, as @otterc mentioned, this will significantly reduce the complexity !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Will remove the additional secure ports. We have already committed a secure RPC port for Master for which I will create a separate PR to remove. Please note that we do need internal ports for the servers to communicate with each other. The internal ports allow internal communication to follow a different flow than the external. However, the internal port number doesn't need to be part of the WorkerInfos equality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that we do need internal ports for the servers to communicate with each other. The internal ports allow internal communication to follow a different flow than the external. However, the internal port number doesn't need to be part of the WorkerInfos equality.
Understand, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the necessary changes. Please take a look.
…leborn Master ### What changes were proposed in this pull request? #2292 (comment) Based on the above discussion, removing the additional secured port. The existing port will be used for secured communication when auth is enabled. ### Why are the changes needed? These changes are for enabling authentication ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This removed additional secured port. Closes #2327 from otterc/CELEBORN-1257. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
59f5bbc
to
097fa26
Compare
4b2d485
to
caf80fe
Compare
@waitinfuture @mridulm @pan3793 @RexXiong I have updated the PR. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! cc @RexXiong PTAL
Thanks! Merged to main(v0.5.0) |
…leborn Master ### What changes were proposed in this pull request? apache/celeborn#2292 (comment) Based on the above discussion, removing the additional secured port. The existing port will be used for secured communication when auth is enabled. ### Why are the changes needed? These changes are for enabling authentication ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This removed additional secured port. Closes #2327 from otterc/CELEBORN-1257. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
What changes were proposed in this pull request?
This adds an internal port and auth support to Celeborn Wokers.
This change targets just adding the port and auth support to Worker. The following items from the proposal are still pending:
Why are the changes needed?
It is needed for adding authentication support to Celeborn (CELEBORN-1011)
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Part of a bigger change. For this change, only modified existing UTs.