[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo #26302
Conversation
@@ -61,7 +61,8 @@ py_test_module_list(
     "test_healthcheck.py",
     "test_kill_raylet_signal_log.py",
     "test_memstat.py",
-    "test_protobuf_compatibility.py"
+    "test_protobuf_compatibility.py",
+    "test_scheduling_performance.py"
After this PR, the node needs to wait for the agent to register before registering itself to GCS, so cluster.add_node will be nearly one second slower, resulting in an increase of 16 * 2 = 32 seconds in total time. Therefore, set the test size to medium.
src/ray/raylet/raylet.cc
Outdated
/*set_agent_info_and_register_node*/
[this](const AgentInfo &agent_info) {
  self_node_info_.mutable_agent_info()->CopyFrom(agent_info);
  RAY_CHECK_OK(RegisterGcs());
}),
It looks bad. Why do it this way? We should try to decouple node_manager_ from the Raylet.
To be clear, I'm not comfortable that NodeManager <-> Raylet has a circular dependency here.
How about moving the logic of RegisterGcs from Raylet to NodeManager? Maybe registering the node should be a part of NodeManager?
What if the agent fails? Will the address be updated in that case? (I hope we won't fail the raylet in this case.)
request changes
For now, the raylet fate-shares with the agent process. Based on this, I think merging the agent and node registration into one RPC request makes sense; it avoids dealing with consistency problems after raylet failover.
src/ray/raylet/raylet.cc
Outdated
@@ -96,7 +96,11 @@ Raylet::Raylet(instrumented_io_context &main_service,
Raylet::~Raylet() {}

void Raylet::Start() {
  RAY_CHECK_OK(RegisterGcs());
  register_thread_ = std::thread([this]() mutable {
Why do we need a new thread for this?
Because Raylet::Start and AgentManager::HandleRegisterAgent run on the same thread, and Raylet::Start would block that thread waiting until AgentManager::HandleRegisterAgent finishes, it's a deadlock.
Should a post from the io context just work? I feel we don't need a thread here.
raylet::Start already runs in main_service_, so it's no use to post this wait function into main_service_.
Detail:
Line 283 in 1949f35
raylet->Start();
How about this:
std::function<void()> register;
register = [&]() {
if ready: set agent info; RegisterGCS;
else: sleep 1s; io_context.post(register);
};
register();
The reason I don't like a thread here is that it increases complexity, especially when we don't have a good threading model. For example, it's easy to run into race conditions (self_node_info_ might be accessed from different threads, and RegisterGcs now runs in a different thread).
@iycheng I've modified it again; please take another look.
src/ray/raylet/raylet.cc
Outdated
RAY_CHECK_OK(RegisterGcs());
register_thread_ = std::thread([this]() mutable {
  SetThreadName("raylet.register_itself");
  self_node_info_.mutable_agent_info()->CopyFrom(node_manager_.SyncGetAgentInfo());
Here, please add a timeout for the sync call. If it times out, we should just fail the raylet (otherwise it'll hang), and also print a detailed log about what happened, for observability.
The agent register timeout is implemented here:
ray/src/ray/raylet/agent_manager.cc
Line 95 in 1949f35
auto timer = delay_executor_(
Maybe I should replace time_ with agent_info_promise_?
I think future.get with a timeout should be good here.
Maybe it is related? I've not seen any failure from this test: https://flakey-tests.ray.io/?owner=core
A lot of tests timed out, but the previous pipeline is green. Do we need to retry the tests?
Some of the tests are unstable on master. I think test_ray_shutdown is the only one that I am really concerned about.
@Catch-Bull Take a look? Do we need to merge this before the cut of 2.0?
@rkooo567 @SongGuyang this case seems to never time out; I will check it.
Why are these changes needed?
This is the first PR of #25963:
1. Moved the agent information from `internal KV` to `GCSNodeInfo`.
2. The raylet registers itself after the agent process has finished registering.

Motivation:
Storing agent information in `internal KV` and registering nodes in GCS (writing node information to `GCSNodeInfo`) are two asynchronous operations, which brings some complex timing problems, especially after `raylet` failover.

Related issue number
#25963

Checks
- I've run `scripts/format.sh` to lint the changes in this PR.