[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo #26302

Merged

Conversation

Catch-Bull
Contributor

@Catch-Bull Catch-Bull commented Jul 5, 2022

Why are these changes needed?

This is the first PR of #25963:

  1. Moved the agent information from `internal KV` to `GCSNodeInfo`.
  2. The raylet registers itself after the agent process finishes registering.

Motivation:
Storing agent information in `internal KV` and registering nodes in GCS (writing node information to `GCSNodeInfo`) are two asynchronous operations, which introduces complex timing problems, especially after raylet failover.
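
Under the new layout, whoever reads a node's entry from the GCS node table sees that node's agent in the same record. A minimal sketch of the read path this enables (the AgentInfo field names such as grpc_port, the accessor names, and the include paths are assumptions for illustration, not the exact merged schema):

#include "ray/util/logging.h"
#include "src/ray/protobuf/gcs.pb.h"

// Log the agent that belongs to a node, straight from the node table entry:
// no second lookup in internal KV, so no window where the two disagree.
void LogAgentOfNode(const ray::rpc::GcsNodeInfo &node) {
  const ray::rpc::AgentInfo &agent = node.agent_info();
  RAY_LOG(INFO) << "Node at " << node.node_manager_address()
                << " reports its agent on gRPC port " << agent.grpc_port();
}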

Related issue number

#25963

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Catch-Bull Catch-Bull marked this pull request as draft July 5, 2022 17:12
@Catch-Bull Catch-Bull changed the title Add AgentInfo to GCSNodeInfo [Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo Jul 6, 2022
@@ -61,7 +61,8 @@ py_test_module_list(
     "test_healthcheck.py",
     "test_kill_raylet_signal_log.py",
     "test_memstat.py",
-    "test_protobuf_compatibility.py"
+    "test_protobuf_compatibility.py",
+    "test_scheduling_performance.py"
Contributor Author

After this PR, a node needs to wait for the agent to register before registering itself with the GCS, so cluster.add_node becomes nearly one second slower. That adds roughly 16 * 2 = 32 seconds to this test's total runtime, so the test size is set to medium.

@Catch-Bull Catch-Bull marked this pull request as ready for review July 6, 2022 11:28
@Catch-Bull Catch-Bull requested a review from architkulkarni July 6, 2022 11:46
Comment on lines 78 to 82
/*set_agent_info_and_register_node*/
[this](const AgentInfo &agent_info) {
  // Copy the registered agent's info into this node's entry, then register with the GCS.
  self_node_info_.mutable_agent_info()->CopyFrom(agent_info);
  RAY_CHECK_OK(RegisterGcs());
}),
Contributor

This looks bad. Why do it this way? We should try to decouple node_manager_ from the Raylet.

To be clear, I'm not comfortable with the circular dependency between NodeManager and Raylet here.

Contributor

How about moving the logic of RegisterGcs from Raylet to NodeManager? Maybe registering the node should be part of NodeManager?

@fishbone
Contributor

fishbone commented Jul 6, 2022

What if the agent fails? Will the address be updated in that case? (I hope we won't fail the raylet in this case.)

Contributor

@fishbone fishbone left a comment

request changes

@fishbone fishbone self-assigned this Jul 6, 2022
@Catch-Bull
Contributor Author

Catch-Bull commented Jul 7, 2022

What if the agent fails? Will the address be updated in that case? (I hope we won't fail the raylet in this case.)

For now, the raylet fate-shares with the agent process.
Here are the details: https://github.com/ray-project/ray/blob/master/src/ray/raylet/agent_manager.cc#L117

Based on this, I think merging the agent and node registration into one RPC request makes sense; it avoids dealing with consistency problems after raylet failover.
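
A rough illustration of that fate-sharing in plain POSIX C++ (the names here are hypothetical; the real logic lives behind the agent_manager.cc link above): because the raylet exits whenever its agent exits, a registered node never outlives its agent, which is what makes carrying the agent info inside the node registration safe.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>

// Block until the agent child process exits, then take the raylet down with it.
void MonitorAgentAndFateShare(pid_t agent_pid) {
  int status = 0;
  waitpid(agent_pid, &status, 0);
  std::cerr << "Agent process " << agent_pid << " exited (status " << status
            << "); exiting the raylet as well so the two processes fate-share."
            << std::endl;
  std::exit(1);
}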

@Catch-Bull Catch-Bull force-pushed the add_agent_address_to_node_table branch 2 times, most recently from 2db6db7 to 1034cca Compare July 12, 2022 09:39
@@ -96,7 +96,11 @@ Raylet::Raylet(instrumented_io_context &main_service,
 Raylet::~Raylet() {}
 
 void Raylet::Start() {
-  RAY_CHECK_OK(RegisterGcs());
+  register_thread_ = std::thread([this]() mutable {
Contributor

Why do we need a new thread for this?

Contributor Author

@Catch-Bull Catch-Bull Jul 15, 2022

Because Raylet::Start and AgentManager::HandleRegisterAgent run on the same thread, and Raylet::Start would block that thread waiting for AgentManager::HandleRegisterAgent to finish, it would be a deadlock.
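
A minimal, self-contained sketch of that deadlock (an assumed simplification using plain Boost.Asio rather than Ray's instrumented_io_context): the io_context runs on a single thread, so a handler that blocks waiting for a value starves the very handler that would produce it.

#include <boost/asio.hpp>
#include <future>
#include <iostream>

int main() {
  boost::asio::io_context main_service;
  std::promise<int> agent_registered;

  // Analogue of Raylet::Start(): blocks the io_context thread on the future.
  boost::asio::post(main_service, [&] {
    std::cout << "Start: waiting for agent registration..." << std::endl;
    int port = agent_registered.get_future().get();  // never returns
    std::cout << "unreachable: " << port << std::endl;
  });

  // Analogue of AgentManager::HandleRegisterAgent(): would fulfill the promise,
  // but it is queued behind the blocked handler on the same thread.
  boost::asio::post(main_service, [&] { agent_registered.set_value(12345); });

  // Single-threaded run(): this program hangs, demonstrating the deadlock.
  main_service.run();
}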

Contributor

Shouldn't a post from the io_context just work? I feel we don't need a thread here.

Contributor Author

Raylet::Start already runs in main_service_, so posting this wait function to main_service_ wouldn't help.
Detail:

raylet->Start();

Contributor

How about this:

std::function<void()> register_node;  // "register" itself is a reserved keyword
register_node = [&]() {
    if ready: set agent info; RegisterGcs();
    else: sleep 1s; io_context.post(register_node);
};

register_node();

The reason I don't like a thread here is that it increases complexity, especially when we don't have a good threading model. For example, it's easy to run into race conditions (self_node_info_ might be accessed from different threads, and RegisterGcs would now run on a different thread).
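
A runnable version of that suggestion might look like the following (ScheduleRegister, agent_info_ready, and register_node are hypothetical names, and the 1-second retry interval is just the value from the sketch above): everything stays on the raylet's own io_context, so there is no extra thread and no cross-thread access to self_node_info_.

#include <boost/asio.hpp>
#include <chrono>
#include <functional>
#include <memory>

// Keep re-posting the registration step onto the io_context until the agent
// info is available; never block the thread that runs the io_context.
void ScheduleRegister(boost::asio::io_context &io_context,
                      std::function<bool()> agent_info_ready,
                      std::function<void()> register_node) {
  if (agent_info_ready()) {
    register_node();  // copy the agent info into self_node_info_, then RegisterGcs()
    return;
  }
  auto timer = std::make_shared<boost::asio::steady_timer>(io_context,
                                                           std::chrono::seconds(1));
  timer->async_wait([&io_context, timer, agent_info_ready, register_node](
                        const boost::system::error_code &) {
    ScheduleRegister(io_context, agent_info_ready, register_node);
  });
}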

Contributor Author

@iycheng I've modified it again here, please take another look.

-  RAY_CHECK_OK(RegisterGcs());
+  register_thread_ = std::thread([this]() mutable {
+    SetThreadName("raylet.register_itself");
+    self_node_info_.mutable_agent_info()->CopyFrom(node_manager_.SyncGetAgentInfo());
Contributor

Here, please add a timeout for the sync call. If it times out, we should just fail the raylet (otherwise it'll hang) and also print a detailed log about what happened, for observability.

Contributor Author

@Catch-Bull Catch-Bull Jul 15, 2022

The agent registration timeout is already implemented here:

auto timer = delay_executor_(

Maybe I should replace time_ with agent_info_promise_.

Contributor

I think future.get with a timeout should be good here.
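
A minimal sketch of that, assuming the agent info arrives through a std::future (the 60-second value and the function name are made up here, and the real code would use RAY_LOG(FATAL) rather than std::cerr/std::exit):

#include <chrono>
#include <cstdlib>
#include <future>
#include <iostream>

// Wait for the agent info with a deadline; fail the raylet loudly on timeout
// instead of hanging forever.
template <typename AgentInfoT>
AgentInfoT WaitForAgentInfoOrDie(std::future<AgentInfoT> agent_info_future) {
  constexpr auto kAgentRegisterTimeout = std::chrono::seconds(60);
  if (agent_info_future.wait_for(kAgentRegisterTimeout) != std::future_status::ready) {
    std::cerr << "The agent did not register within 60 seconds; failing the raylet "
                 "so the problem is visible in the logs." << std::endl;
    std::exit(1);
  }
  return agent_info_future.get();
}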

@Catch-Bull Catch-Bull force-pushed the add_agent_address_to_node_table branch 2 times, most recently from 6f79f43 to 08e7bb0 Compare July 18, 2022 16:46
@rkooo567
Contributor

[Screenshot: test failure captured July 27, 2022, 7:00 AM]

Maybe it is related? I haven't seen any failures from this test at https://flakey-tests.ray.io/?owner=core

@SongGuyang
Contributor

A lot of tests timed out, but the previous pipeline was green. Do we need to retry the tests?

@rkooo567
Contributor

Some of the tests are unstable on master. I think test_ray_shutdown is the only one I'm really concerned about.

@SongGuyang
Contributor

@Catch-Bull Could you take a look?

Do we need to merge this before the 2.0 cut?

@Catch-Bull
Contributor Author

Catch-Bull commented Jul 27, 2022

@rkooo567 @SongGuyang this case doesn't seem to have timed out before; I will check it.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
@SongGuyang SongGuyang merged commit 14dee5f into ray-project:master Jul 28, 2022
@kfstorm kfstorm deleted the add_agent_address_to_node_table branch July 28, 2022 18:46
rkooo567 added a commit to rkooo567/ray that referenced this pull request Jul 29, 2022
rkooo567 added a commit that referenced this pull request Jul 30, 2022
Catch-Bull added a commit that referenced this pull request Jul 30, 2022
…deInfo (#26302)" (#27242)"

This reverts commit ec69fec.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Catch-Bull added a commit that referenced this pull request Aug 2, 2022
…deInfo (#26302)" (#27242)"

This reverts commit ec69fec.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Catch-Bull added a commit that referenced this pull request Aug 4, 2022
…deInfo (#26302)" (#27242)"

This reverts commit ec69fec.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ect#26302)

This is the first PR of ray-project#25963:
1. Moved the agent information from `internal KV` to `GCSNodeInfo`.
2. The raylet registers itself after the agent process finishes registering.

Motivation:
Storing agent information in `internal KV` and registering nodes in GCS (writing node information to `GCSNodeInfo`) are two asynchronous operations, which introduces complex timing problems, especially after `raylet` failover.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ay-project#26302)" (ray-project#27242)

This reverts commit 14dee5f.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>