
Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" #27308

Merged
merged 5 commits from revert-27242-revert-agent-info into master on Aug 5, 2022

Conversation

@Catch-Bull (Contributor) commented Jul 30, 2022

Reverts #27242

fix flaky test: test_ray_shutdown.py

After the raylet dies, the CoreWorker has two ways to exit:

  1. If construction of the C++ class CoreWorker has finished, it exits via its monitor thread.
  2. If construction of the C++ class CoreWorker has not finished yet, it raises an exception when trying to connect to the GCS.

In the worst case of (2), check_version_info times out, then the worker tries to publish the error to the GCS, which also times out. But the publish timeout is 60 s while the default timeout of wait_for_condition is 10 s, so the test case fails.
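The race can be sketched as follows (a minimal illustration, not Ray's actual implementation; `wait_for_condition` here mimics the polling helper from `ray._private.test_utils`, and the timeout constants stand in for the values described above):

```python
import time


def wait_for_condition(condition, timeout=10, retry_interval_ms=100):
    """Mimics ray._private.test_utils.wait_for_condition: poll until the
    condition holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(retry_interval_ms / 1000)
    raise RuntimeError("The condition wasn't met before the timeout expired.")


# In the worst case of scenario 2, the worker is stuck publishing an error
# to the unreachable GCS for up to the 60 s publish timeout, so it cannot
# exit inside the 10 s window the test polls for.
PUBLISH_TIMEOUT_S = 60  # worker-side publish timeout
TEST_TIMEOUT_S = 10     # wait_for_condition default
assert PUBLISH_TIMEOUT_S > TEST_TIMEOUT_S  # hence the flaky failure
```

So the test is not observing a hang; it simply gives up 50 seconds before the stuck worker would have.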

@Catch-Bull (Contributor, Author) commented Jul 30, 2022

@rkooo567 I've been testing on my PC for over an hour and it looks fine. I'll retry the UT multiple times to make sure it's no longer flaky.

@Catch-Bull (Contributor, Author) commented

@rkooo567 Hi, I retried the UT multiple times, and it works!
(screenshots of passing test runs)
The one failure was caused by a pre-existing flaky test case:
(screenshot of the unrelated failure)

@@ -1902,6 +1902,11 @@ def connect(
         if mode == SCRIPT_MODE:
             raise e
         elif mode == WORKER_MODE:
+            if isinstance(e, grpc.RpcError) and e.code() in (

@rkooo567 (Contributor) left a comment
Please do not merge it yet. I will look at it soon.

+                grpc.StatusCode.UNAVAILABLE,
+                grpc.StatusCode.UNKNOWN,
+            ):
+                raise e
@Catch-Bull (Contributor, Author) replied:

If node.check_version_info raises a grpc.RpcError, it means the GCS is unreachable, so ray._private.utils.publish_error_to_driver will fail too. Raising the exception early avoids waiting for the publish to time out, which would cost 60 s for nothing.
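The early-raise idea can be sketched like this (a simplified stand-in for the diff above: `RpcError` and the status-code strings mimic grpc's types so the sketch is self-contained, and `connect` condenses the real worker connect path):

```python
# Stand-ins for grpc.StatusCode values; the real code checks grpc.RpcError.
UNAVAILABLE, UNKNOWN = "UNAVAILABLE", "UNKNOWN"


class RpcError(Exception):
    def __init__(self, code):
        self._code = code

    def code(self):
        return self._code


def connect(check_version_info, publish_error_to_driver):
    """If the GCS is unreachable (RpcError with UNAVAILABLE/UNKNOWN),
    re-raise immediately rather than attempting a publish that would
    itself block for the 60 s publish timeout."""
    try:
        check_version_info()
    except RpcError as e:
        if e.code() in (UNAVAILABLE, UNKNOWN):
            raise  # GCS unreachable: the publish is doomed, skip it
        publish_error_to_driver(e)  # GCS reachable: report as before
        raise
```

The design point is that the publish path is only worth taking when the failure is something other than the GCS itself being gone.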

@@ -295,6 +293,8 @@ py_test_module_list(
"test_placement_group_mini_integration.py",
"test_scheduling_2.py",
"test_multi_node_3.py",
"test_multiprocessing.py",
"test_placement_group_2.py",
@Catch-Bull (Contributor, Author) commented on the diff:

The current PR makes cluster.add_node take 1 second longer than before. There are nearly 30 cluster.add_node calls in test_placement_group_2.py, and it took about 240 s before this PR, so I think it is reasonable to increase the time limit of test_placement_group_2 to avoid flaky tests.

@rkooo567 (Contributor) replied:

> cluster.add_node takes 1 second more than before

This actually sounds pretty bad... (I feel like it will increase the runtime of the tests too much.)

Why is this the case?

@Catch-Bull Catch-Bull force-pushed the revert-27242-revert-agent-info branch from 19532a2 to 9f3ebf5 Compare August 2, 2022 03:54
@Catch-Bull (Contributor, Author) commented

@rkooo567 Hi, could you review this PR soon?

@rkooo567 (Contributor) left a comment

Mostly LGTM, and the fix makes sense. After we resolve the things below, we can merge!

  1. I understand the fix for the issue, but why does it happen after this PR? The PR seems unrelated to me.
  2. There are 519 add_node calls in total (and I guess if we include for loops, it could easily exceed 1000), meaning the test runtime will increase by 10~20 minutes. Maybe this much overhead is acceptable because our test runtime is pretty long, but I would still like to understand why add_node takes 1 more second.

src/ray/protobuf/common.proto (outdated comment thread, resolved)

@rkooo567 rkooo567 self-assigned this Aug 2, 2022
@Catch-Bull (Contributor, Author) commented Aug 3, 2022

> Mostly LGTM, and the fix makes sense. After we resolve the things below, we can merge!
>
>   1. I understand the fix for the issue, but why does it happen after this PR? The PR seems unrelated to me.
>   2. There are 519 add_node calls in total (and I guess if we include for loops, it could easily exceed 1000), meaning the test runtime will increase by 10~20 minutes. Maybe this much overhead is acceptable because our test runtime is pretty long, but I would still like to understand why add_node takes 1 more second.

@rkooo567 Hi:

  1. The driver just needs to call notify_raylet to ensure the raylet is ready (this does not require the raylet to have finished registering with the GCS), but the scheduling of normal tasks needs the raylet to have finished registering.
    • In general: the time gap between the driver starting and the worker process starting (i.e., the raylet finishing registration) is 1 second longer than before. The time to kill the driver is almost the same, so when the raylet/GCS is killed, the worker process has a greater chance of not having completed check_version_info.
  2. The reason add_node takes 1 more second is that the raylet now needs to wait for the agent to finish registering before registering itself with the GCS; before this PR, the raylet registering itself with the GCS and the agent registering itself with the raylet were asynchronous.

PS: I find add_node(wait=False) is useless, because class Node waits for the raylet to finish registering regardless; no argument can skip that. Details:

ray._private.services.wait_for_node(
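The ordering change in point 2 can be sketched with a toy timeline (hypothetical durations; in reality agent startup dominates at roughly 1 s, per the logs later in this thread):

```python
import time
from concurrent.futures import ThreadPoolExecutor

AGENT_STARTUP_S = 0.2  # stand-in for the ~1 s agent process startup
GCS_REGISTER_S = 0.05  # stand-in for the raylet -> GCS registration RPC


def start_agent():
    time.sleep(AGENT_STARTUP_S)  # agent process init + register to raylet


def register_to_gcs():
    time.sleep(GCS_REGISTER_S)  # raylet registers itself with the GCS


def node_ready_before_pr():
    # Old behavior: agent registration and GCS registration run concurrently,
    # so node startup takes roughly max(agent, gcs).
    with ThreadPoolExecutor() as pool:
        a = pool.submit(start_agent)
        g = pool.submit(register_to_gcs)
        a.result()
        g.result()


def node_ready_after_pr():
    # New behavior: the raylet waits for the agent before registering,
    # so node startup takes roughly agent + gcs.
    start_agent()
    register_to_gcs()


for fn in (node_ready_before_pr, node_ready_after_pr):
    t = time.monotonic()
    fn()
    print(f"{fn.__name__}: {time.monotonic() - t:.2f}s")
```

With the real ~1 s agent startup, serializing the two steps shifts that whole second onto the add_node critical path.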

@Catch-Bull (Contributor, Author) commented

@rkooo567 About why I think the raylet registering itself with the GCS takes 1 more second: without this PR, the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.

@rkooo567 (Contributor) commented Aug 3, 2022

> the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.

Hmm, this seems too long? Can you actually measure it (maybe log how long it takes to register the agent)?

…deInfo (#26302)" (#27242)"

This reverts commit ec69fec.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
@Catch-Bull Catch-Bull force-pushed the revert-27242-revert-agent-info branch from ab3043c to a1bd774 Compare August 4, 2022 03:36
@Catch-Bull (Contributor, Author) commented Aug 4, 2022

> > the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.
>
> Hmm, this seems too long? Can you actually measure it (maybe log how long it takes to register the agent)?

@rkooo567 In the current branch:
[raylet][14:10:15,059] Raylet launch the agent process.
[raylet][14:10:15,060] the first time to call Raylet::RegisterGcs().
[agent][14:10:15,843] DashboardAgent.__init__.
[agent][14:10:16,160] agent send register request to raylet.
[raylet][14:10:16,167] agent finished registering.
[raylet][14:10:16,167] send register request to GCS.

The main time overhead is in the initialization of the agent process.
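For reference, the gaps in that log can be computed directly (a small helper over the timestamps transcribed above; it confirms the agent process initialization dominates):

```python
from datetime import datetime

# Timestamps transcribed from the log above (same day, HH:MM:SS,mmm).
events = [
    ("raylet launches agent process",        "14:10:15,059"),
    ("first Raylet::RegisterGcs() call",     "14:10:15,060"),
    ("DashboardAgent.__init__",              "14:10:15,843"),
    ("agent sends register request",         "14:10:16,160"),
    ("agent finished registering",           "14:10:16,167"),
    ("raylet sends register request to GCS", "14:10:16,167"),
]


def parse(ts):
    return datetime.strptime(ts, "%H:%M:%S,%f")


# Gap from each event to the next, keyed by the later event's name.
gaps = [
    (name_b, (parse(b) - parse(a)).total_seconds())
    for (_, a), (name_b, b) in zip(events, events[1:])
]
for name, gap in gaps:
    print(f"{gap:.3f}s until {name}")

total = sum(g for _, g in gaps)
print(f"total: {total:.3f}s")  # ~1.1 s end to end
```

The largest single gap (about 0.78 s) is the one ending at DashboardAgent.__init__, i.e. agent process startup, matching the conclusion above.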

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
@rkooo567 (Contributor) commented Aug 4, 2022

I see. So most of the overhead is from the process startup.

@rkooo567 (Contributor) commented Aug 4, 2022

I will merge this PR when it passes all tests.

@fishbone (Contributor) commented Aug 4, 2022

LGTM!

@SongGuyang (Contributor) commented

@Catch-Bull @rkooo567 Ready to merge?

@SongGuyang (Contributor) commented

The failed tune tests are flaky in master.

@SongGuyang SongGuyang added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 5, 2022
@SongGuyang SongGuyang merged commit ccf4116 into master Aug 5, 2022
@SongGuyang SongGuyang deleted the revert-27242-revert-agent-info branch August 5, 2022 08:32
rkooo567 added a commit to rkooo567/ray that referenced this pull request Aug 7, 2022
rkooo567 added a commit that referenced this pull request Aug 8, 2022
Catch-Bull added a commit that referenced this pull request Aug 9, 2022
…entInfo to GCSNodeInfo (…" (#27308)" (#27613)"

This reverts commit 6084ee5.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
scottsun94 pushed a commit to scottsun94/ray that referenced this pull request Aug 9, 2022
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613)

This reverts commit ccf4116.

Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…deInfo (…" (ray-project#27308)

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613)

This reverts commit ccf4116.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
4 participants