
Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" #27308

Merged
merged 5 commits from revert-27242-revert-agent-info into master on Aug 5, 2022

Conversation

@Catch-Bull (Contributor) commented Jul 30, 2022

Reverts #27242

fix flaky test: test_ray_shutdown.py

After the raylet dies, the CoreWorker has two ways to exit:

  1. If construction of the C++ class CoreWorker has finished, it exits via its monitor thread.
  2. If construction of the C++ class CoreWorker has not finished yet, it raises an exception when trying to connect to the GCS.

In the worst case of (2), check_version_info times out, then the worker tries to publish the error to the GCS, which also times out. But the publish timeout is 60 s while the default timeout of wait_for_condition is 10 s, so the test case fails.
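The race can be sketched as follows (a minimal illustration, not Ray's actual implementation; `wait_for_condition` here mimics the polling helper from `ray._private.test_utils`, and the timeout constants stand in for the values described above):

```python
import time


def wait_for_condition(condition, timeout=10, retry_interval_ms=100):
    """Mimics ray._private.test_utils.wait_for_condition: poll until the
    condition holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(retry_interval_ms / 1000)
    raise RuntimeError("The condition wasn't met before the timeout expired.")


# In the worst case of scenario 2, the worker is stuck publishing an error
# to the unreachable GCS for up to the 60 s publish timeout, so it cannot
# exit inside the 10 s window the test polls for.
PUBLISH_TIMEOUT_S = 60  # worker-side publish timeout
TEST_TIMEOUT_S = 10     # wait_for_condition default
assert PUBLISH_TIMEOUT_S > TEST_TIMEOUT_S  # hence the flaky failure
```

So the test is not observing a hang; it simply gives up 50 seconds before the stuck worker would have.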

@Catch-Bull (Contributor, Author) commented Jul 30, 2022

@rkooo567 I've been testing on my PC for over an hour and it looks fine. I'll retry the UT multiple times to make sure it's no longer flaky.

@Catch-Bull (Contributor, Author) commented

@rkooo567 Hi, I retried the UT multiple times, and it works!
(screenshots of passing test runs)
The one failure was caused by a pre-existing flaky test case:
(screenshot of the unrelated failure)

@@ -1902,6 +1902,11 @@ def connect(
         if mode == SCRIPT_MODE:
             raise e
         elif mode == WORKER_MODE:
+            if isinstance(e, grpc.RpcError) and e.code() in (

@rkooo567 (Contributor) left a comment
Please do not merge it yet. I will look at it soon.

+                grpc.StatusCode.UNAVAILABLE,
+                grpc.StatusCode.UNKNOWN,
+            ):
+                raise e
@Catch-Bull (Contributor, Author) replied:

If node.check_version_info raises a grpc.RpcError, it means the GCS is unreachable, so ray._private.utils.publish_error_to_driver will fail too. Raising the exception early avoids waiting for the publish to time out, which would cost 60 s for nothing.
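The early-raise idea can be sketched like this (a simplified stand-in for the diff above: `RpcError` and the status-code strings mimic grpc's types so the sketch is self-contained, and `connect` condenses the real worker connect path):

```python
# Stand-ins for grpc.StatusCode values; the real code checks grpc.RpcError.
UNAVAILABLE, UNKNOWN = "UNAVAILABLE", "UNKNOWN"


class RpcError(Exception):
    def __init__(self, code):
        self._code = code

    def code(self):
        return self._code


def connect(check_version_info, publish_error_to_driver):
    """If the GCS is unreachable (RpcError with UNAVAILABLE/UNKNOWN),
    re-raise immediately rather than attempting a publish that would
    itself block for the 60 s publish timeout."""
    try:
        check_version_info()
    except RpcError as e:
        if e.code() in (UNAVAILABLE, UNKNOWN):
            raise  # GCS unreachable: the publish is doomed, skip it
        publish_error_to_driver(e)  # GCS reachable: report as before
        raise
```

The design point is that the publish path is only worth taking when the failure is something other than the GCS itself being gone.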

@@ -295,6 +293,8 @@ py_test_module_list(
"test_placement_group_mini_integration.py",
"test_scheduling_2.py",
"test_multi_node_3.py",
"test_multiprocessing.py",
"test_placement_group_2.py",
@Catch-Bull (Contributor, Author) commented on the diff:

The current PR makes cluster.add_node take 1 second longer than before. There are nearly 30 cluster.add_node calls in test_placement_group_2.py, and it took about 240 s before this PR, so I think it is reasonable to increase the time limit of test_placement_group_2 to avoid flaky tests.

@rkooo567 (Contributor) replied:

> cluster.add_node takes 1 second more than before

This actually sounds pretty bad... (I feel like it will increase the runtime of the tests too much.)

Why is this the case?

@Catch-Bull Catch-Bull force-pushed the revert-27242-revert-agent-info branch from 19532a2 to 9f3ebf5 Compare August 2, 2022 03:54
@Catch-Bull (Contributor, Author) commented

@rkooo567 Hi, could you review this PR soon?

@rkooo567 (Contributor) left a comment

Mostly LGTM, and the fix makes sense. After we resolve the things below, we can merge!

  1. I understand the fix for the issue, but why does it happen after this PR? The PR seems unrelated to me.
  2. There are 519 add_node calls in total (and I guess if we include for loops, it could easily exceed 1000), meaning the test runtime will increase by 10~20 minutes. Maybe this much overhead is acceptable because our test runtime is pretty long, but I would still like to understand why add_node takes 1 more second.

src/ray/protobuf/common.proto (outdated comment thread, resolved)

@rkooo567 rkooo567 self-assigned this Aug 2, 2022
@Catch-Bull (Contributor, Author) commented Aug 3, 2022

> Mostly LGTM, and the fix makes sense. After we resolve the things below, we can merge!
>
>   1. I understand the fix for the issue, but why does it happen after this PR? The PR seems unrelated to me.
>   2. There are 519 add_node calls in total (and I guess if we include for loops, it could easily exceed 1000), meaning the test runtime will increase by 10~20 minutes. Maybe this much overhead is acceptable because our test runtime is pretty long, but I would still like to understand why add_node takes 1 more second.

@rkooo567 Hi:

  1. The driver just needs to call notify_raylet to ensure the raylet is ready (this does not require the raylet to have finished registering with the GCS), but the scheduling of normal tasks needs the raylet to have finished registering.
    • In general: the time gap between the driver starting and the worker process starting (i.e., the raylet finishing registration) is 1 second longer than before. The time to kill the driver is almost the same, so when the raylet/GCS is killed, the worker process has a greater chance of not having completed check_version_info.
  2. The reason add_node takes 1 more second is that the raylet now needs to wait for the agent to finish registering before registering itself with the GCS; before this PR, the raylet registering itself with the GCS and the agent registering itself with the raylet were asynchronous.

PS: I find add_node(wait=False) is useless, because class Node waits for the raylet to finish registering regardless; no argument can skip that. Details:

ray._private.services.wait_for_node(
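The ordering change in point 2 can be sketched with a toy timeline (hypothetical durations; in reality agent startup dominates at roughly 1 s, per the logs later in this thread):

```python
import time
from concurrent.futures import ThreadPoolExecutor

AGENT_STARTUP_S = 0.2  # stand-in for the ~1 s agent process startup
GCS_REGISTER_S = 0.05  # stand-in for the raylet -> GCS registration RPC


def start_agent():
    time.sleep(AGENT_STARTUP_S)  # agent process init + register to raylet


def register_to_gcs():
    time.sleep(GCS_REGISTER_S)  # raylet registers itself with the GCS


def node_ready_before_pr():
    # Old behavior: agent registration and GCS registration run concurrently,
    # so node startup takes roughly max(agent, gcs).
    with ThreadPoolExecutor() as pool:
        a = pool.submit(start_agent)
        g = pool.submit(register_to_gcs)
        a.result()
        g.result()


def node_ready_after_pr():
    # New behavior: the raylet waits for the agent before registering,
    # so node startup takes roughly agent + gcs.
    start_agent()
    register_to_gcs()


for fn in (node_ready_before_pr, node_ready_after_pr):
    t = time.monotonic()
    fn()
    print(f"{fn.__name__}: {time.monotonic() - t:.2f}s")
```

With the real ~1 s agent startup, serializing the two steps shifts that whole second onto the add_node critical path.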

@Catch-Bull (Contributor, Author) commented

@rkooo567 About why I think the raylet registering itself with the GCS takes 1 more second: without this PR, the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.

@rkooo567 (Contributor) commented Aug 3, 2022

> the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.

Hmm, this seems too long? Can you actually measure it (maybe log how long it takes to register the agent)?

…deInfo (#26302)" (#27242)"

This reverts commit ec69fec.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
@Catch-Bull Catch-Bull force-pushed the revert-27242-revert-agent-info branch from ab3043c to a1bd774 Compare August 4, 2022 03:36
@Catch-Bull (Contributor, Author) commented Aug 4, 2022

> > the time gap between the raylet sending the RegisterRequest and HandleAgentRegister is one second.
>
> Hmm, this seems too long? Can you actually measure it (maybe log how long it takes to register the agent)?

@rkooo567 In the current branch:
[raylet][14:10:15,059] Raylet launch the agent process.
[raylet][14:10:15,060] the first time to call Raylet::RegisterGcs().
[agent][14:10:15,843] DashboardAgent.__init__.
[agent][14:10:16,160] agent send register request to raylet.
[raylet][14:10:16,167] agent finished registering.
[raylet][14:10:16,167] send register request to GCS.

The main time overhead is in the initialization of the agent process.
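For reference, the gaps in that log can be computed directly (a small helper over the timestamps transcribed above; it confirms the agent process initialization dominates):

```python
from datetime import datetime

# Timestamps transcribed from the log above (same day, HH:MM:SS,mmm).
events = [
    ("raylet launches agent process",        "14:10:15,059"),
    ("first Raylet::RegisterGcs() call",     "14:10:15,060"),
    ("DashboardAgent.__init__",              "14:10:15,843"),
    ("agent sends register request",         "14:10:16,160"),
    ("agent finished registering",           "14:10:16,167"),
    ("raylet sends register request to GCS", "14:10:16,167"),
]


def parse(ts):
    return datetime.strptime(ts, "%H:%M:%S,%f")


# Gap from each event to the next, keyed by the later event's name.
gaps = [
    (name_b, (parse(b) - parse(a)).total_seconds())
    for (_, a), (name_b, b) in zip(events, events[1:])
]
for name, gap in gaps:
    print(f"{gap:.3f}s until {name}")

total = sum(g for _, g in gaps)
print(f"total: {total:.3f}s")  # ~1.1 s end to end
```

The largest single gap (about 0.78 s) is the one ending at DashboardAgent.__init__, i.e. agent process startup, matching the conclusion above.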

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
@rkooo567 (Contributor) commented Aug 4, 2022

I see. So most of the overhead is from the process startup.

@rkooo567 (Contributor) commented Aug 4, 2022

I will merge this PR when it passes all tests.

@fishbone (Contributor) commented Aug 4, 2022

LGTM!

@SongGuyang (Contributor) commented

@Catch-Bull @rkooo567 Ready to merge?

@SongGuyang (Contributor) commented

The failed tune tests are flaky in master.

@SongGuyang SongGuyang added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 5, 2022
@SongGuyang SongGuyang merged commit ccf4116 into master Aug 5, 2022
@SongGuyang SongGuyang deleted the revert-27242-revert-agent-info branch August 5, 2022 08:32
rkooo567 added a commit to rkooo567/ray that referenced this pull request Aug 7, 2022
rkooo567 added a commit that referenced this pull request Aug 8, 2022
Catch-Bull added a commit that referenced this pull request Aug 9, 2022
…entInfo to GCSNodeInfo (…" (#27308)" (#27613)"

This reverts commit 6084ee5.

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
scottsun94 pushed a commit to scottsun94/ray that referenced this pull request Aug 9, 2022
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613)

This reverts commit ccf4116.

Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…deInfo (…" (ray-project#27308)

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613)

This reverts commit ccf4116.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
4 participants