Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" #27308
Conversation
Force-pushed from 54f16cb to 33610ca
@rkooo567 I've been testing on my PC for over an hour and it should be fine. I'll retry the UT multiple times to make sure it's no longer flaky.
@rkooo567 Hi, I retried the UT multiple times, and I think it works!
@@ -1902,6 +1902,11 @@ def connect(
    if mode == SCRIPT_MODE:
        raise e
    elif mode == WORKER_MODE:
        if isinstance(e, grpc.RpcError) and e.code() in (
Please do not merge it yet. I will look at it soon.
            grpc.StatusCode.UNAVAILABLE,
            grpc.StatusCode.UNKNOWN,
        ):
            raise e
If node.check_version_info raises grpc.RpcError, it means GCS is unreachable, so ray._private.utils.publish_error_to_driver will fail too. Raising the exception early avoids waiting for the publish timeout, which would cost 60s for nothing.
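For illustration, here is a minimal, self-contained sketch of the fail-fast idea described above (not Ray's actual connect() code: `connect_worker` and its callable arguments are hypothetical, while `check_version_info` and `publish_error_to_driver` are the Ray functions named in this thread):

```python
import grpc

GCS_UNREACHABLE_CODES = (
    grpc.StatusCode.UNAVAILABLE,
    grpc.StatusCode.UNKNOWN,
)

def connect_worker(check_version_info, publish_error_to_driver):
    """Fail fast when GCS is unreachable instead of blocking on a doomed publish."""
    try:
        check_version_info()  # may raise grpc.RpcError if GCS is unreachable
    except Exception as e:
        if isinstance(e, grpc.RpcError) and e.code() in GCS_UNREACHABLE_CODES:
            # GCS is down: publishing the error would block on its own ~60s
            # timeout and then fail anyway, so re-raise immediately.
            raise
        # Otherwise GCS is presumed reachable; report the error before raising.
        publish_error_to_driver(e)
        raise
```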
@@ -295,6 +293,8 @@ py_test_module_list(
        "test_placement_group_mini_integration.py",
        "test_scheduling_2.py",
        "test_multi_node_3.py",
        "test_multiprocessing.py",
        "test_placement_group_2.py",
The current PR makes cluster.add_node take about 1 second longer than before. There are nearly 30 cluster.add_node calls in test_placement_group_2.py, which took about 240s before this PR, so I think it is reasonable to increase the time limit of test_placement_group_2 to avoid flaky tests.
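As a rough back-of-envelope check using the figures from this thread (~30 calls, ~1s extra each, ~240s baseline; all approximate numbers under discussion, not fresh measurements):

```python
# Back-of-envelope estimate using the figures discussed in this thread.
add_node_calls = 30          # approximate cluster.add_node calls in test_placement_group_2.py
extra_s_per_call = 1.0       # overhead attributed to this PR (still under discussion)
baseline_runtime_s = 240     # reported runtime before this PR

estimated_runtime_s = baseline_runtime_s + add_node_calls * extra_s_per_call
print(f"estimated runtime after this PR: ~{estimated_runtime_s:.0f}s")
```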
> cluster.add_node takes 1 second more than before

This actually sounds pretty bad... (I feel like it will increase the test runtime too much). Why is this the case?
Force-pushed from 19532a2 to 9f3ebf5
@rkooo567 Hi, could you review this PR soon?
Mostly LGTM, and the fix makes sense. Once we resolve the points below, we can merge!
- I understand the fix for the issue, but why does it happen after this PR? The PR seems unrelated to me.
- There are 519 add_node calls in total (and if we include for loops, it could easily exceed 1000), meaning the test runtime will increase by 10~20 minutes. Maybe this much overhead is acceptable because our test runtime is already pretty long, but I would still like to understand why add_node takes 1 more second.
@rkooo567 Hi:
PS: I found ray/python/ray/_private/node.py line 312 in 1c1cca2.
@rkooo567 about why I think the raylet registering itself to GCS will take 1 more second:
Hmm, this seems too long? Can you actually measure it (maybe log how long it takes to register the agent)?
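For reference, one way such a measurement could look (a minimal sketch, not Ray's code; `register_agent_to_gcs` is a hypothetical stand-in for whatever call actually performs the agent registration):

```python
import logging
import time

logger = logging.getLogger(__name__)

def timed_register(register_agent_to_gcs):
    """Time one registration call and log how long it took (hypothetical helper)."""
    start = time.monotonic()
    register_agent_to_gcs()
    elapsed = time.monotonic() - start
    logger.info("agent registered with GCS in %.3f s", elapsed)
    return elapsed
```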
Force-pushed from ab3043c to a1bd774
@rkooo567 In the current branch, the main time overhead is in the initialization of the agent process.
I see. So most of the overhead is from the process startup.
I will merge this PR when it passes all tests.
LGTM!
@Catch-Bull @rkooo567 Ready to merge?
The failed tune tests are flaky in master.
…to GCSNodeInfo (…" (ray-project#27308)" This reverts commit ccf4116.
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613) This reverts commit ccf4116. Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>
…deInfo (…" (ray-project#27308) Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
…to GCSNodeInfo (…" (ray-project#27308)" (ray-project#27613) This reverts commit ccf4116. Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Reverts #27242
fix flaky test: test_ray_shutdown.py
After the raylet dies, the CoreWorker has two ways to exit.
In the worst case of the second way, it will time out in check_version_info, then try to publish the error to GCS, which also times out; but the publish timeout is 60s while the default timeout of wait_for_condition is 10s, so the test case will fail.
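To show the timing mismatch in code (a minimal sketch: the polling helper below only mimics the `wait_for_condition` test utility mentioned above, `worker_has_exited` in the comment is a hypothetical predicate, and the 60s/10s figures are the ones from this description):

```python
import time

def wait_for_condition(condition, timeout_s=10, retry_interval_s=0.5):
    """Poll `condition` until it returns True or `timeout_s` elapses (sketch only)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(retry_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# If a dying worker first blocks ~60s on a doomed publish to the unreachable GCS,
# a 10s wait_for_condition(lambda: worker_has_exited()) in the test gives up long
# before the worker finishes exiting, so the test flakes. Re-raising the RpcError
# early lets the worker exit well within the 10s window.
```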