Fixed hyperopt trial syncing to remote filesystems for Ray 2.0 #2617

tgaddair · 2022-10-10T16:42:24Z

This also revealed that there are issues with remote syncing for Ray 1.13, so we will only be supporting this feature with Ray 2.0 and above.

This PR uses the RemoteSyncer from #2386. The changes to support injecting credentials will come in a follow-up PR.
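For illustration, the version restriction described above could be expressed as a small gate like the one below. This is a minimal sketch, not the code in this PR: the helper names `parse_major_minor` and `supports_remote_sync` are hypothetical, and the only fact taken from the description is that remote syncing is supported on Ray 2.0 and above.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.0.0' or '1.13.0rc1'."""
    parts = version.split(".")
    major = int(parts[0])
    minor_text = parts[1] if len(parts) > 1 else "0"
    # Keep only the leading digits so suffixes like 'rc1' are ignored.
    digits = ""
    for ch in minor_text:
        if not ch.isdigit():
            break
        digits += ch
    return major, int(digits or 0)


def supports_remote_sync(ray_version: str) -> bool:
    """Hypothetical gate: remote trial syncing is only enabled for Ray >= 2.0."""
    return parse_major_minor(ray_version) >= (2, 0)
```

A caller could pass `ray.__version__` to `supports_remote_sync` and raise a clear error, or skip the relevant tests, when it returns `False`.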

github-actions · 2022-10-10T18:41:46Z

Unit Test Results

     5 files ± 0      5 suites ± 0   2h 53m 8s ⏱️ - 25m 47s
 3 455 tests + 6  3 379 ✔️ + 5     76 💤 +1   0 ❌ ±0
10 210 runs  +13  9 977 ✔️ +10    233 💤 +3   0 ❌ ±0

Results for commit 2ab6b68. ± Comparison against base commit 05ece0c.

♻️ This comment has been updated with latest results.

arnavgarg1 · 2022-10-10T22:54:59Z

tests/integration_tests/test_hyperopt.py

+        "executor": {
+            TYPE: "ray",
+            "num_samples": 1 if search_space == "grid" else RANDOM_SEARCH_SIZE,
+            "max_concurrent_trials": 1,


@tgaddair Is there a reason to set max_concurrent_trials to 1?

Yeah, when you run multiple trials on these runners / locally it often ends up making them compete for resources, which slows everything down. Limiting to 1 trial at a time helps to avoid this resource contention.

Got it, that makes sense! It's similar to the problem I was seeing as well. I've even noticed this happening on my local machine at times, even when not using a RayBackend.

It's a good indicator that 1 CPU per trial is probably too low. In practice, we may want to bump this up.
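As a rough sketch of the executor settings discussed in this thread, the snippet below builds the same dictionary outside the test. It assumes that the `TYPE` constant resolves to the string `"type"` and uses a placeholder value for `RANDOM_SEARCH_SIZE`, since the real value is not shown in the diff; `make_executor_config` is a hypothetical helper name.

```python
# Placeholder value; the actual constant in the test module is not shown in the diff.
RANDOM_SEARCH_SIZE = 2


def make_executor_config(search_space: str) -> dict:
    """Build a hyperopt executor section that runs one trial at a time."""
    return {
        # Assumption: TYPE in the original diff is the string "type".
        "type": "ray",
        # Grid search needs a single pass; random search draws several samples.
        "num_samples": 1 if search_space == "grid" else RANDOM_SEARCH_SIZE,
        # One trial at a time avoids trials competing for CPU on small runners.
        "max_concurrent_trials": 1,
    }
```

With this setting, trials run sequentially, which trades total wall-clock parallelism for more predictable per-trial runtime on resource-constrained machines.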

arnavgarg1 · 2022-10-10T23:17:40Z

Maybe this is not in the scope of this PR, but one of the other things we'll need to do is wrap writing some of the hyperopt_statistics to remote storage in a use_credentials block as well over here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/hyperopt/run.py#L385 (which happens after tune.run() returns). This will ensure that hyperopt_statistics.json is also saved to the same remote storage location for retrieval since we write this manually.

It might be worth either modifying an existing test (or adding a new one, although we should probably avoid that) to make sure we can also retrieve hyperopt_statistics.json from the same remote location that we sync to when running hyperopt E2E.

tgaddair · 2022-10-10T23:24:02Z

@arnavgarg1 we do have a test for the existence of this file in the remote here.

It's true that we'll need to do the use_credentials trick if the user doesn't have the credentials in their environment. But I purposely left that out of scope for this PR to focus on just getting syncing working when credentials are being set correctly. In a follow-up, I will add in the plumbing to make use of use_credentials to allow overriding the creds in the environment.
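To make the idea of overriding credentials in the environment concrete, here is a minimal sketch of a context manager that temporarily replaces environment variables and restores them afterwards. It is not Ludwig's `use_credentials` implementation, whose signature is not shown in this conversation; `override_env` and the variable names in the usage note are hypothetical.

```python
import contextlib
import os
from typing import Dict, Iterator


@contextlib.contextmanager
def override_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Temporarily set environment variables, restoring the old values on exit."""
    # Remember the previous value of each key (None means it was unset).
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old_value in saved.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```

A write to remote storage, such as saving hyperopt_statistics.json after `tune.run()` returns, could then be wrapped as `with override_env({"EXAMPLE_ACCESS_KEY": "..."}):` so that the supplied credentials apply only for the duration of that write.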

arnavgarg1 · 2022-10-10T23:35:00Z

Sounds good! This looks great, thanks @tgaddair!

arnavgarg1

Approving this for now, but we should probably land this once the plumbing PR is in as well!

tgaddair · 2022-10-10T23:40:25Z

Will avoid cherry-picking into the release branch until we have the credential PR in.

tgaddair added 10 commits October 4, 2022 16:06

tests: Added test to verify s3 artifact upload and download

c2f505a

Added more tests

a02eee6

Fixed upload

130be06

Added ray backend

f04d223

Moved to utils

1d05d33

WrapSyncer

adce9f0

Merge

5c5acec

Added syncer

0a71bf1

Merge branch 'master' into wrap-syncer

a4be13f

Fixed RemoteSyncer

a4d7e84

Fixed ray 1.13 tests

4fa404b

tgaddair marked this pull request as ready for review October 10, 2022 22:29

tgaddair changed the title ~~Fixed sync_config for Ray 2.0~~ Fixed hyperopt trial syncing to remote filesystems for Ray 2.0 Oct 10, 2022

tgaddair requested review from arnavgarg1 and ShreyaR October 10, 2022 22:30

tgaddair added the release-0.6 label Oct 10, 2022

arnavgarg1 reviewed Oct 10, 2022

Update syncer.py

2ab6b68

arnavgarg1 approved these changes Oct 10, 2022

tgaddair removed the release-0.6 label Oct 10, 2022

arnavgarg1 mentioned this pull request Oct 11, 2022

Enable saving hyperopt checkpoints with multi-node clusters #2386

Closed

tgaddair merged commit d8a0d8f into master Oct 11, 2022

tgaddair deleted the wrap-syncer branch October 11, 2022 00:52
