Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed hyperopt trial syncing to remote filesystems for Ray 2.0 #2617

Merged
merged 12 commits into from
Oct 11, 2022

Conversation

tgaddair
Copy link
Collaborator

@tgaddair tgaddair commented Oct 10, 2022

This also revealed that there are issues with remote syncing for Ray 1.13, so we will only be supporting this feature with Ray 2.0 and above.

This PR uses the RemoteSyncer from #2386. The changes to support injecting credentials will come in a follow-up PR.

@github-actions
Copy link

github-actions bot commented Oct 10, 2022

Unit Test Results

         5 files  ±  0         5 suites  ±0   2h 53m 8s ⏱️ - 25m 47s
  3 455 tests +  6  3 379 ✔️ +  5    76 💤 +1  0 ±0 
10 210 runs  +13  9 977 ✔️ +10  233 💤 +3  0 ±0 

Results for commit 2ab6b68. ± Comparison against base commit 05ece0c.

♻️ This comment has been updated with latest results.

@tgaddair tgaddair marked this pull request as ready for review October 10, 2022 22:29
@tgaddair tgaddair changed the title Fixed sync_config for Ray 2.0 Fixed hyperopt trial syncing to remote filesystems for Ray 2.0 Oct 10, 2022
"executor": {
TYPE: "ray",
"num_samples": 1 if search_space == "grid" else RANDOM_SEARCH_SIZE,
"max_concurrent_trials": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgaddair Is there a reason to set max_concurrent_trials to 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, when you run multiple trials on these runners / locally it often ends up making them compete for resources, which slows everything down. Limiting to 1 trial at a time helps to avoid this resource contention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that makes sense! Similar to the problem I was seeing as well - personally I even noticed this happening on my local at times even when not using a RayBackend.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good indicator that 1 CPU per trial is probably too low. In practice, we may want to bump this up.

@arnavgarg1
Copy link
Contributor

arnavgarg1 commented Oct 10, 2022

Maybe this is not in the scope of this PR, but one of the other things we'll need to do is wrap writing some of the hyperopt_statistics to remote storage in a use_credentials block as well over here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/hyperopt/run.py#L385 (which happens after tune.run() returns). This will ensure that hyperopt_statistics.json is also saved to the same remote storage location for retrieval since we write this manually.

It might be worth either modifying an existing test (or adding a new one, although we should probably avoid that) to make sure we can also retrieve hyperopt_statistics.json from the same remote location that we sync to when running hyperopt E2E.

@tgaddair
Copy link
Collaborator Author

Maybe this is not in the scope of this PR, but one of the other things we'll need to do is wrap writing some of the hyperopt_statistics to remote storage in a use_credentials block as well over here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/hyperopt/run.py#L385 (which happens after tune.run() returns). This will ensure that hyperopt_statistics.json is also saved to the same remote storage location for retrieval since we write this manually.

It might be worth either modifying an existing test (or adding a new one, although we should probably avoid that) to make sure we can also retrieve hyperopt_statistics.json from the same remote location that we sync to when running hyperopt E2E.

@arnavgarg1 we do have a test for the existence of this file in the remote here.

It's true that we'll need to do the use_credentials trick if the user doesn't have the credentials in their environment. But I purposely left that out of scope for this PR to focus on just getting syncing working when credentials are being set correctly. In a follow-up, I will add in the plumbing to make use of use_credentials to allow overriding the creds in the environment.

@arnavgarg1
Copy link
Contributor

Sounds good! This looks great, thanks @tgaddair!

Copy link
Contributor

@arnavgarg1 arnavgarg1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this for now, but we should probably land this once the plumbing PR is in as well!

@tgaddair
Copy link
Collaborator Author

Will avoid cherry-picking into the release branch until we have the credential PR in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants