Enable saving hyperopt checkpoints with multi-node clusters #2386
Conversation
Unit Test Results: 6 files +1, 6 suites +1, 56s ⏱️ -3h 17m 59s. For more details on these errors, see this check. Results for commit c67aaf9. ± Comparison against base commit 05ece0c. ♻️ This comment has been updated with latest results.
ludwig/utils/fs_utils.py
Outdated
if not storage_options:
    logger.info(f"Using default storage options for `{protocol}` filesystem.")
    if protocol == S3:
        s3 = S3RemoteStorageOptions()
Why is s3 handled specially here? Do we need to do anything for other filesystems?
One way to make this clearer could be:
if not storage_options:
logger.info(f"Using default storage options for `{protocol}` filesystem.")
if protocol == S3:
storage_options = S3RemoteStorageOptions().get_storage_options()
try:
return fsspec.filesystem(protocol, **storage_options)
...
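For reference, here is a minimal, self-contained sketch of the pattern the suggestion above describes. The names `get_filesystem` and `_default_s3_options` are illustrative stand-ins for the PR's `S3RemoteStorageOptions` flow, not the actual implementation:

```python
import logging
import os

import fsspec

logger = logging.getLogger(__name__)

S3 = "s3"


def _default_s3_options() -> dict:
    # Stand-in for the PR's S3RemoteStorageOptions helper: pull credentials
    # from the environment when no explicit options are supplied.
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }


def get_filesystem(protocol: str, storage_options: dict = None):
    """Create an fsspec filesystem, filling in default options only when none are given."""
    if not storage_options:
        logger.info(f"Using default storage options for `{protocol}` filesystem.")
        storage_options = _default_s3_options() if protocol == S3 else {}
    return fsspec.filesystem(protocol, **storage_options)
```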
Following up on my comment below, I now believe that this storage_options handling shouldn't be necessary. Also, this approach has an additional bug, in that it will break the existing use_credentials behavior by overriding it with credentials from the environment (when both are using s3).
@tgaddair S3 is handled specially here because users may want to pass in a different S3-compatible storage like Minio. The issue is that, to do this, users need to pass the --endpoint_url flag to the AWS CLI, because the AWS CLI currently doesn't have an environment variable that can be configured to pick up this URL automatically (although they are currently working on a proposal to add one: aws/aws-sdk#229). Within fsspec, that would involve manually passing in client_kwargs with the endpoint_url defined in the object. I think this is the only reason to do something special for S3 at the moment.
I can't entirely say what would be required for other file systems yet, but those should automatically get created via fsspec.
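For illustration, this is roughly what a user pointing at an S3-compatible store has to do manually today: the endpoint URL goes inside the nested client_kwargs dict that s3fs forwards to the underlying boto client (the URL and credentials below are made up):

```python
import fsspec

# Connect to a Minio (or other S3-compatible) endpoint via fsspec/s3fs.
fs = fsspec.filesystem(
    "s3",
    key="minio-access-key",
    secret="minio-secret-key",
    client_kwargs={"endpoint_url": "http://minio.example.com:9000"},
)
fs.ls("my-bucket")
```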
About the use_credentials method - I think that this actually won't be an issue, because both use_credentials and this approach will end up using the same key and secret, since they're derived from the same environment variables. The only thing that changes is the endpoint URL, which in either case would need to change for S3-compatible storage, since use_credentials would also require this new endpoint URL (if it is specified).
"""Get credentials from environment variables.""" | ||
|
||
def __init__(self): | ||
super().__init__(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, MLFLOW_S3_ENDPOINT_URL) |
Doesn't make sense to me to have mlflow coupling here. Looks like the idea here is to pull storage options from certain environment variables when they're not passed in explicitly. However, this shouldn't be necessary, as fsspec already has a mechanism for plumbing credentials through the environment, either through the use_credentials helper that sets the conf dictionary, or through the FSSPEC_ env vars: https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration
So I would rely on the built-in fsspec plumbing mechanisms instead of recreating them here.
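For reference, the two built-in plumbing mechanisms mentioned above look roughly like this (values are placeholders, not real credentials):

```python
import fsspec

# Option 1: set per-protocol defaults on fsspec's global config dictionary
# (this is the kind of thing a helper like use_credentials would do under the hood).
fsspec.config.conf["s3"] = {"key": "my-access-key", "secret": "my-secret-key"}

# Option 2: export per-kwarg environment variables before the process starts,
# which fsspec picks up automatically, e.g. in the shell:
#   export FSSPEC_S3_KEY=my-access-key
#   export FSSPEC_S3_SECRET=my-secret-key

# Either way, no explicit storage_options are needed when creating the filesystem.
fs = fsspec.filesystem("s3")
```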
@tgaddair the MLFLOW_S3_ENDPOINT_URL is actually just a placeholder name and can be changed to any other name. It doesn't actually end up involving any coupling with MLflow. I'll rename this environment variable to something more generic like S3_ENDPOINT_URL to make this clearer!
I like the idea of using the FSSPEC_ env vars as a potential substitute to avoid plumbing S3 credentials in this particular case with endpoint URLs. Will update and clear this up so that we don't need to do any of this manually.
Although following from the thread here (fsspec/s3fs#432) where you've requested this before as well, it doesn't seem like there's a neat way to specify the endpoint URL since it needs to be set inside a nested dictionary. How do you feel about just using this as an environment variable for now but with a different name?
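A small sketch of the env-var workaround proposed here, using the renamed S3_ENDPOINT_URL variable discussed above (the variable name is tentative and the helper is illustrative, not the PR's actual code):

```python
import os

import fsspec


def s3_filesystem_from_env():
    # The endpoint URL has to live inside the nested client_kwargs dict,
    # so read it from a dedicated environment variable and nest it manually.
    storage_options = {}
    endpoint_url = os.environ.get("S3_ENDPOINT_URL")
    if endpoint_url:
        storage_options["client_kwargs"] = {"endpoint_url": endpoint_url}
    # Key/secret are still picked up from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    # by the underlying boto credential chain.
    return fsspec.filesystem("s3", **storage_options)
```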
ludwig/hyperopt/execution.py
Outdated
def _get_tmp_remote_checkpoint_dir(self, trial_dir: Path) -> Optional[Union[str, Tuple[str, str]]]:
    """Get the path to remote checkpoint directory."""
    if self.sync_config is None:
        return None

    if self.sync_config.upload_dir is not None:
        # Cloud storage sync config
        remote_checkpoint_dir = os.path.join(
            self.sync_config.upload_dir, "tmp", *_get_relative_checkpoints_dir_parts(trial_dir)
        )
        return remote_checkpoint_dir
    elif self.kubernetes_namespace is not None:
        # Kubernetes sync config. Returns driver node name and path.
        # When running on kubernetes, each trial is rsynced to the node running the main process.
        node_name = self._get_kubernetes_node_address_by_ip()(self.head_node_ip)
        return (node_name, trial_dir)
    else:
        logger.warning(
            "Checkpoint syncing disabled as it is only supported to remote cloud storage or on Kubernetes "
            "clusters. To use syncing, set the kubernetes_namespace in the config or use a cloud URI "
- Merge into _get_remote_checkpoint_dir with an optional "tmp" argument to avoid code duplication (a rough sketch of this follows below).
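A rough sketch of what that merge could look like; helper and attribute names follow the snippet quoted above, and everything else is assumed rather than taken from the PR:

```python
def _get_remote_checkpoint_dir(self, trial_dir: Path, tmp: bool = False) -> Optional[Union[str, Tuple[str, str]]]:
    """Get the path to the remote checkpoint directory (or its temporary variant)."""
    if self.sync_config is None:
        return None

    if self.sync_config.upload_dir is not None:
        # Cloud storage sync config; optionally nest under a "tmp" prefix.
        parts = _get_relative_checkpoints_dir_parts(trial_dir)
        if tmp:
            return os.path.join(self.sync_config.upload_dir, "tmp", *parts)
        return os.path.join(self.sync_config.upload_dir, *parts)
    elif self.kubernetes_namespace is not None:
        # Kubernetes sync config: return the driver node name and path.
        node_name = self._get_kubernetes_node_address_by_ip()(self.head_node_ip)
        return (node_name, trial_dir)
    else:
        logger.warning(
            "Checkpoint syncing disabled as it is only supported to remote cloud storage "
            "or on Kubernetes clusters."
        )
        return None
```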
Closing this in favor of #2617
This PR enables two things:
1. A custom SyncClient is created and passed into RayTune for custom checkpoint syncing behavior.
2. fsspec-related functions in fs_utils can now take in custom storage options (credentials) for different protocol types. This means that model metadata, parameters, checkpoints, etc. can now be written to remote storage such as GCS, Azure, AWS, Minio, etc. - anything that fsspec supports.
Both of these changes together enable multi-node checkpoint syncing for hyperopt.
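For illustration, a minimal sketch of the first idea: a sync client that pushes trial checkpoints to remote storage through fsspec. The SyncClient interface (sync_up/sync_down/delete) is assumed from the Ray Tune version in use at the time, and this is not the PR's actual implementation:

```python
import fsspec
from ray.tune.syncer import SyncClient


class FsspecSyncClient(SyncClient):
    """Sync trial checkpoints to any fsspec-supported remote storage."""

    def __init__(self, protocol: str, storage_options: dict = None):
        self.fs = fsspec.filesystem(protocol, **(storage_options or {}))

    def sync_up(self, source: str, target: str, exclude=None) -> bool:
        # Upload the local checkpoint directory to remote storage.
        self.fs.put(source, target, recursive=True)
        return True

    def sync_down(self, source: str, target: str, exclude=None) -> bool:
        # Download a remote checkpoint directory back to the local node.
        self.fs.get(source, target, recursive=True)
        return True

    def delete(self, target: str) -> bool:
        # Remove a remote checkpoint directory.
        self.fs.rm(target, recursive=True)
        return True
```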