Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DRS URI localization. #148

Merged
merged 14 commits into from
Sep 17, 2024
Merged

Implement DRS URI localization. #148

merged 14 commits into from
Sep 17, 2024

Conversation

dheiman
Copy link
Contributor

@dheiman dheiman commented Apr 5, 2024

Update to use DRS URIs, and attempt to use them on detecting GDC API URLs.

This feature takes advantage of Terra's DRSHub. So long as the user is using the same account they registered with to Terra, and has their external identities properly linked in their profile, this will enable acquiring signed URLs for data download, which is several times quicker than downloading via the GDC API.

This also improves compatibility with Terra workspaces, which may use DRS URIs in place of bucket paths.

@dheiman dheiman requested a review from julianhess April 5, 2024 19:03
@@ -376,13 +395,14 @@ def localization_command(self, dest):
## GDC HTTPS URLs {{{
class HandleGDCHTTPURL(FileType):
localization_mode = "url"
gdc_drs_root = "drs://dg.4dfc:"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will DRS URLs always have this root?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDC ones will, yes, that is the one for NCI CRDC.

self._size = int(headers["Content-Length"])
self._hash = headers["Content-MD5"]
self.uri = type(self).gdc_drs_root + self.uuid
self.drs_obj = HandleDRSURI(self.uri, **self.extra_args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to handle the logic of checking if a drs:// URL exists for a GDC HTTP URL besides try/catch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the Python way! It takes advantage of the fact that a straight call to DRS will fail if it doesn't exist.

Copy link
Contributor Author

@dheiman dheiman Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a little smarter and look for the 404 or 500 error return code, but 404 means the DRS isn't there, and 500 would be a server error, in both of which cases I'd think we'd want to fall back to using the API instead.

Maybe the 500 might be worth several retries, but they just tested the API recently to show it can handle 30K requests/s without falling over.

signed_url = f'$(curl -S -X POST --url "{type(self).drs_resolver}" ' + \
'-H "authorization: Bearer $(gcloud auth print-access-token)" ' + \
f'-H "content-type: application/json" --data \'{data_str}\' | ' + \
'python3 -c \'import json,sys; print(json.load(sys.stdin)["accessUrl"]["url"])\')'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol nice hack

@julianhess julianhess merged commit 421e505 into master Sep 17, 2024
1 of 3 checks passed
@julianhess julianhess deleted the drs_via_drshub branch September 17, 2024 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants