Support passing bearer token for authorization (allows downloading protected datasets) and use HTTP content negotiation #79

Open
uschindler opened this issue Aug 26, 2022 · 1 comment

Comments

@uschindler

uschindler commented Aug 26, 2022

Hi,

All HTTP requests sent to PANGAEA (important: use HTTPS only!) should allow the user to send a Bearer token. This makes it possible to download protected PANGAEA datasets. In pangaeapy this is already supported by passing an auth_token parameter to the constructor of the PANGAEA API. The bearer token is a temporary, opaque string that the user can get from a web page after logging in to PANGAEA. It is valid as long as the user is logged in (the timeout is 2 weeks). PANGAEA does not support passing username/password because this would be inherently unsafe: people may accidentally post their R script on GitHub with their password included. If they post a script with a bearer token inside, it is enough to log out from PANGAEA to revoke access and prevent misuse of the account.

In the documentation, ask users to log in to PANGAEA, go to their profile page (https://www.pangaea.de/user/) and copy the token from there. Login MUST be done using a browser; all automated login attempts will be rejected by our servers. The token can be used as long as the user has not logged out. It is recommended to check the box "keep logged in". Users can also log in to PANGAEA with ORCID (that's another reason why user/pass does not work with APIs: many users do not even have a password).

But there is another change we would suggest: please send HTTP requests to PANGAEA dataset pages in a REST-like way, because the current code uses some non-standardized format= parameters that might change soon. In addition, the HTTP client should be prepared to follow redirects (coming soon). The current code also has a lot of if/then/else and sometimes parses XML files to figure out whether datasets are freely accessible or whether they are parents. It does this because it has to guess the data type. It also looks like the code tries not to hammer PANGAEA with useless requests, but this is no problem at all: the response saying that a content type is not supported is cheap, and the HTTP status code comes back fast. I'd do the data download like this:

  • Use the plain dataset DOI as the URL for the download (both "https://doi.pangaea.de/xxxx" and "https://doi.org/xxxx" work, as do other variants). Previously no download was possible with a doi.org URL because the "format=" parameter got lost.
  • Set Authorization: Bearer <token> if a token is available (see above). There is no need to check beforehand whether the dataset is login protected; just always send the token if you have one. If the token is invalid, it is ignored.
  • Set Accept: text/tab-separated-values as a header. This enables content negotiation. As this header does NOT look like a plain stupid browser, the PANGAEA code will switch to real "REST mode" and, for example, respond with correct headers instead of redirecting to the login page if the dataset is password protected and the credentials do not match. So you don't need to guess whether you were redirected and got the HTML login page. A real REST client gets the correct status code and knows: "unauthorized".
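The three steps above could be sketched roughly as follows. This is only an illustration in Python using the standard library, not code from pangaear or pangaeapy; the function name build_dataset_request and its parameters are made up for this example, and "https://doi.pangaea.de/xxxx" is the same placeholder URL as above. The snippet only builds the request and does not send it.

```python
from typing import Optional
from urllib.request import Request

TSV_MEDIA_TYPE = "text/tab-separated-values"


def build_dataset_request(
    doi_url: str,
    token: Optional[str] = None,
    accept: str = TSV_MEDIA_TYPE,
) -> Request:
    """Build (but do not send) a content-negotiation request for a dataset DOI.

    Hypothetical helper: the name and parameters are assumptions for this sketch.
    """
    # The bearer token must never travel over plain HTTP.
    if not doi_url.lower().startswith("https://"):
        raise ValueError("use an HTTPS URL so the token is not sent in clear text")

    # The Accept header is what switches the server to content negotiation.
    headers = {"Accept": accept}

    # Send the token whenever one is available; no prior access check is needed.
    if token:
        headers["Authorization"] = f"Bearer {token}"

    return Request(doi_url, headers=headers)


request = build_dataset_request("https://doi.pangaea.de/xxxx", token="my-opaque-token")
print(request.get_header("Accept"))
print(request.get_header("Authorization"))
```

The request would then be sent with urllib.request.urlopen(request), which follows redirects by default.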

A request built this way should always return the normal tab-separated-values format. There is no need to cross-check the Content-Type of the response or anything like that. The download code only needs to look at the status code:

  • 200 (OK): All went well; you can be sure it is a tab-delimited matrix in PANGAEA format.
  • 401 (Unauthorized): The dataset is protected and the access rights do not match the bearer token (e.g., wrong user), or no bearer token was sent at all. This can be reported as an error message.
  • 406 (Not Acceptable): The format in the Accept header cannot be fulfilled. This happens when the dataset is a parent or another type of collection, or a static URL dataset with a different media type.
  • 404 (Not Found): The dataset does not exist.
  • 429 (Too Many Requests): Wait a few seconds.
  • 5xx: Some server error; 503 in particular means "PANGAEA is down". Report this as a hard error to the user.
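As a rough sketch of how a client might act on this list, the status code can be mapped to a small set of outcomes. The function name interpret_status and the outcome labels are invented for this example and are not part of any PANGAEA client; only the status codes themselves come from the list above.

```python
def interpret_status(status: int) -> str:
    """Map an HTTP status code to a coarse outcome label.

    The labels are hypothetical; a real client would likely raise
    errors or return the parsed table instead.
    """
    outcomes = {
        200: "ok",            # body is a tab-delimited matrix in PANGAEA format
        401: "unauthorized",  # protected dataset, token missing or not matching
        404: "not_found",     # dataset does not exist
        406: "not_tabular",   # parent, collection, or other media type
        429: "retry_later",   # wait a few seconds before trying again
    }
    if status in outcomes:
        return outcomes[status]
    if 500 <= status <= 599:
        return "server_error"  # report as a hard error; 503 means PANGAEA is down
    return "unexpected"        # anything not covered by the list above


for code in (200, 401, 406, 503):
    print(code, interpret_status(code))
```

With Python's urllib, non-2xx responses are raised as urllib.error.HTTPError, and the status is available as the exception's code attribute, so the same mapping can be applied inside an except block.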

If you want to get the native PANGAEA metadata in panmd format, please DO NOT use OAI-PMH. The native PANGAEA metadata can and should also be retrieved by content negotiation: Accept: application/vnd.pangaea.metadata+xml

And finally, to get the citation as a string, use: Accept: text/x-bibliography.

See also these slides: https://docs.google.com/presentation/d/1mJEufjTK0O823Yc4zmsiNLua77_6p3UsBSXaVjx1A54/edit?usp=sharing

@gavinsimpson
Contributor

Thanks for this @uschindler

@naupaka and I have taken over maintaining {pangaear}, so I'm not sure yet how everything you mention affects the package or what would be involved in implementing the behaviours you describe, but we'll take a look and figure out what needs to be done.
