Proposed Recipes for eVolv2k_v3 #164
Comments
@cisaacstern As I see it, URLs will inevitably change, and recipes will need to be updated accordingly, which raises a question: what is the Pangeo Forge (PF) perspective on small datasets stored in public GitHub repos? I exchanged emails with the author of this dataset (Matt Toohey), and he suggested making it available from his own GitHub repo to sidestep this particular authentication issue. Is that an accepted approach? (I imagine this is not relevant in most cases because most datasets are prohibitively large, but this is probably not the only instance of its kind.)
@jordanplanders, if the data is small enough to host on GitHub, perhaps hosting on Zenodo is preferable? Re: side-stepping auth this way, let's just make sure that the data provider's authentication requirement does not mean that hosting a publicly accessible mirror runs afoul of the license. Generally, I'm thrilled to see any use of Pangeo Forge, though if the data is small enough to host on GitHub, perhaps it's worth asking what value, if any, Pangeo Forge + Zarr adds here?
@cisaacstern Yep! Very valid! This is one of those moments when I either need a law degree or practice deciphering licensing (I guess those are sort of the same), but by default it is CC4, and the FAQs suggest that the authentication is a catch-all because some hosted datasets require that users be granted permission to download. So if the dataset isn't actually in that protected category, it seems like it wouldn't be problematic to mirror it elsewhere. As for the question of "why PF+Zarr for a small dataset?", my instinct was that a "one stop shop" approach might be the most effective way to keep the friction involved in moving to a Python-based, transparent analysis culture built on multiple and varied datasets below the surrender threshold. For folks who aren't natively data science-style data wranglers (probably particularly true among those who work with small datasets), my hunch is that navigating multiple sources and protocols might result in some attrition. Does any of that ring relevant? I'm not sure about this, but it seems like the slickest way to access and work with data in a cloud hub working environment, which is becoming more common, I think.
Yep, that makes sense to me. Thanks for thinking through it out loud. Among other things, hopefully these conversations may be useful to others contemplating similar things down the line. In terms of where to host some mirror of the data outside the auth wall (as a stopgap until Pangeo Forge supports user-supplied credentials), I think Zenodo may be the more appropriate choice, but if GitHub is easier and you want to experiment with that, I don't see any reason not to.
@cisaacstern Great! I'll talk to Matt about Zenodo and revisit the recipe with an eye toward the various things I've learned recently.
Looking forward to the PR! Please let me know if/how I can help.
@cisaacstern I'm still waiting to hear from Matt about whether he wants to use Zenodo, but in case others want to point to files stored in GitHub in the future, it's worth knowing that URLs that point to
I chased it around with
This file will work:
but neither of these will:
Dataset Name
eVolv2k_v3
Dataset URL
https://www.wdc-climate.de/ui/entry?acronym=eVolv2k_v3_ds
Description
The eVolv2k database includes estimates of the magnitudes and approximate source latitudes of major volcanic stratospheric sulfur injection (VSSI) events from 500 BCE to 1900 CE.
License
https://www.wdc-climate.de/ui/info?site=termsofuse
Data Format
NetCDF
Data Format (other)
No response
Access protocol
Other
Source File Organization
There is only one file with data variables corresponding to year, yearCE, month, day, latitude, hemi, vssi, and vssi sigma. The file does not have any declared coordinates.
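For anyone unfamiliar with what "no declared coordinates" means in practice, here is a minimal sketch in xarray. It is not the recipe from this issue: the dimension name `event` is hypothetical, the variable names are borrowed from the list above (the exact names in the NetCDF file may differ), and the numbers are arbitrary placeholders rather than values from eVolv2k_v3.

```python
import numpy as np
import xarray as xr

# Stand-in dataset with arbitrary placeholder values (NOT eVolv2k_v3 data).
# "event" is a hypothetical dimension name used only for illustration.
ds = xr.Dataset(
    {
        "yearCE": ("event", np.array([100, 200, 300])),
        "latitude": ("event", np.array([0.0, 45.0, -10.0])),
        "vssi": ("event", np.array([1.0, 2.0, 3.0])),
    }
)

# With no declared coordinates, everything is a plain data variable.
assert len(ds.coords) == 0

# Promote yearCE to a coordinate and use it to index the event dimension.
ds_indexed = ds.set_coords("yearCE").swap_dims({"event": "yearCE"})

# Label-based selection now works on the promoted coordinate.
selected = float(ds_indexed["vssi"].sel(yearCE=200))
```

Whether `yearCE` is the right variable to promote depends on how the data will be used; if two events share a year, label-based selection on it will return more than one value.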
Example URLs
No response
Authorization
Username / Password
Transformation / Processing
No response
Target Format
Zarr
Comments
This dataset is available from WDC-Climate. Part of the website indicates that the data is only available with credentials, via their JBLOB interface or the web UI (the source code is fairly dense JavaScript to my relatively untrained eye), but perhaps there is another access point via Swift (swift.dkrz.de/) (https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/index.html#dkrz-data-pool). I made a brief (and unsuccessful) attempt at using fsspec_open_kwargs to pass credentials, though based on what I have seen, I'm not surprised that it didn't work. Using GitHub as a temporary location for the data, I got this to work based on examples: