
Modify workflow to allow individual contributions of Tier 1 #117

Open
ksonda opened this issue Apr 21, 2022 · 6 comments

Comments

@ksonda
Collaborator

ksonda commented Apr 21, 2022

There may be an upcoming activity prioritizing the harvesting of Tier 1 boundaries from the remaining "Very Large" systems, and it should be possible to integrate these into the existing workflow without too much fanfare.

A proposal:

  1. Set up /contributions-tier1/{state} subdirectories.
  2. Authorized contributors place individual {st}{pwsid}.geojson files of Tier 1 boundaries in the appropriate state folder.
  3. Add or modify src/transformers/states/transform_wsb_{st}.R as appropriate to merge these new Tier 1 boundaries in prior to the match and modeling steps.
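The merge in step 3 could look roughly like the following. The repo's transformers are written in R, but here is a language-agnostic sketch in Python using only the standard library; the function names, the tier property, and the replace-on-pwsid policy are my assumptions for illustration, not the project's API:

```python
import json
from pathlib import Path

def load_contributions(state_dir):
    """Read every {st}{pwsid}.geojson file in a contributions-tier1/{state}/
    folder; return a dict of GeoJSON features keyed by pwsid (taken from
    the file name, e.g. "TX1234567.geojson" -> "TX1234567")."""
    contributed = {}
    for path in sorted(Path(state_dir).glob("*.geojson")):
        with open(path) as f:
            gj = json.load(f)
        # A contribution may be a bare Feature or a one-feature FeatureCollection.
        feats = gj["features"] if gj.get("type") == "FeatureCollection" else [gj]
        for feat in feats:
            props = feat.setdefault("properties", {})
            props["pwsid"] = path.stem
            props["tier"] = 1          # mark as a contributed Tier 1 boundary
            contributed[path.stem] = feat
    return contributed

def merge_tier1(state_features, contributed):
    """Fold contributed Tier 1 boundaries into the state transformer's
    output before the match/model steps; a contributed boundary replaces
    any existing feature with the same pwsid."""
    merged = {f["properties"]["pwsid"]: f for f in state_features}
    merged.update(contributed)  # Tier 1 contributions win on conflict
    return {"type": "FeatureCollection", "features": list(merged.values())}
```

The key design choice here is keying everything on pwsid so a one-off contribution cleanly supersedes whatever the state transformer produced for that system.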
@jess-goddard
Contributor

Thanks for this suggestion, @ksonda.

I agree fully with points 2-3: having the individual state transformer files merge in new Tier 1 boundaries will be critical, and this should be easy to incorporate as new data becomes available. Given that you're suggesting individual pwsid boundaries rather than state-level ones, your suggestion that we modify the original state transformer to incorporate the one-off pwsid boundaries is straightforward and should be implemented when we have that data. Detailed commenting and a modified developer guide can support this change pretty seamlessly.

My recommendation for point 1 is that, rather than maintaining the data in subdirectories on GitHub, we ask states to host their own FTP, Drive folder, or site from which we can pull data at a reliable, maintained URL. This is the current workflow arrangement, where all incoming data is brought in from upstream sources. This ensures that 1) upstream data has a clear, reproducible source; 2) there are no conflicts between GitHub data and state/agency-maintained data as changes happen over time; and 3) we do not risk hitting GitHub's 100 MB file size limit (unlikely for individual pwsids, but I could easily see a state offering a smaller subset of data with many pwsids).

In short, the repository is designed to ingest, transform, and load external data, not to store and maintain it, which is a formidable task to do well if the data is to remain current and accessible beyond the repository.

@ksonda
Collaborator Author

ksonda commented Apr 30, 2022

Thanks @jess-goddard. I see that there are good reasons to separate this repository from data storage. Regarding the "states host their own FTP" recommendation, I agree fully for large aggregations that might be made available by more states. The issue is that in the short term there will likely be an EPIC-led activity to source the ~200 "very large" systems directly from the relevant utilities, which are generally in states that do not currently have any kind of boundary collection program.

This process will require a publicly visible submission and version-tracking mechanism of its own, so it is transparent which individual boundaries were submitted by whom and with what underlying source, and so the data can be superseded by state sources if and when that is appropriate. GitHub is as good an option as any at this scale, since:

  • The submission mechanism will only operate on this collection of individual boundaries for a short period of time.
  • GitHub size limits will not be relevant for individual boundaries.
  • EPIC does not want to be in the business of maintaining, for a long period of time, whatever boundaries do need to be web accessible as state-based FeatureCollections on something like SharePoint/GDrive/Dropbox.

Perhaps EPIC and I need to coordinate on creating a separate repo with this directory structure. Steps 2-3 can then be implemented against those URLs.

@jess-goddard
Contributor

@ksonda Yes, I see the value in what you're suggesting!

I like the idea of modularizing the data uploads into a small repo just for that purpose, but we can also discuss offline the pros/cons of keeping it separate from this one. Let's connect when I'm back in the office May 17.

@ksonda
Collaborator Author

ksonda commented May 25, 2022

I've mocked something up here: https://github.com/cgs-earth/national-cws-boundary-update

@ksonda
Collaborator Author

ksonda commented Aug 3, 2022

We now have a contribution workflow set up here: https://github.com/cgs-earth/ref_pws

It generates/updates a geopackage here any time a contribution is made: https://www.hydroshare.org/resource/c9d8a6a6d87d4a39a4f05af8ef7675ad/data/contents/contributed_pws.gpkg

Ping me if this is of interest.

@jess-goddard
Contributor

@ksonda Great, we have it on our agenda to connect with you this month about an integration.
