Move US to the class based implementation #114

Merged: 31 commits merged into main from us_with_class on Jul 10, 2024

@stuartlynn (Collaborator) commented Jun 14, 2024

Moves the US code to the new class-based system.

As far as I can tell, we can reuse a lot of the code from the original dagster implementation here: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/us_census/us/us_census/census_tasks.py

Most of the processing for the US census metrics is done by this asset: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/2dfef3f2aa6fe9c0a3aedceb92b6e2520c100eb1/us/us_census/census_tasks.py#L252-L274 which processes the data for multiple levels of the geography hierarchy.

For a base data repository like the 2020 one (https://www2.census.gov/programs-surveys/acs/summary_file/2020/prototype/), the steps are:

  1. Get a list of table filenames: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/2dfef3f2aa6fe9c0a3aedceb92b6e2520c100eb1/us/us_census/census_tasks.py#L191-L201
  2. Get the geometry ids file: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/2dfef3f2aa6fe9c0a3aedceb92b6e2520c100eb1/us/us_census/census_tasks.py#L203-L210
  3. For each table in the list from step 1, download it (https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/2dfef3f2aa6fe9c0a3aedceb92b6e2520c100eb1/us/us_census/census_tasks.py#L227) and extract the values for the different geometry levels (https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/2dfef3f2aa6fe9c0a3aedceb92b6e2520c100eb1/us/us_census/census_tasks.py#L215)
  4. Finally, merge these into one table per geometry level
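
As a rough illustration of steps 3 and 4, here is a minimal stdlib-only sketch that works on in-memory rows rather than downloaded files. It is not the code in census_tasks.py: the function names, the row layout (dicts with a GEO_ID key) and the toy ids are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical lookup built from the geometry ids file: GEO_ID -> level.
# The ids here are made-up placeholders, not real census identifiers.
GEO_LEVELS = {"g1": "county", "g2": "tract", "g3": "tract"}


def split_by_level(rows, geo_levels):
    """Group one table's rows by geometry level, skipping unknown ids."""
    by_level = defaultdict(list)
    for row in rows:
        level = geo_levels.get(row["GEO_ID"])
        if level is not None:
            by_level[level].append(row)
    return dict(by_level)


def merge_tables(tables, geo_levels):
    """Merge several tables into one wide record per GEO_ID for each level."""
    merged = defaultdict(dict)
    for rows in tables:
        for level, level_rows in split_by_level(rows, geo_levels).items():
            for row in level_rows:
                record = merged[level].setdefault(row["GEO_ID"], {})
                # Copy every metric column; GEO_ID is already the record key.
                record.update({k: v for k, v in row.items() if k != "GEO_ID"})
    return {level: dict(records) for level, records in merged.items()}


if __name__ == "__main__":
    table_a = [{"GEO_ID": "g1", "A": 1}, {"GEO_ID": "g2", "A": 2}]
    table_b = [{"GEO_ID": "g1", "B": 3}, {"GEO_ID": "g3", "B": 4}]
    print(merge_tables([table_a, table_b], GEO_LEVELS))
```

In the real pipeline the rows would come from the downloaded table files, and a dataframe join on the geometry id would likely replace the dictionary merge.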

To generate the metadata, you can follow this: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/37d6ef0ab304b4e791955f344f58ec44f3bc1700/python/popgetter/assets/us/__init__.py#L87-L142 which uses the old format for the metadata but might still be useful. We use the table hierarchy (which is a little different in the US system from the other countries) to construct the human-readable name, and I think the same approach could generate an HXL tag pretty easily.
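
To make the table-hierarchy idea concrete, here is a small sketch that builds a human-readable name for each row by joining its label with the labels of its ancestors. It assumes each shell row carries an indent depth, a label and a unique id; the key names, the " / " separator and the sample rows are illustrative assumptions, not the layout used in the linked code.

```python
def hierarchical_names(shell_rows):
    """Map each row's unique id to a name built from its ancestor labels.

    Rows must be in file order; "indent" is the nesting depth (0 = top level).
    """
    stack = []  # labels of the current ancestor chain, one per indent level
    names = {}
    for row in shell_rows:
        # Drop labels at this depth or deeper, then push the current label.
        del stack[row["indent"]:]
        stack.append(row["label"].rstrip(":"))
        names[row["unique_id"]] = " / ".join(stack)
    return names


if __name__ == "__main__":
    # Toy rows for illustration only.
    rows = [
        {"unique_id": "T_001", "indent": 0, "label": "Total:"},
        {"unique_id": "T_002", "indent": 1, "label": "Male:"},
        {"unique_id": "T_003", "indent": 2, "label": "Under 5 years"},
        {"unique_id": "T_004", "indent": 1, "label": "Female:"},
    ]
    for uid, name in hierarchical_names(rows).items():
        print(uid, name)
```

The same ancestor chain could be mapped to tag attributes instead of being joined into a string if an HXL tag is wanted.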

To get a feel for how the US census tables are arranged you can look at one of the shell files (https://www2.census.gov/programs-surveys/acs/summary_file/2020/prototype/ACS2020_Table_Shells.csv).

The geometry files are generated with this code: https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/37d6ef0ab304b4e791955f344f58ec44f3bc1700/python/popgetter/assets/us/__init__.py#L241-L265

Lots of docs for the ACS here : https://www.census.gov/programs-surveys/acs/data/data-via-api.html


Remaining tasks:

  • Update to base class
    • This is now implemented, but could probably be improved by refactoring away the requirement that the partitions match the metric output file naming. It would then no longer be necessary to partition over county/block group/tract, which would reduce the requests for a given census table by a factor of 3
  • Check the missing description/human readable name for 2021
  • Rebase on main
  • Add the additional metrics asset and send-to-sensor calls needed for the changes from "Add IO managers/multi-assets to handle metrics and metadata separately" #132
    • This is partially implemented (for metadata, not for partitioned assets). The partitioned assets might need a different approach for the sensor; for now the IO manager is used directly on the leaf partitioned asset instead
  • Add the seven manually derived metrics
  • Run full pipeline
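
The factor-of-3 point in the first task can be checked with a small counting sketch. It compares partition keys that include the geometry level with keys that do not. The key formats, table names and level names below are hypothetical and are not the partition keys used in this PR.

```python
from itertools import product

# Assumed level names for the three geometry levels mentioned above.
GEO_LEVELS = ("county", "tract", "block_group")


def partition_keys(years, tables, geo_levels=None):
    """Build partition keys, optionally with one key per geometry level."""
    if geo_levels is None:
        return [f"{year}/{table}" for year, table in product(years, tables)]
    return [
        f"{year}/{level}/{table}"
        for year, level, table in product(years, geo_levels, tables)
    ]


if __name__ == "__main__":
    years = ["2020", "2021"]
    tables = ["table_a", "table_b"]  # placeholder table names
    per_level = partition_keys(years, tables, GEO_LEVELS)
    per_table = partition_keys(years, tables)
    # Each per-level partition downloads its table again, so the number of
    # requests scales with the number of geometry levels.
    print(len(per_level), len(per_table))
```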

@sgreenbury (Collaborator)

Outstanding issues for future PR:

  • Implement asset sensor for partitioned metrics output (needed new IO managers for metrics and metric metadata)
  • ...

@sgreenbury sgreenbury merged commit cd0255c into main Jul 10, 2024
8 checks passed
@sgreenbury sgreenbury deleted the us_with_class branch July 10, 2024 16:08
@sgreenbury sgreenbury mentioned this pull request Aug 8, 2024
3 participants