Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research Request - download NTD 2022 data #949

Closed
tiffanychu90 opened this issue Nov 15, 2023 · 2 comments · Fixed by #999
Closed

Research Request - download NTD 2022 data #949

tiffanychu90 opened this issue Nov 15, 2023 · 2 comments · Fixed by #999
Assignees
Labels
data Work related to the management of data ntd National Transit Database research request Issues that serve as a request for research (summary and handoff)

Comments

@tiffanychu90
Copy link
Member

tiffanychu90 commented Nov 15, 2023

Complete the below when receiving a research request, and continue to add to this issue as you receive additional details and produce deliverables. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).

Research Question

Single sentence description: Download the latest year...2022 NTD data

Detailed description:

  • warehouse raw source table
  • warehouse mart table
  • Notebook for exploring whether ntd_id changed quite a bit from 2021 data
  • Slack thread

Deliverables

  • Identified relevant scripts in data-infra with DAG that calls this scrape_ntd script
  • Notebook for ntd_id comparison
  • Depending on results of ntd_id...more work to reconcile ids, or we can append to our mart table, and also update dim_organizations
  • GCS bucket: calitp-ntd-data-products and test-calitp-ntd-data-products...what's the difference?
@tiffanychu90 tiffanychu90 added the research request Issues that serve as a request for research (summary and handoff) label Nov 15, 2023
@tiffanychu90 tiffanychu90 self-assigned this Nov 15, 2023
@tiffanychu90 tiffanychu90 added the data Work related to the management of data label Nov 15, 2023
@tiffanychu90
Copy link
Member Author

tiffanychu90 commented Nov 15, 2023

  • Added this _env file and loaded the bucket name in for CALITP_BUCKET__NTD_DATA_PRODUCTS:
  • added import dotenv and dotenv.load_dotenv("_env")
  • in terminal: python scrape_ntd.py annual-database-agency-information 2022 https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-10/2022%20Agency%20Information_2.xlsx
  • make sure to grab the Excel url (like above), not just the website url

@tiffanychu90
Copy link
Member Author

tiffanychu90 commented Dec 5, 2023

  • Debugging errors that come up Slack thread
  • dbt seed file is a csv crosswalk -- we do need one
  • there are some that need to be manually reconciled via analyst helping create this crosswalk
    • notebook shows more stringent merges on id and name will result in many not matching
    • but, if we loosen the merge columns and parse ntd_id into a prefix/suffix portion and merge on the parsed...a lot of these will merge between 2021 and 2022
    • create new column called ntd_id_2022
    • create reporter table, which is ntd_id-reported_by_name-year...which will sit between the raw export with what we want to show in mart_ntd
    • reduce mart_ntd to 1 row per ntd_id-year
  • further exploration: if in 2022, there are 2 separate reported_by_name rows, in the past, was it just 1? why do multiple appear now? On NTD site, there is a new definition for ntd_id, where the one submitting does have a prefix attached
  • Assign to analyst: seed file: 2021 ntd_id to2022 ntd_id...can get at it in notebook, but ultimately distill down to 2 columns to use in seed file
  • reporter matching we can get via joins after that

@shweta487 shweta487 linked a pull request Jan 19, 2024 that will close this issue
@tiffanychu90 tiffanychu90 added the ntd National Transit Database label May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Work related to the management of data ntd National Transit Database research request Issues that serve as a request for research (summary and handoff)
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants