Data Sources
The pipeline uses the latest dataset provided by the SBA as of June 30, 2023, covering all loans greater than $150K. The original pipeline used the dataset provided by the SBA on June 21, 2021. Both datasets are processed within the data repositories and are identified by corresponding labels in the notebook script names. As updated datasets became available, the versions were compared against each other.
[insert dictionary doc]
The pipeline integrates the latest publicly available NAICS code datasets (as of 2022) to pair each loan with its Industry, Industry Sub-sector, and Long Name.
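The NAICS pairing can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the lookup tables, the `describe_naics` helper, and the field names are hypothetical, and a real run would load the tables from the NAICS reference files. It relies on the NAICS convention that the first two digits of a code identify the sector and the first three identify the sub-sector.

```python
# Hypothetical lookup tables with a single real NAICS entry each;
# the actual pipeline would load the full 2022 NAICS reference data.
SECTORS = {"72": "Accommodation and Food Services"}
SUBSECTORS = {"722": "Food Services and Drinking Places"}
TITLES = {"722511": "Full-Service Restaurants"}


def describe_naics(code):
    """Return industry, sub-sector, and long name for a 6-digit NAICS code.

    Unknown codes yield None for the fields that cannot be matched.
    """
    code = str(code).strip()
    return {
        "industry": SECTORS.get(code[:2]),
        "industry_subsector": SUBSECTORS.get(code[:3]),
        "long_name": TITLES.get(code),
    }


# Attach the three components to a (made-up) loan record.
loan = {"loan_id": 1, "naics_code": "722511"}
loan.update(describe_naics(loan["naics_code"]))
```

Sectors that span several two-digit prefixes (for example Manufacturing, 31 through 33) would need one entry per prefix in the sector table.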
[insert 2 DataFrames]
The pipeline uses the portal to obtain SBA District Office Names, which serve as labels for the SBA Office Codes contained in the raw data source.
U.S. BUREAU OF LABOR STATISTICS
Used to reference standard ranges of company size (number of employees).
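As a rough sketch of how an employee count could be mapped to a size range: the boundaries below follow the nine BLS establishment size classes, but whether the pipeline uses exactly these ranges is an assumption, and `size_class` is a hypothetical helper name.

```python
from bisect import bisect_right

# Lower bound of each size class (assumed to match the ranges the pipeline uses).
LOWER_BOUNDS = [1, 5, 10, 20, 50, 100, 250, 500, 1000]
LABELS = ["1-4", "5-9", "10-19", "20-49", "50-99",
          "100-249", "250-499", "500-999", "1000+"]


def size_class(employees):
    """Return the size-range label for an employee count, or None if unknown."""
    if employees is None or employees < 1:
        return None
    # bisect_right counts how many lower bounds are <= employees;
    # subtracting one gives the index of the matching class.
    return LABELS[bisect_right(LOWER_BOUNDS, employees) - 1]
```

For example, `size_class(12)` falls in the "10-19" range, and a missing or zero count returns `None` rather than being forced into the smallest class.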
Each state is processed in intervals capped at 40k addresses, using the Google API geocoder to obtain coordinates for each business address.
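The 40k interval processing could look something like the sketch below. It only shows the batching step; the `batches` helper and `BATCH_LIMIT` name are hypothetical, and the actual geocoder call is left out because its details are not described here.

```python
from itertools import islice

BATCH_LIMIT = 40_000  # assumed to correspond to the 40k limit mentioned above


def batches(rows, limit=BATCH_LIMIT):
    """Yield consecutive lists of at most `limit` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, limit))
        if not chunk:
            return
        yield chunk


# Example: a state with 100,000 addresses is split into three intervals.
addresses = [f"address {i}" for i in range(100_000)]
sizes = [len(chunk) for chunk in batches(addresses)]
```

Each chunk can then be sent to the geocoder in its own run, so a limit reached partway through a state only affects the current interval.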
U.S. Census Geocoder Coordinates
Used to obtain FIPS codes for previously obtained Latitude and Longitude coordinates.
Used to obtain FIPS codes together with Latitude and Longitude coordinates from a one-line address when Google API limits are reached.
Used for pairing ShapeFiles (State, County, Block Group, and Block) with FIPS Codes for geospatial map bounds.
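To pair a record with shapefiles at each level, the block-level FIPS code can be split into its nested parts. The sketch below relies on the standard Census GEOID layout for a block (2-digit state, 3-digit county, 6-digit tract, 4-digit block, with the block group given by the first digit of the block); the function name and output keys are hypothetical, and the example GEOID is illustrative rather than taken from the data.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into keys for each geographic level."""
    # Restore a leading zero that is lost if the code was stored as an integer.
    geoid = str(geoid).strip().zfill(15)
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],          # state FIPS
        "county": geoid[:5],         # state + county
        "tract": geoid[:11],         # state + county + tract
        "block_group": geoid[:12],   # tract + first digit of the block
        "block": geoid,              # full block GEOID
    }


# Illustrative GEOID stored as an integer, so the leading zero is missing.
keys = split_block_geoid(60371234561001)
```

Each value in `keys` is the full identifier at that level, so it can be matched directly against the GEOID column of the corresponding State, County, Block Group, or Block shapefile.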
Used for reducing the file size of TIGERLINE_Shapefiles.