Data Sources

Ciara Spencer edited this page Aug 31, 2023 · 4 revisions

Raw Data Sources

U.S. PPP Loan Data

The pipeline uses the latest dataset provided by the SBA as of June 30, 2023, covering all loans over $150K. The original pipeline process used the dataset provided by the SBA on June 21, 2021. Both datasets are processed within the data repositories and are identified by matching labels in the notebook script names. As updated datasets became available, the versions were compared against each other.

insert dictionary doc

NAICS Files

The pipeline integrates the latest publicly available NAICS code datasets (as of 2022) to pair each loan with its Industry, Industry Sub-sector, and Long Name.
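The pairing described above can be sketched as a left join between the loan records and a NAICS lookup table. This is a minimal illustration, not the pipeline's actual code: the column names (`LoanNumber`, `NAICSCode`, `LongName`, `SubsectorCode`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical loan records; column names are assumed for illustration.
# Codes are kept as strings so prefix slicing works.
loans = pd.DataFrame({
    "LoanNumber": [1001, 1002, 1003],
    "NAICSCode": ["722511", "236220", "999999"],
})

# Hypothetical NAICS lookup keyed by 6-digit code.
naics = pd.DataFrame({
    "NAICSCode": ["722511", "236220"],
    "LongName": [
        "Full-Service Restaurants",
        "Commercial and Institutional Building Construction",
    ],
})

# The first three digits of a NAICS code identify the sub-sector.
loans["SubsectorCode"] = loans["NAICSCode"].str[:3]

# A left join keeps every loan, even when its code has no match in the lookup.
# validate= raises if the lookup contains duplicate codes.
paired = loans.merge(naics, on="NAICSCode", how="left", validate="many_to_one")
print(paired)
```

Loans whose code is missing from the lookup keep a null `LongName`, which makes unmatched codes easy to count and review.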

insert 2 DataFrames

SBA Office Labels

The pipeline uses the portal to obtain SBA District Office names, which serve as labels for the SBA Office Codes contained in the raw data source.

Data Resources

U.S. Bureau of Labor Statistics

Used to reference standard ranges of company size (number of employees).
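As a rough sketch of how an employee count might be assigned to a size range, the snippet below bins a number into labeled classes. The bounds and labels are illustrative assumptions, not the ranges the pipeline actually uses, and `size_class` is a hypothetical helper.

```python
from bisect import bisect_right

# Illustrative lower bounds for each size class (an assumption for this
# sketch, not the pipeline's actual ranges).
LOWER_BOUNDS = [1, 5, 10, 20, 50, 100, 250, 500]
LABELS = ["1-4", "5-9", "10-19", "20-49", "50-99", "100-249", "250-499", "500+"]


def size_class(employees: int) -> str:
    """Return the label of the size range containing `employees`."""
    if employees < 1:
        return "unknown"
    # bisect_right counts how many lower bounds are <= employees;
    # subtracting 1 gives the index of the matching class.
    return LABELS[bisect_right(LOWER_BOUNDS, employees) - 1]


for count in (0, 4, 5, 499, 500):
    print(count, size_class(count))
```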

Geospatial Data Sources

Google Maps Geocoder

Each state is processed in batches of up to 40k addresses, using the Google Maps geocoding API to obtain coordinates for each business address.
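One way to respect a per-run limit like this is to split a state's addresses into fixed-size batches before sending them to the geocoder. The sketch below shows only the batching step. `batched` and `BATCH_LIMIT` are hypothetical names, and the actual geocoding call is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_LIMIT = 40_000  # assumed to match the 40k limit described above


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Placeholder addresses standing in for one state's records.
addresses = [f"address {i}" for i in range(100_001)]
batch_sizes = [len(b) for b in batched(addresses, BATCH_LIMIT)]
print(batch_sizes)
```

Each batch can then be geocoded in its own run, which also makes it straightforward to resume after an interruption.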

U.S. Census Geocoder Coordinates

Used to obtain FIPS codes for previously obtained Latitude and Longitude coordinates.
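A minimal sketch of a coordinate lookup against the Census Geocoder is shown below. It builds the request URL and reads the block GEOID from a response. The benchmark and vintage values, the assumed response layout, and the helper names `coordinates_url` and `block_fips` are illustrative, and the sample response is made up to show the shape only.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://geocoding.geo.census.gov/geocoder/geographies/coordinates"


def coordinates_url(lat: float, lon: float,
                    benchmark: str = "Public_AR_Current",
                    vintage: str = "Current_Current") -> str:
    """Build a lookup URL. Note that x is longitude and y is latitude."""
    params = {"x": lon, "y": lat, "benchmark": benchmark,
              "vintage": vintage, "format": "json"}
    return f"{BASE_URL}?{urlencode(params)}"


def block_fips(response: dict) -> Optional[str]:
    """Return the block GEOID from a parsed JSON response, if present.

    Assumes blocks are listed under a layer whose name contains
    "Census Blocks" (the exact layer name can vary by vintage).
    """
    geographies = response.get("result", {}).get("geographies", {})
    for layer, matches in geographies.items():
        if "Census Blocks" in layer and matches:
            return matches[0].get("GEOID")
    return None


# Made-up response fragment illustrating the assumed layout.
sample = {"result": {"geographies": {
    "2020 Census Blocks": [{"GEOID": "110010062021004"}]}}}

print(coordinates_url(38.8977, -77.0365))
print(block_fips(sample))
```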

U.S. Census Geocoder Address

Used to obtain FIPS codes as well as Latitude and Longitude coordinates from a one-line address when Google API limits are reached.
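The fallback described above might look like the following sketch. The geocoders are passed in as plain functions so the example runs without network access. `QuotaExceeded`, `one_line_address`, `geocode_with_fallback`, and both stub geocoders are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)
Geocoder = Callable[[str], Optional[Coordinates]]


class QuotaExceeded(Exception):
    """Hypothetical signal that the primary geocoder's limit was reached."""


def one_line_address(street: str, city: str, state: str, zip_code: str) -> str:
    """Join address parts into a single line, skipping empty parts."""
    parts = [street.strip(), city.strip(), f"{state} {zip_code}".strip()]
    return ", ".join(p for p in parts if p)


def geocode_with_fallback(address: str, primary: Geocoder,
                          fallback: Geocoder) -> Optional[Coordinates]:
    """Try the primary geocoder; use the fallback once its limit is hit."""
    try:
        return primary(address)
    except QuotaExceeded:
        return fallback(address)


# Stub geocoders standing in for the real services.
def over_quota(address: str) -> Optional[Coordinates]:
    raise QuotaExceeded(address)


def stub_census(address: str) -> Optional[Coordinates]:
    return (38.8977, -77.0365)


line = one_line_address("1600 Pennsylvania Ave NW", "Washington", "DC", "20500")
print(line)
print(geocode_with_fallback(line, over_quota, stub_census))
```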

TIGERLINE_Shapefiles

Used for pairing shapefiles (State, County, Block Group, and Block) with FIPS Codes to define geospatial map bounds.
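Pairing across these levels relies on the nested structure of Census GEOIDs: a 15-digit block code contains the state (2 digits), county (3), tract (6), and block (4) codes, and the block group is identified by the first digit of the block code. The sketch below derives the join key for each level from one block GEOID. `split_block_geoid` is a hypothetical helper and the sample code is only an example value.

```python
def split_block_geoid(geoid: str) -> dict:
    """Derive the GEOID prefix for each geographic level from a block GEOID."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],         # 2-digit state FIPS
        "county": geoid[:5],        # state + 3-digit county
        "tract": geoid[:11],        # county + 6-digit tract
        "block_group": geoid[:12],  # tract + first digit of the block code
        "block": geoid,             # full 15-digit code
    }


keys = split_block_geoid("060371234561001")
for level, code in keys.items():
    print(level, code)
```

Keeping the codes as strings preserves leading zeros, which matters for states such as California (`06`).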

Geospatial Resources

Mapshaper

Used for reducing the file size of TIGERLINE_Shapefiles.