Pipeline Architecture
Data Ingestion
The pipeline begins with data ingestion, where raw PPP Loan Program data is collected from official sources or relevant datasets. This data is provided in various formats such as CSV, Excel, or JSON. An ingestion component retrieves the data and prepares it for further processing.
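A minimal ingestion sketch, assuming a loader that dispatches on file extension (the function name and interface are illustrative, not the project's actual code):

```python
from pathlib import Path

import pandas as pd


def ingest(path: str) -> pd.DataFrame:
    """Load a raw PPP data file into a DataFrame, dispatching on extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xls", ".xlsx"):
        return pd.read_excel(path)
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported raw data format: {suffix}")
```

A single entry point like this keeps the rest of the pipeline agnostic to which raw format a given source delivered.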
Data Transformation
In this stage, the raw data is transformed and cleaned: missing values are handled, errors are corrected, and formats are standardized so that the data is accurate and consistent for downstream analysis.
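A cleaning pass along these lines might look as follows. The column names echo the public PPP FOIA schema but are assumptions here, not the pipeline's actual fields:

```python
import pandas as pd


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardize raw PPP records (column names are illustrative)."""
    df = raw.copy()
    # Standardize formats: trim whitespace and upper-case borrower names.
    df["BorrowerName"] = df["BorrowerName"].str.strip().str.upper()
    # Correct errors: coerce non-numeric loan amounts to NaN.
    df["InitialApprovalAmount"] = pd.to_numeric(
        df["InitialApprovalAmount"], errors="coerce"
    )
    # Handle missing values: drop rows lacking the fields needed downstream.
    df = df.dropna(subset=["BorrowerName", "InitialApprovalAmount"])
    # Standardize ZIP codes to 5 digits for later geographic matching.
    df["BorrowerZip"] = df["BorrowerZip"].astype(str).str.slice(0, 5).str.zfill(5)
    return df.reset_index(drop=True)
```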
Geospatial Enrichment
The pipeline incorporates geocoding techniques to convert business addresses from the PPP Loan data into geographic coordinates (latitude and longitude). This enrichment allows businesses to be accurately positioned on a map for geospatial analysis.
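The enrichment step can be sketched with a pluggable geocoder. In production the callable would wrap a real geocoding service (e.g., the Census geocoder or a commercial API); the stub used in testing here is purely hypothetical:

```python
from typing import Callable, Optional, Tuple

import pandas as pd

# Maps a full address string to (lat, lon), or None when geocoding fails.
Geocoder = Callable[[str], Optional[Tuple[float, float]]]


def enrich_with_coordinates(df: pd.DataFrame, geocode: Geocoder) -> pd.DataFrame:
    """Attach Latitude/Longitude columns by geocoding each address.

    The `Address` column name is illustrative; in production `geocode`
    would wrap an external geocoding service.
    """
    out = df.copy()
    results = [geocode(addr) or (None, None) for addr in out["Address"]]
    out["Latitude"] = [lat for lat, _ in results]
    out["Longitude"] = [lon for _, lon in results]
    return out
```

Keeping the geocoder behind a simple callable makes it easy to swap services or batch-geocode without touching the rest of the pipeline.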
Geospatial Processing
Geospatial processing is a critical linking step in which enriched records are FIPS encoded: each record's coordinates are used to identify the geographic areas it belongs to. These standardized codes are assigned to administrative and statistical divisions, providing consistent identification of geographic data and enabling streamlined merging of sources that reference the same regions.
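In practice FIPS encoding is a point-in-polygon test against county boundary geometries. A minimal sketch of the lookup, substituting simplified rectangular bounds for real shapefile polygons (the bounds and codes below are rough illustrations, not the pipeline's actual data):

```python
from typing import Dict, Optional, Tuple

# (min_lat, min_lon, max_lat, max_lon) stand-ins for county polygons;
# the actual pipeline would test points against shapefile geometries.
Bounds = Tuple[float, float, float, float]


def fips_for_point(
    lat: float, lon: float, counties: Dict[str, Bounds]
) -> Optional[str]:
    """Return the 5-digit county FIPS code whose (simplified) bounds
    contain the point, or None if no county matches."""
    for fips, (la0, lo0, la1, lo1) in counties.items():
        if la0 <= lat <= la1 and lo0 <= lon <= lo1:
            return fips
    return None
```

Once every record carries a FIPS code, joins against other county-keyed datasets reduce to plain equality merges.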
Geospatial Aggregation
Geospatial aggregation is a critical end-stage component of the pipeline that prepares enriched data for spatial analysis. Techniques such as coordinate transformations, spatial joins, and spatial aggregations are applied using shapefiles and the GeoPandas library. This step creates the relationships needed to identify geographic patterns and trends in the data.
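A GeoPandas sketch of the spatial join plus aggregation, assuming illustrative column names (`region_id`, `LoanAmount`) rather than the project's actual schema:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def aggregate_by_region(
    loans: pd.DataFrame, regions: gpd.GeoDataFrame
) -> pd.DataFrame:
    """Spatially join loan points to region polygons and aggregate amounts.

    `regions` carries polygon geometries (e.g., loaded from a shapefile via
    gpd.read_file) plus a `region_id` column; `loans` carries Longitude,
    Latitude, and LoanAmount columns.
    """
    points = gpd.GeoDataFrame(
        loans,
        geometry=[
            Point(lon, lat)
            for lon, lat in zip(loans["Longitude"], loans["Latitude"])
        ],
        crs="EPSG:4326",
    )
    # Coordinate transformation keeps both layers in the same CRS,
    # then each point is matched to the polygon that contains it.
    joined = gpd.sjoin(points, regions.to_crs("EPSG:4326"), predicate="within")
    return (
        joined.groupby("region_id")["LoanAmount"]
        .agg(loan_count="count", loan_total="sum")
        .reset_index()
    )
```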
Database Creation
A geospatially enabled database is created using AWS S3 for storage and a CloudFront CDN for delivery. This structure optimizes the storage and retrieval of geospatial data; the cleaned and enriched data is stored here for efficient querying and analysis.
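A sketch of storing processed layers under a deterministic S3 key layout that CloudFront can serve from. The bucket layout and file format are hypothetical, and boto3 is imported lazily so the key logic runs without AWS credentials:

```python
import io


def s3_key_for(year: int, state: str, layer: str) -> str:
    """Build a deterministic S3 key for a processed layer.

    The ppp/<year>/<state>/<layer> layout is hypothetical, not the
    project's actual bucket structure.
    """
    return f"ppp/{year}/{state.upper()}/{layer}.geojson"


def upload_layer(
    geojson_text: str, bucket: str, year: int, state: str, layer: str
) -> str:
    """Upload one processed layer to S3 and return its key."""
    import boto3  # lazy import; running this requires AWS credentials

    key = s3_key_for(year, state, layer)
    boto3.client("s3").upload_fileobj(
        io.BytesIO(geojson_text.encode("utf-8")), bucket, key
    )
    return key
```

A predictable key scheme like this lets downstream consumers (and the CDN) address any state/year/layer combination without a separate index.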
Note: Full notebook process scripts are not public.
The following links contain example templates used for tracking changes, updates, and other data processes within the pipeline:
Raw Data - Columns, Descriptions, Data Type, Null Counts 2023
Transformations - Added/Removed Columns, Data Type Changes, Null Updates, Hash Field Inclusions 2023
File Size Tracking by State - Shapefiles, Aggregations, Points 2021 2023
Full Processing by State - Geocoordinates Null, FIPS encoded Null, Hashed, Shapefile, Aggregations, S3 Upload, Parameter Stored 2021 2023