
Pipeline Architecture


Overview



Data Ingestion

The pipeline begins with data ingestion, where raw PPP Loan Program data is collected from official sources or relevant datasets. This data is provided in various formats such as CSV, Excel, or JSON. An ingestion component retrieves the data and prepares it for further processing.
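
As a rough illustration, an ingestion component might dispatch on file extension when loading a source file into a DataFrame. This is a minimal sketch, not the pipeline's actual ingestion code; the helper name and file name are hypothetical.

```python
import pandas as pd

def load_ppp_data(path: str) -> pd.DataFrame:
    """Load raw PPP loan data from CSV, Excel, or JSON into a DataFrame."""
    if path.endswith(".csv"):
        return pd.read_csv(path, dtype=str)   # keep raw values as strings
    if path.endswith((".xls", ".xlsx")):
        return pd.read_excel(path, dtype=str)
    if path.endswith(".json"):
        return pd.read_json(path)
    raise ValueError(f"Unsupported file format: {path}")

raw = load_ppp_data("public_150k_plus_2023.csv")  # illustrative file name
```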

Data Transformation

In this stage, the raw data undergoes transformation and cleaning. Data quality is enforced by handling missing values, correcting errors, and standardizing formats, so that the data is accurate and consistent for downstream analysis.
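
A minimal sketch of this stage is below, assuming SBA-style column names such as BorrowerAddress and CurrentApprovalAmount; the exact columns and cleaning rules in the real pipeline may differ.

```python
import pandas as pd

def clean_ppp_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Standardize formats: trim whitespace and upper-case address fields
    for col in ("BorrowerAddress", "BorrowerCity", "BorrowerState"):
        if col in df.columns:
            df[col] = df[col].str.strip().str.upper()
    # Correct errors: coerce loan amounts to numeric; bad values become NaN
    df["CurrentApprovalAmount"] = pd.to_numeric(
        df["CurrentApprovalAmount"], errors="coerce")
    # Handle missing values: drop rows lacking fields needed downstream
    return df.dropna(subset=["BorrowerAddress", "CurrentApprovalAmount"])
```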

Geospatial Enrichment

The pipeline incorporates geocoding techniques to convert business addresses from the PPP Loan data into geographic coordinates (latitude and longitude). This enrichment allows businesses to be accurately positioned on a map for geospatial analysis.
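
The wiki does not name the geocoding service used; as an assumed example, the same enrichment could be sketched with geopy's Nominatim geocoder, rate-limited to respect the service's usage policy.

```python
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="ppp-pipeline-example")  # hypothetical agent
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def to_coords(address: str):
    """Return (latitude, longitude) for an address, or (None, None) on a miss."""
    location = geocode(address)
    return (location.latitude, location.longitude) if location else (None, None)

# 'full_address' is an assumed column combining street, city, state, and ZIP
df[["latitude", "longitude"]] = df["full_address"].apply(
    lambda a: pd.Series(to_coords(a)))
```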

Geospatial Processing

Geospatial processing is a critical linking step in which the enriched data is FIPS-encoded: each record's coordinates are matched to the geographic areas that contain them. FIPS codes are the standardized identifiers assigned to administrative and statistical divisions, so attaching them to each record allows geographic data to be identified consistently and, in turn, enables streamlined merging of sources that reference the same geographic regions.
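
One common way to implement FIPS encoding is a point-in-polygon join against Census TIGER/Line boundaries; the sketch below assumes county-level codes, a TIGER county shapefile, and the geocoded latitude/longitude columns from the previous step.

```python
import geopandas as gpd

# Build point geometries from the geocoded coordinates (WGS 84)
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",
)

# TIGER/Line county polygons carry a GEOID (state FIPS + county FIPS)
counties = gpd.read_file("tl_2023_us_county.shp").to_crs("EPSG:4326")

# Each point inherits the FIPS code of the county that contains it
encoded = gpd.sjoin(
    points, counties[["GEOID", "geometry"]], how="left", predicate="within"
).rename(columns={"GEOID": "county_fips"})
```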

Geospatial Aggregation

Geospatial aggregation is a critical component of the late stage of the pipeline that prepares the enriched data for spatial analysis. Techniques such as coordinate transformations, spatial joins, and spatial aggregations are applied using shapefiles and the GeoPandas library. This step creates the relationships needed to identify geographic patterns and trends within the data.
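
Continuing from the previous sketch, the FIPS-encoded records could be grouped per county and re-joined to the county polygons. The LoanNumber and CurrentApprovalAmount columns and the equal-area projection are assumptions for illustration, not confirmed details of the pipeline.

```python
# Aggregate loan counts and totals per county (column names are assumed)
per_county = (
    encoded.groupby("county_fips")
    .agg(loan_count=("LoanNumber", "count"),
         total_amount=("CurrentApprovalAmount", "sum"))
    .reset_index()
)

# Attach county geometry so the aggregates can be mapped
shapes = counties[["GEOID", "geometry"]].rename(columns={"GEOID": "county_fips"})
aggregated = shapes.merge(per_county, on="county_fips", how="left")

# Example coordinate transformation to an equal-area CRS before spatial analysis
aggregated = aggregated.to_crs("EPSG:5070")
aggregated.to_file("county_aggregates.shp")  # consumed by the database step
```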

Database Creation

A geospatially enabled database is created using AWS S3 for storage and a CloudFront CDN for delivery. This structure optimizes the storage and retrieval of geospatial data, and the cleaned and enriched data is stored here for efficient querying and analysis.
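
A minimal sketch of loading processed output into that store is below, using boto3. The bucket name and key layout are illustrative, and the actual upload, parameter-store, and CDN configuration steps are not public.

```python
import boto3

s3 = boto3.client("s3")
bucket = "ppp-geodata-example"  # hypothetical bucket name

# A shapefile is several sidecar files; all must be uploaded together
for part in ("county_aggregates.shp", "county_aggregates.shx",
             "county_aggregates.dbf", "county_aggregates.prj"):
    s3.upload_file(part, bucket, f"aggregations/2023/{part}")

# Consumers then fetch objects through the CloudFront distribution that
# fronts the bucket, rather than reading from S3 directly.
```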

Notebook Process


Pipeline Process Repositories Overview

Note: Full notebook process scripts are not public.


Logging Status and Tracking Changes

The following links contain example templates used for tracking changes, updates, and other data processes within the pipeline:

Raw Data - Columns, Descriptions, Data Type, Null Counts (2023)

Transformations - Added/Removed Columns, Data Type Changes, Null Updates, Hash Field Inclusions (2023)

File Size Tracking by State - Shapefiles, Aggregations, Points (2021, 2023)

Full Processing by State - Geocoordinates Null, FIPS-encoded Null, Hashed, Shapefile, Aggregations, S3 Upload, Parameter Stored (2021, 2023)