Skip to content

kdayno/sf-crime-stats

Repository files navigation

image

Table of Contents


1. Overview

1.1. Problem Statement

San Francisco is a vibrant and diverse city, but like any major urban area, it faces ongoing challenges related to public safety and crime. In recent years, it has become the center of attention due to its homelessness and drug problem which has been widely covered in the mainstream media. For someone planning to move to the city, this can be intimdating and even frightening.

1.2. Purpose

The San Francisco Crime Stats data pipeline and dashboard addresses this problem by ingesting, transforming, and visualizing the San Francisco Police Department's (SFPD) incident reports data set. The SFPD publishes detailed, regularly updated incident data through the city's open data portal (DataSF). This empowers a prospective resident to dig into the data and make an informed decision for themselves on if the problems related to public safety are understated or overexaggerated by the media.

1.3. Data Sources


2. Design & Development

2.1. Solution Architecture

image

2.2. Tech Stack

  • Storage: GCP Buckets, Parquet
  • Data Processing: BigQuery, Python, Polars, dbt Core
  • Data Visualization: Preset Cloud/Superset
  • Orchestration: Mage AI
  • DevOps: Terraform, Docker, GitHub, GitHub Actions

2.3. Data Model - Star Schema

image

2.4. dbt Data Lineage

image

2.5. Orchestration

image

"ingest_sfpt_incidents_daily" Pipeline

ingest_sfpt_incidents_daily_pipeline

"etl_gcs_to_bigquery" Pipeline

etl_gcs_to_bigquery_pipeline

"build_dbt_all_models" Pipeline

build_dbt_all_models_pipeline

2.6. Dashboard

image

dashboard-demo.mp4

3. Deployment

  • All infrastructure is deployed via Terraform scripts
  • A custom docker mage-ai docker image is pulled from Docker Hub and deployed to GCP Cloud Run via Terraform scripts

4. Challenges

  1. Terraform Deployment Permissions: It was difficult identifying the correct roles that were required by the service account to create all infrastructure and deploy the docker image to Cloud Run seamlessly. This was only partially defined in the mage-ai documentation and required extensive troubleshooting to resolve missing roles.

5. Future Enhancements

  1. Additional Data Sources: The open data portal (DataSF) offers many other datasets which could be used to provide deeper analysis. For example, the Registered Business Locations - San Francisco dataset could be integrated into the solution to analyze how crime has impacted businesses in the city over time.

6. Project Structure

sf-crime-stats
|
├── docs
│   ├── deployment                 
│   └── images                     -> Images for README
├── .github
│   └── workflows                  -> GitHub Actions
├── Dockerfile                     -> Builds custom docker mage-ai image for GCP deployment
├── docker-compose.yml             -> Used for local development
├── README.md
├── mage
│   └── sf-crime-stats-mage        -> Mage project
│       ├── data_exporters
│       ├── data_loaders
│       ├── dbts
│       ├── markdowns
│       ├── pipelines
│       └── transformers
├── dbt
│   └── sf_crime_stats.             -> dbt project
│       ├── dbt_project.yml
│       ├── macros
│       └── models
│           ├── core
│           ├── marts
│           └── staging
└── terraform                       -> Terraform project
    ├── envs
    │   ├── dev
    │   │   ├── keys
    │   │   ├── main.tf
    │   │   ├── terraform.tfvars
    │   │   └── variables.tf
    │   └── prod
    │       ├── keys
    │       ├── main.tf
    │       ├── terraform.tfvars
    │       └── variables.tf
    └── modules
        ├── dbtcloud
        │   ├── main.tf
        │   └── variables.tf
        └── gcp
            ├── db.tf
            ├── fs.tf
            ├── load_balancer.tf
            ├── main.tf
            └── variables.tf

7. References

Mage AI Docs:

Preset BI Docs:

8. Acknowledgements

This project was built as the capstone for the Data Engineering Zoomcamp - 2024 Cohort.

About

San Francisco crime statistics, analyzing trends over time and identifying location-based patterns

Topics

Resources

License

Stars

Watchers

Forks