This playground is an ideal resource for SQL technical interviews and database skill development. It features real-world production data (3-5 million records per month) and comprehensive technical interview questions spanning mid-level to senior database engineering roles.
- 15 detailed interview questions with full answers covering data modeling, ETL, performance optimization, complex analytics, geospatial analysis, data quality, and system architecture
- Production-ready scenarios based on real NYC taxi data challenges
- Multi-level difficulty suitable for mid to senior SQL developers
- Complete documentation: `docs/interviews/sql-interview-questions.md`
- Data Modeling & Schema Design: Normalization, dimensional modeling, partitioning strategies
- Data Ingestion & ETL: Duplicate prevention, pipeline design, error handling
- Performance & Optimization: Index strategies, query optimization, execution plans
- Complex Queries & Analytics: Window functions, time series analysis, business intelligence
- Geospatial & PostGIS: Spatial analysis, geometric operations, location-based queries
- Data Quality & Integrity: Data validation, anomaly detection, cleaning strategies
- System Architecture & Scalability: High availability, monitoring, production deployment
- Real Production Data: authentic NYC taxi records, not synthetic datasets
- Complex Schema: 21-column fact table with geospatial dimensions and lookup tables
- Production Challenges: Data quality issues, performance optimization, scale considerations
- Complete ETL Pipeline: Hash-based duplicate prevention, chunked processing, error recovery
- Advanced Features: PostGIS spatial analysis, time-series data, multi-borough analytics
- Business Intelligence: Apache Superset for creating professional dashboards and data visualizations
- Data Storytelling: Transform SQL queries into compelling visual narratives and interactive reports
A production-ready, Docker-based SQL playground featuring PostgreSQL 17 + PostGIS 3.5, PGAdmin, and Apache Superset for business intelligence, loaded with authentic NYC Yellow Taxi trip data. Query millions of records through interactive dashboards, SQL Lab, and custom visualizations, while an automated data backfill system loads multiple months (3-5 million records per month) with single-command deployment.
- PostgreSQL 17 + PostGIS 3.5 with geospatial support and custom Python environment
- PGAdmin 4 (latest) for web-based database management and query execution
- Apache Superset - modern business intelligence platform with interactive dashboards, SQL Lab, and advanced visualizations
- Visual Analytics - Create charts, graphs, and dashboards from millions of NYC taxi records
- SQL Lab Integration - Write complex queries with autocomplete and export results directly from Superset
- Interactive Dashboards - Cross-filtering, drill-down capabilities, and real-time data exploration
- Automated Backfill System - Download and load multiple months of authentic NYC taxi data
- Flexible Data Loading - Load specific months, last 6/12 months, or all available data (2009-2025)
- Complete Geospatial Data - 263 official NYC taxi zones with polygon boundaries (auto-downloaded)
- Unified Data Management - Single location for all data with automatic downloads from official sources
- Dictionary Table Cleaning - Fresh reference data loaded with each backfill for consistency
- Production-Scale Performance - Optimized schema with spatial indexes for big data analytics
- Memory-Efficient Loading - Chunked processing handles large datasets safely
- Persistent Logging - Organized logs by configuration with full traceability
- Hash-Based Duplicate Prevention - Ultimate protection against duplicate data with SHA-256 row hashing
- Star Schema Support - Dimensional modeling with fact and dimension tables for advanced analytics
- Historical Data Support - Complete NYC TLC data coverage from 2009 to present
- Quick Start
- Data Pipeline Architecture
- Architecture & Data Loading
- Project Structure
- Container Architecture
- Pause and Resume Capability
- Configuration & Development
- Database Schema & Analytics
- Apache Superset Business Intelligence Features
- Data Sources & Authenticity
- Sample Analytics Queries
1. Copy the environment configuration:

   ```bash
   cp env .env
   ```

2. Start the complete SQL playground:

   ```bash
   docker-compose up -d --build
   ```

   ⏱️ The default configuration loads the last 12 months of data (configure via the `.env` file). First startup takes time depending on the data volume selected.

3. Configure the data backfill (optional): edit the `.env` file to customize data loading:

   ```bash
   # Load last 12 months (default)
   BACKFILL_MONTHS=last_12_months

   # Load specific months
   BACKFILL_MONTHS=2024-01,2024-02,2024-03

   # Load all available data (2009-2025, WARNING: Very large!)
   BACKFILL_MONTHS=all
   ```
4. Monitor the initialization progress:

   ```bash
   docker logs -f sql-playground-postgres
   ```

   You'll see real-time progress updates as the system:
   - Creates the PostGIS-enabled database schema
   - Downloads and loads 263 NYC taxi zones with geospatial boundaries
   - Downloads and processes trip data in chunks (100K rows each by default, via `DATA_CHUNK_SIZE`)
   - Performs data integrity verification
5. Access the PGAdmin web interface:
   - URL: http://localhost:8080
   - Login: admin@admin.com / admin123

6. Connect to PostgreSQL in PGAdmin:
   - Host: postgres
   - Port: 5432
   - Database: playground
   - Username: admin
   - Password: admin123
   - Schema: nyc_taxi
7. Access Apache Superset for advanced business intelligence:
   - URL: http://localhost:8088
   - Login: admin / admin123
   - Create Professional Dashboards: Build interactive visualizations from millions of taxi records
   - SQL Lab: Advanced SQL editor with autocomplete, query history, and result exports
   - Chart Gallery: 50+ visualization types including maps, time series, and statistical plots
   - Real-time Filtering: Cross-dashboard filtering and drill-down capabilities
   - Auto-connected: Pre-configured PostgreSQL connection to the playground database
   - Persistent: All dashboards and charts saved in the SQLite backend
   - Connection string: `postgresql+pg8000://admin:admin123@postgres:5432/playground`
```mermaid
flowchart TD
%% Data Sources
A[NYC TLC Data Sources] --> B[yellow_tripdata_*.parquet]
A --> C[taxi_zone_lookup.csv]
A --> D[taxi_zones.zip]
%% ETL Process
B --> E[Python ETL Process]
C --> E
D --> E
%% Normalized Schema
E --> F[(yellow_taxi_trips)]
E --> G[(yellow_taxi_trips_invalid)]
E --> H[(taxi_zone_lookup)]
E --> I[(taxi_zone_shapes)]
E --> J[(vendor_lookup)]
E --> K[(payment_type_lookup)]
E --> L[(rate_code_lookup)]
%% Star Schema Dimensions
H --> M[(dim_locations)]
J --> N[(dim_vendor)]
K --> O[(dim_payment_type)]
L --> P[(dim_rate_code)]
E --> Q[(dim_date)]
E --> R[(dim_time)]
%% Star Schema Fact Table
F --> S[(fact_taxi_trips)]
M --> S
N --> S
O --> S
P --> S
Q --> S
R --> S
%% Data Quality Monitoring
E --> T[(data_quality_monitor)]
E --> U[(data_processing_log)]
%% Table Details (Schema Information)
F -.-> F1["yellow_taxi_trips
---
row_hash VARCHAR(64) PK
vendorid INTEGER
tpep_pickup_datetime TIMESTAMP
tpep_dropoff_datetime TIMESTAMP
passenger_count DECIMAL
trip_distance DECIMAL
pulocationid INTEGER FK
dolocationid INTEGER FK
payment_type BIGINT FK
fare_amount DECIMAL
total_amount DECIMAL
+ 12 more columns"]
G -.-> G1["yellow_taxi_trips_invalid
---
invalid_id BIGSERIAL PK
error_message TEXT
error_type VARCHAR
source_file VARCHAR
+ all yellow_taxi_trips columns
raw_data_json JSONB"]
H -.-> H1["taxi_zone_lookup
---
locationid INTEGER PK
borough VARCHAR
zone VARCHAR
service_zone VARCHAR"]
I -.-> I1["taxi_zone_shapes
---
objectid INTEGER PK
locationid INTEGER FK
zone VARCHAR
borough VARCHAR
geometry GEOMETRY"]
M -.-> M1["dim_locations
---
location_key SERIAL PK
locationid INTEGER UNIQUE
zone VARCHAR
borough VARCHAR
zone_type VARCHAR
is_airport BOOLEAN
is_manhattan BOOLEAN
business_district BOOLEAN"]
N -.-> N1["dim_vendor
---
vendor_key SERIAL PK
vendorid INTEGER UNIQUE
vendor_name VARCHAR
vendor_type VARCHAR
is_active BOOLEAN"]
O -.-> O1["dim_payment_type
---
payment_type_key SERIAL PK
payment_type INTEGER UNIQUE
payment_type_desc VARCHAR
is_electronic BOOLEAN
allows_tips BOOLEAN"]
P -.-> P1["dim_rate_code
---
rate_code_key SERIAL PK
ratecodeid INTEGER UNIQUE
rate_code_desc VARCHAR
is_metered BOOLEAN
is_airport_rate BOOLEAN"]
Q -.-> Q1["dim_date
---
date_key INTEGER PK
date_actual DATE
year INTEGER
month INTEGER
day INTEGER
quarter INTEGER
is_weekend BOOLEAN
is_holiday BOOLEAN"]
R -.-> R1["dim_time
---
time_key INTEGER PK
hour_24 INTEGER
hour_12 INTEGER
is_rush_hour BOOLEAN
is_business_hours BOOLEAN
time_period VARCHAR"]
S -.-> S1["fact_taxi_trips
---
trip_key BIGSERIAL PK
pickup_date_key INTEGER FK
pickup_time_key INTEGER FK
dropoff_date_key INTEGER FK
dropoff_time_key INTEGER FK
pickup_location_key INTEGER FK
dropoff_location_key INTEGER FK
vendor_key INTEGER FK
payment_type_key INTEGER FK
rate_code_key INTEGER FK
trip_distance DECIMAL
trip_duration_minutes INTEGER
fare_amount DECIMAL
tip_amount DECIMAL
total_amount DECIMAL
+ calculated measures"]
T -.-> T1["data_quality_monitor
---
monitor_id BIGSERIAL PK
monitored_at TIMESTAMP
source_file VARCHAR
target_table VARCHAR
rows_attempted BIGINT
rows_inserted BIGINT
rows_duplicates BIGINT
rows_invalid BIGINT
quality_level VARCHAR
processing_duration_ms BIGINT"]
U -.-> U1["data_processing_log
---
log_id BIGSERIAL PK
data_year INTEGER
data_month INTEGER
status VARCHAR
processing_started_at TIMESTAMP
processing_completed_at TIMESTAMP
total_records_loaded BIGINT
source_file_path VARCHAR"]
%% Styling
classDef source fill:#e1f5fe
classDef normalized fill:#f3e5f5
classDef star fill:#e8f5e8
classDef quality fill:#fff3e0
classDef details fill:#f5f5f5,stroke:#999,stroke-dasharray: 5 5
class A,B,C,D source
class F,G,H,I,J,K,L normalized
class M,N,O,P,Q,R,S star
class T,U quality
class F1,G1,H1,I1,M1,N1,O1,P1,Q1,R1,S1,T1,U1 details
```
- Blue (Source): NYC TLC official data sources
- Purple (Normalized): OLTP-style normalized tables for data integrity
- Green (Star Schema): OLAP-style dimensional model for analytics
- Orange (Quality): Data quality monitoring and processing logs
- Solid Lines: Data transformation flow
- Dotted Lines: Table schema details with PK/FK indicators
- yellow_taxi_trips: Main fact table with a hash-based primary key for duplicate prevention
- Foreign Key Relationships:
  - `pulocationid`/`dolocationid` → `taxi_zone_lookup.locationid`
  - `vendorid` → `vendor_lookup.vendorid`
  - `payment_type` → `payment_type_lookup.payment_type`
  - `ratecodeid` → `rate_code_lookup.ratecodeid`
- Star Schema: Dimensional model with `fact_taxi_trips` as the central fact table (see the sample query below)
- Data Quality: Comprehensive monitoring with invalid row tracking and metrics
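For example, a typical OLAP query aggregates measures from `fact_taxi_trips` across dimension attributes. A sketch using the tables and columns from the diagram above (assuming the star-schema tables live in the same `nyc_taxi` schema as the trip table):

```sql
-- Weekend vs. weekday revenue by pickup borough, via the star schema
SELECT
    dl.borough,
    dd.is_weekend,
    COUNT(*) AS trips,
    ROUND(AVG(f.fare_amount), 2) AS avg_fare,
    ROUND(SUM(f.total_amount), 2) AS total_revenue
FROM nyc_taxi.fact_taxi_trips f
JOIN nyc_taxi.dim_date dd ON f.pickup_date_key = dd.date_key
JOIN nyc_taxi.dim_locations dl ON f.pickup_location_key = dl.location_key
GROUP BY dl.borough, dd.is_weekend
ORDER BY total_revenue DESC;
```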
The system features a flexible backfill system that automatically downloads and loads data from official NYC TLC sources:
Data Sources:
- Trip Data: NYC Yellow Taxi records (2009-2025, 3-5 million records per month)
- Zone Data: 263 official NYC TLC taxi zones with lookup table and PostGIS geometries
- Reference Data: Vendors, payment types, rate codes with proper relationships
The `postgres/docker/init-data.py` script orchestrates the entire process:
- PostgreSQL Readiness: Waits for the database to be fully operational
- Idempotent Schema Creation: Executes SQL scripts with `IF NOT EXISTS` clauses for safe restarts
  - Schema Resilience: All CREATE TABLE and CREATE INDEX statements use `IF NOT EXISTS`
  - Data Conflict Resolution: INSERT statements use `ON CONFLICT ... DO UPDATE` for data consistency
  - PostGIS Extensions: Spatial extensions enabled with proper error handling
- Dictionary Table Cleaning: Cleans all reference tables for fresh data loading
- Zone Data Processing:
  - Downloads the CSV lookup table and shapefile ZIP from official sources
  - Extracts the shapefiles and loads them with NULL-value cleanup (263 valid zones)
  - Processes geometries with CRS conversion to NYC State Plane (EPSG:2263)
  - Reloads all lookup tables (rate codes, payment types, vendors)
- Optimized Trip Data Backfill:
  - Downloads parquet files for the configured months automatically (2009-2025 coverage)
  - Converts column names to lowercase for schema compatibility
  - Handles missing columns (e.g., cbd_congestion_fee in older data)
  - Handles numeric precision issues and NULL values
- Performance-Optimized Loading: 100K-row chunks with dimension caching and vectorized operations
- Bulk Transaction Processing: Pandas `to_sql()` with `method='multi'` for maximum performance
- Enhanced Error Handling: Bulk validation with comprehensive error classification
- Star Schema Population: Optimized fact table loading with cached dimension lookups
- Data Verification: Performs integrity checks with sample analytical queries
- Zero Manual Data Management: All data downloaded automatically from official sources
- Flexible Backfill: Load specific months, last N months, or all available data
- Clean State Management: Dictionary tables refreshed with each backfill for consistency
- Production-Ready: Handles real-world data challenges (case sensitivity, precision, memory)
- Idempotent Operations: Safe container restarts with `IF NOT EXISTS` schema creation
- Error Recovery: `docker-compose down -v` provides a clean slate for troubleshooting
- Persistent Storage: Data persists between container restarts via Docker volumes
- Organized Logging: Logs organized by backfill configuration with full traceability
- Resumable Processing: Safe pause/resume with automatic continuation from the interruption point
- High-Performance Processing: 67-100x faster data loading with optimized algorithms
- Problem Solved: Container restart failures due to "relation already exists" errors
- Solution: Added `IF NOT EXISTS` to all 18 CREATE TABLE and 26 CREATE INDEX statements
- Data Handling: Updated to `ON CONFLICT ... DO UPDATE` for consistent data overwriting (see the sketch below)
- Benefit: Zero-downtime container restarts and reliable deployment
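A minimal sketch of the idempotent pattern (illustrative only; the real statements live in `postgres/sql-scripts/init-scripts/01-nyc-taxi-schema.sql`, and the lookup-table column names and index name below are inferred from the schema diagram):

```sql
-- Safe to re-run on every container start
CREATE TABLE IF NOT EXISTS nyc_taxi.payment_type_lookup (
    payment_type      BIGINT PRIMARY KEY,
    payment_type_desc VARCHAR(50) NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_trips_pickup_ts
    ON nyc_taxi.yellow_taxi_trips (tpep_pickup_datetime);

-- Reference rows are overwritten rather than duplicated on reload
INSERT INTO nyc_taxi.payment_type_lookup (payment_type, payment_type_desc)
VALUES (1, 'Credit Card'), (2, 'Cash')
ON CONFLICT (payment_type) DO UPDATE
    SET payment_type_desc = EXCLUDED.payment_type_desc;
```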
- 67-100x Performance Improvement: Complete ETL pipeline optimization with Phase 1 and Phase 2A implementations
- Dimension Caching: Eliminated 18M individual SQL lookups with in-memory caching
- Vectorized Operations: Replaced row-by-row processing with pandas/NumPy vectorized calculations
- Bulk Transactions: Single bulk operations instead of 3.6M individual transactions
- Processing Time: Reduced from ~3.3 hours to ~22 minutes for 3.6M records
- Function Parameter Fix: Resolved `source_file` parameter passing through the processing chain
- Type Safety: Fixed an `isinstance()` error in the invalid-row storage functionality
- Bulk Validation: Comprehensive error classification with proper invalid-row handling
- Graceful Degradation: System continues processing despite individual row errors
```
sql-playgrounds/
├── docker-compose.yml                   # Multi-service configuration (PostgreSQL + PGAdmin + Superset)
├── .env                                 # Environment variables (credentials, ports, backfill config)
├── CLAUDE.md                            # Detailed architecture guide for Claude Code
├── postgres/                            # PostgreSQL-related files
│   ├── data/                            # NYC taxi data storage (auto-populated during initialization)
│   │   ├── zones/                       # NYC taxi zone reference data (auto-downloaded)
│   │   │   ├── taxi_zone_lookup.csv     # 263 official taxi zones
│   │   │   └── taxi_zones.* (shp/dbf/shx/prj/sbn/sbx)  # Extracted shapefiles
│   │   └── yellow/                      # NYC Yellow taxi trip data (auto-downloaded)
│   │       └── yellow_tripdata_*.parquet  # Trip data files based on backfill config
│   ├── logs/                            # PostgreSQL persistent logging
│   ├── sql-scripts/                     # SQL scripts
│   │   ├── init-scripts/                # Database schema creation (executed automatically)
│   │   │   ├── 00-postgis-setup.sql     # PostGIS extensions and spatial references
│   │   │   └── 01-nyc-taxi-schema.sql   # Complete NYC taxi schema (lowercase columns)
│   │   └── reports-scripts/             # Pre-built analytical queries (available in PGAdmin)
│   │       ├── nyc-taxi-analytics.sql   # Trip volume, financial, and temporal analysis
│   │       └── geospatial-taxi-analytics.sql  # PostGIS spatial queries and zone analysis
│   └── docker/                          # PostgreSQL Docker files
│       ├── Dockerfile.postgres          # Custom image: PostgreSQL + PostGIS + Python environment
│       └── init-data.py                 # Comprehensive initialization script with backfill system
└── superset/                            # Apache Superset business intelligence platform
    ├── config/                          # Superset configuration files
    │   └── superset_config.py           # SQLite-based config (no Redis dependency)
    ├── logs/                            # Superset application logs
    └── docker/                          # Superset Docker files
        ├── Dockerfile.superset          # Superset with PostgreSQL drivers
        ├── create-db-connection.py      # Database connection script
        └── init-superset.sh             # Superset initialization

# Note: Python dependencies embedded in Docker containers - no local setup required
```
- PostgreSQL container:
  - Base: `postgis/postgis:17-3.5` (PostgreSQL 17 + PostGIS 3.5)
  - Python Environment: Virtual environment with data processing packages
  - Custom Entrypoint: Starts PostgreSQL, waits for readiness, runs initialization
  - Embedded Dependencies: pandas, geopandas, pyarrow, psycopg2, sqlalchemy
- Superset container:
  - Base: `apache/superset:latest` with custom PostgreSQL drivers
  - Metadata Database: SQLite for persistent dashboard/chart storage
  - Caching Strategy: SQLite-based using SupersetMetastoreCache (no Redis required)
  - Auto-initialization: Pre-configured database connection and admin setup
  - Features: Interactive dashboards, SQL Lab, chart creation, native filters
- Database Persistence: `postgres_data` volume (survives container restarts)
- PGAdmin Configuration: `pgadmin_data` volume (settings, connections)
- Superset Configuration: `superset_data` volume (dashboards, charts, user settings)
- SQL Scripts: `./postgres/sql-scripts:/sql-scripts` (database schema and reports)
- NYC Taxi Data: `./postgres/data:/postgres/data` (auto-downloaded data files)
- PostgreSQL Persistent Logging: `./postgres/logs:/postgres/logs` (organized by backfill configuration)
- Superset Config Files: `./superset/config:/app/config` (SQLite-based setup files)
- Superset Logging: `./superset/logs:/app/logs` (application logs)
- Script Access: SQL scripts available both for initialization and for PGAdmin queries
- Optimized Chunked Loading: 100K rows per chunk (10x improvement) with memory-efficient processing
- Progress Tracking: Real-time logging with execution time tracking and performance metrics
- Advanced Bulk Operations: Dimension caching and vectorized calculations for 67-100x performance improvement
- Optimized Indexes: Spatial GIST, temporal, location, and composite indexes
- Production Scale: Handles millions of records efficiently with sub-30-minute processing times
- Ultimate Duplicate Prevention: SHA-256 hash-based system prevents any duplicate rows across backfills
- Enhanced Error Handling: Bulk validation with comprehensive error classification and invalid row storage
The system is designed to handle interruptions gracefully with automatic resume capability:
Pause Processing:

```bash
# Safely stop containers (preserves all data and progress)
docker-compose stop
# System can be safely interrupted at any point during backfill
```

Resume Processing:

```bash
# Automatically continues from where it left off
docker-compose up -d

# Monitor continuation progress
docker logs -f sql-playground-postgres
```
When restarted, the system automatically:
- Detects Completed Months: Checks the `data_processing_log` table for already-processed data
- Skips Finished Data: Already-loaded months show as "Already processed (X records)"
- Continues From Interruption: Resumes with the next unprocessed month
- Prevents Duplicates: The hash-based system prevents duplicate rows during resume
- Reuses Downloads: Existing parquet files are reused; no re-downloading needed
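To see exactly what the resume logic sees, you can query the processing log directly (the same table used by the backfill-progress command in the Configuration & Development section):

```sql
-- Which months are complete, and where will the backfill resume?
SELECT data_year,
       data_month,
       status,
       total_records_loaded,
       processing_completed_at
FROM nyc_taxi.data_processing_log
ORDER BY data_year, data_month;
```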
From the logs (`logs/last_12_months/log_20250924_215426.log`):

```
2024-10: Already processed (3,833,769 records) ✅ Skipped
2024-11: Already processed (3,646,369 records) ✅ Skipped
2024-12: Already processed (3,668,371 records) ✅ Skipped
2025-01: Already processed (3,475,226 records) ✅ Skipped
⚠️ 2025-02: Previous processing incomplete, will retry   <- resumes here
Loading yellow_tripdata_2025-02.parquet...
⚠️ Chunk 5/358 - 10000 duplicates skipped (hash-based)   <- duplicate protection
⚠️ Chunk 10/358 - 10000 duplicates skipped (hash-based)  <- working as expected
```
The hash-based duplicate prevention ensures that even if processing was interrupted mid-file, no duplicate records are inserted when resuming.
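The ETL computes row hashes in Python before loading, but the same pattern can be sketched in SQL. A minimal illustration, assuming the pgcrypto extension and a hypothetical `staging_trips` table (the real hash covers all 21 columns, not just the three shown):

```sql
-- Hash-based duplicate prevention, sketched with pgcrypto's digest()
CREATE EXTENSION IF NOT EXISTS pgcrypto;

INSERT INTO nyc_taxi.yellow_taxi_trips
    (row_hash, vendorid, tpep_pickup_datetime, fare_amount)
SELECT encode(digest(concat_ws('|', s.vendorid, s.tpep_pickup_datetime,
                               s.fare_amount), 'sha256'), 'hex'),
       s.vendorid, s.tpep_pickup_datetime, s.fare_amount
FROM staging_trips s
ON CONFLICT (row_hash) DO NOTHING;  -- duplicate rows are silently skipped
```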
```bash
# PostgreSQL Configuration
POSTGRES_DB=playground
POSTGRES_USER=admin
POSTGRES_PASSWORD=admin123
POSTGRES_PORT=5432

# PGAdmin Configuration
PGADMIN_EMAIL=admin@admin.com
PGADMIN_PASSWORD=admin123
PGADMIN_PORT=8080

# Apache Superset Configuration
SUPERSET_PORT=8088
SUPERSET_ADMIN_USER=admin
SUPERSET_ADMIN_EMAIL=admin@admin.com
SUPERSET_ADMIN_PASSWORD=admin123
SUPERSET_LOAD_EXAMPLES=false

# Data Loading Control
DATA_CHUNK_SIZE=100000
INIT_LOAD_ALL_DATA=true

# Backfill Configuration (controls which months to load)
BACKFILL_MONTHS=last_12_months
```
- Docker & Docker Compose (only requirement for deployment)
- Python 3.12+ with uv (optional, for local development only)
- Available Memory: 4GB+ recommended for data loading process
- Available Storage: ~2GB for complete dataset and indexes
```bash
# Full deployment (production-ready)
docker-compose up -d --build

# Development with clean rebuild (removes all data)
docker-compose down -v && docker-compose up -d --build

# Check data loading status
docker exec sql-playground-postgres psql -U admin -d playground -c "SELECT COUNT(*) FROM nyc_taxi.yellow_taxi_trips;"

# Check backfill progress
docker exec sql-playground-postgres psql -U admin -d playground -c "SELECT * FROM nyc_taxi.data_processing_log ORDER BY data_year, data_month;"

# Direct database access
docker exec -it sql-playground-postgres psql -U admin -d playground

# Monitor initialization progress
docker logs -f sql-playground-postgres

# Monitor persistent logs (organized by configuration)
tail -f logs/last_12_months/log_*.log
```
```bash
# Install dependencies for local scripts
uv sync

# Add new dependencies
uv add package-name
```
Note: Python environment is embedded in Docker container. Local Python setup only needed for custom script development.
Main Table: `nyc_taxi.yellow_taxi_trips` (21 columns; record count depends on the backfill configuration)
- Primary Key: `row_hash` (SHA-256 hash of all row values for duplicate prevention)
- Trip Identifiers: vendorid, pickup/dropoff timestamps
- Location Data: pulocationid, dolocationid (references taxi zones)
- Financial Data: fare_amount, tip_amount, total_amount, taxes, fees
- Trip Metrics: passenger_count, trip_distance, payment_type
- Missing-Column Handling: cbd_congestion_fee and airport_fee are handled gracefully for older data periods
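A condensed sketch of the table definition (types follow the schema diagram above; the full 21-column DDL lives in `01-nyc-taxi-schema.sql`):

```sql
CREATE TABLE IF NOT EXISTS nyc_taxi.yellow_taxi_trips (
    row_hash              VARCHAR(64) PRIMARY KEY,  -- SHA-256 hex of all row values
    vendorid              INTEGER,
    tpep_pickup_datetime  TIMESTAMP,
    tpep_dropoff_datetime TIMESTAMP,
    passenger_count       DECIMAL,
    trip_distance         DECIMAL,
    pulocationid          INTEGER REFERENCES nyc_taxi.taxi_zone_lookup (locationid),
    dolocationid          INTEGER REFERENCES nyc_taxi.taxi_zone_lookup (locationid),
    payment_type          BIGINT,
    fare_amount           DECIMAL,
    total_amount          DECIMAL
    -- ... remaining tax/fee columns omitted for brevity
);
```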
Star Schema Tables:
- `fact_taxi_trips`: Dimensional fact table with calculated measures and foreign keys
- `dim_locations`, `dim_vendor`, `dim_payment_type`, `dim_rate_code`: Dimension tables for analytics
- `dim_date`: Date dimension with partition support (2009-2025)

Geospatial Tables:
- `taxi_zone_lookup`: 263 official NYC taxi zones with borough and service zone info
- `taxi_zone_shapes`: PostGIS MULTIPOLYGON geometries in NYC State Plane coordinates

Reference Tables: `vendor_lookup`, `payment_type_lookup`, `rate_code_lookup`
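Because the zone polygons are stored as PostGIS geometries in EPSG:2263 (NYC State Plane, units of feet), standard spatial predicates work directly against them. A sketch that finds zones bordering a given zone (zone names follow the TLC lookup table):

```sql
-- Zones sharing a boundary with the JFK Airport zone
SELECT n.zone, n.borough
FROM nyc_taxi.taxi_zone_shapes z
JOIN nyc_taxi.taxi_zone_shapes n
  ON z.locationid <> n.locationid
 AND ST_Touches(z.geometry, n.geometry)
WHERE z.zone = 'JFK Airport';
```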
- Enhanced Chunk Processing: Increased chunk size from 10K to 100K rows (10x improvement)
- Optimized Hash Generation: Improved duplicate detection performance by 10x (30K rows/second)
- Memory-Efficient Processing: Chunked loading prevents memory overflow while maximizing throughput
- Overall ETL Improvement: 6.7x faster data loading pipeline
- Dimension Key Caching: In-memory cache eliminates 18M individual SQL dimension lookups
- Before: 5 SQL joins per row × 3.6M rows = 18M database queries
- After: 4 one-time cache population queries + instant dictionary lookups (1000x faster)
- Vectorized Operations: Pandas/NumPy vectorized calculations replace row-by-row processing
- Trip duration calculations: Vectorized datetime operations across entire DataFrames
- Derived measures: Tip percentages, average speeds calculated in bulk (~100x faster)
- Bulk Transaction Strategy: Single bulk transactions replace individual row transactions (see the set-based sketch below)
  - Before: 3.6M individual `engine.begin()` transactions
  - After: ~37 bulk transactions using `pandas.to_sql()` with `method='multi'`
- Enhanced Error Handling: Bulk validation with comprehensive error classification
- Total Improvement: 67-100x performance improvement
- Before: ~3.3 hours for 3.6M records
- After: ~22 minutes for 3.6M records
- Star Schema Processing: Under 3 minutes (previously ~95% of total processing time)
- Hash Generation: Maintains 30K rows/second with larger chunks
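The pipeline performs these lookups in pandas with an in-memory cache, but the effect is the same as trading millions of per-row lookups for one set-based statement. A simplified SQL analogue (column names per the star-schema diagram; date/time key derivation omitted):

```sql
-- One set-based INSERT instead of millions of per-row dimension lookups
INSERT INTO nyc_taxi.fact_taxi_trips
    (pickup_location_key, dropoff_location_key, vendor_key,
     trip_distance, fare_amount, tip_amount, total_amount)
SELECT pl.location_key,
       dl.location_key,
       v.vendor_key,
       t.trip_distance, t.fare_amount, t.tip_amount, t.total_amount
FROM nyc_taxi.yellow_taxi_trips t
JOIN nyc_taxi.dim_locations pl ON pl.locationid = t.pulocationid
JOIN nyc_taxi.dim_locations dl ON dl.locationid = t.dolocationid
JOIN nyc_taxi.dim_vendor    v  ON v.vendorid    = t.vendorid;
```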
- Spatial Index: GIST index on geometry column for fast spatial queries
- Time-Series Indexes: On pickup_datetime and dropoff_datetime
- Location Indexes: On pickup and dropoff location IDs
- Composite Indexes: Combined indexes for common analytical patterns
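Representative statements for these index types (the names here are illustrative; the actual definitions live in `01-nyc-taxi-schema.sql`):

```sql
-- Spatial GIST index for fast geometry operations
CREATE INDEX IF NOT EXISTS idx_zone_shapes_geometry
    ON nyc_taxi.taxi_zone_shapes USING GIST (geometry);

-- Time-series index for temporal range scans
CREATE INDEX IF NOT EXISTS idx_trips_pickup_ts
    ON nyc_taxi.yellow_taxi_trips (tpep_pickup_datetime);

-- Composite index for a common analytical pattern (pickup location + time)
CREATE INDEX IF NOT EXISTS idx_trips_pu_loc_ts
    ON nyc_taxi.yellow_taxi_trips (pulocationid, tpep_pickup_datetime);
```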
Source: NYC TLC Trip Record Data
- Format: Official parquet files (updated monthly by NYC TLC)
- Available Data: 2009-2025 (monthly files, ~60MB and 3-5 million records per month)
- Auto-Download: System automatically downloads configured months from official sources
- Coverage: Complete historical coverage of taxi trip data from NYC Taxi & Limousine Commission
- Data Quality: Handles missing columns across different data periods with graceful fallbacks
Source: NYC TLC Taxi Zone Maps and Lookup Tables
- Lookup Table: `taxi_zone_lookup.csv` - 263 official taxi zones (automatically downloaded)
- Geospatial Data: Complete shapefile ZIP with polygon boundaries (automatically downloaded and extracted)
- Coordinate System: Converted to NYC State Plane (EPSG:2263) for optimal spatial analysis
- Components: `.shp`, `.dbf`, `.shx`, `.prj`, `.sbn`, `.sbx` files extracted automatically
This is production-scale real-world data from New York City's official transportation authority:
- ✅ Authentic NYC taxi trips - every record represents a real taxi ride
- ✅ Flexible temporal coverage - load specific months, last N months, or all available data
- ✅ Official geospatial boundaries - precise NYC taxi zone polygons
- ✅ Rich analytical dimensions - financial, temporal, spatial, and operational data
- ✅ Always current - downloads the latest data directly from official NYC TLC sources
Perfect for learning advanced SQL, big data analytics, and geospatial analysis with realistic datasets that mirror production database challenges. The flexible backfill system allows you to work with datasets ranging from a single month to 16+ years of historical data (2009-2025).
The playground includes a fully configured Apache Superset instance that transforms your NYC taxi data analysis into professional business intelligence:
- Geospatial Maps: Visualize pickup/dropoff patterns across NYC boroughs with PostGIS integration
- Time Series Charts: Track taxi demand patterns, revenue trends, and seasonal variations
- Statistical Distributions: Payment type breakdowns, trip distance histograms, fare analysis
- Cross-Tab Tables: Multi-dimensional analysis with drill-down capabilities
- Custom Metrics: Calculate KPIs like average trip duration, revenue per mile, tip percentages
- Intelligent Autocomplete: Schema-aware query suggestions for all tables and columns
- Query History: Save and reuse complex analytical queries
- Result Visualization: Instantly convert query results into charts and graphs
- Data Export: Download results in CSV, Excel, or JSON formats
- Query Performance: Built-in query optimization and execution plan analysis
- Real-time Filtering: Apply filters across multiple charts simultaneously
- Cross-Dashboard Navigation: Link related dashboards for comprehensive analysis
- Responsive Design: Dashboards work seamlessly on desktop and mobile devices
- Scheduled Reports: Automate dashboard delivery via email
- Public Sharing: Share dashboards with stakeholders via secure URLs
- Executive Dashboards: High-level KPIs and trends for decision makers
- Operational Reports: Daily/weekly performance monitoring and alerts
- Analytical Deep-Dives: Detailed exploration of taxi industry patterns
- Geographic Analysis: Borough-by-borough performance comparisons
- Financial Insights: Revenue analysis, payment pattern trends, profitability metrics
- SQLite Metadata Backend: No Redis dependency, simplified deployment
- Persistent Storage: All dashboards, charts, and user settings preserved
- Security Features: Role-based access control, user management
- Performance Optimized: Efficient caching and query optimization
- Scalable Architecture: Ready for production deployment
Get started with these dashboard concepts using your NYC taxi data:
- Geographic Performance: Map visualizations showing hotspot areas and trip flows
- Temporal Analysis: Hourly, daily, and seasonal demand patterns
- Financial Dashboard: Revenue trends, payment type analysis, profitability metrics
- Operations Monitor: Trip volume, average duration, distance distributions
- Borough Comparison: Cross-borough analytics and performance benchmarks
Transform your SQL skills into compelling data stories with Superset's powerful visualization engine!
```sql
-- Trip volume by hour with zone names (uses lowercase column names)
SELECT EXTRACT(HOUR FROM yt.tpep_pickup_datetime) as hour,
COUNT(*) as trips,
STRING_AGG(DISTINCT pickup_zone.borough, ', ') as pickup_boroughs
FROM nyc_taxi.yellow_taxi_trips yt
JOIN nyc_taxi.taxi_zone_lookup pickup_zone ON yt.pulocationid = pickup_zone.locationid
GROUP BY hour ORDER BY hour;
-- Geospatial: Largest taxi zones by area (PostGIS spatial functions)
SELECT tzl.zone, tzl.borough,
ROUND((ST_Area(geometry) / 43560)::numeric, 2) as area_acres
FROM nyc_taxi.taxi_zone_shapes tzs
JOIN nyc_taxi.taxi_zone_lookup tzl ON tzs.locationid = tzl.locationid
ORDER BY ST_Area(geometry) DESC LIMIT 10;
-- Cross-borough trip analysis with concatenated boroughs
SELECT
pickup_zone.borough || ' → ' || dropoff_zone.borough AS trip_route,
COUNT(*) AS trip_count,
AVG(yt.fare_amount) AS avg_fare
FROM nyc_taxi.yellow_taxi_trips yt
JOIN nyc_taxi.taxi_zone_lookup pickup_zone
ON yt.pulocationid = pickup_zone.locationid
JOIN nyc_taxi.taxi_zone_lookup dropoff_zone
ON yt.dolocationid = dropoff_zone.locationid
GROUP BY pickup_zone.borough, dropoff_zone.borough
ORDER BY trip_count DESC;
-- Financial analysis: Payment patterns by borough
SELECT
pickup_zone.borough || ' - ' || ptl.payment_type_desc AS borough_payment_type,
COUNT(*) AS trips,
AVG(yt.total_amount) AS avg_total
FROM nyc_taxi.yellow_taxi_trips yt
JOIN nyc_taxi.taxi_zone_lookup pickup_zone
ON yt.pulocationid = pickup_zone.locationid
JOIN nyc_taxi.payment_type_lookup ptl
ON yt.payment_type = ptl.payment_type
GROUP BY pickup_zone.borough, ptl.payment_type_desc
ORDER BY avg_total DESC;
```
```sql
-- NYC Yellow Taxi Data Analytics Queries
-- Sample queries to explore the NYC taxi dataset
-- 1. Basic Data Overview
SELECT 'Total Trips' as metric, COUNT(*)::text as value
FROM nyc_taxi.yellow_taxi_trips
UNION ALL
SELECT 'Date Range' as metric,
MIN(tpep_pickup_datetime)::text || ' to ' || MAX(tpep_pickup_datetime)::text as value
FROM nyc_taxi.yellow_taxi_trips
UNION ALL
SELECT 'Total Revenue' as metric, '$' || ROUND(SUM(total_amount), 2)::text as value
FROM nyc_taxi.yellow_taxi_trips;
-- 2. Trip Volume by Hour of Day
SELECT
EXTRACT(HOUR FROM tpep_pickup_datetime) as pickup_hour,
COUNT(*) as trip_count,
ROUND(AVG(trip_distance), 2) as avg_distance,
ROUND(AVG(total_amount), 2) as avg_fare
FROM nyc_taxi.yellow_taxi_trips
GROUP BY EXTRACT(HOUR FROM tpep_pickup_datetime)
ORDER BY pickup_hour;
-- 3. Payment Method Analysis
SELECT
CASE payment_type
WHEN 1 THEN 'Credit Card'
WHEN 2 THEN 'Cash'
WHEN 3 THEN 'No Charge'
WHEN 4 THEN 'Dispute'
WHEN 5 THEN 'Unknown'
WHEN 6 THEN 'Voided'
ELSE 'Other'
END as payment_method,
COUNT(*) as trip_count,
ROUND(AVG(total_amount), 2) as avg_fare,
ROUND(AVG(tip_amount), 2) as avg_tip,
ROUND(SUM(total_amount), 2) as total_revenue,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 2) as percentage
FROM nyc_taxi.yellow_taxi_trips
GROUP BY payment_type
ORDER BY trip_count DESC;
-- 4. Top Revenue Generating Trips - possible erroneous data inputs
SELECT
tpep_pickup_datetime,
trip_distance,
total_amount,
tip_amount,
pulocationid as pickup_zone,
dolocationid as dropoff_zone,
EXTRACT(HOUR FROM tpep_pickup_datetime) as pickup_hour
FROM nyc_taxi.yellow_taxi_trips
WHERE total_amount > 100
ORDER BY total_amount DESC
LIMIT 20;
-- 5. Trip Distance Distribution
SELECT
CASE
WHEN trip_distance <= 1 THEN '0-1 miles'
WHEN trip_distance <= 3 THEN '1-3 miles'
WHEN trip_distance <= 5 THEN '3-5 miles'
WHEN trip_distance <= 10 THEN '5-10 miles'
WHEN trip_distance <= 20 THEN '10-20 miles'
ELSE '20+ miles'
END as distance_range,
COUNT(*) as trip_count,
ROUND(AVG(trip_distance),2) AS avg_distance,
ROUND(AVG(total_amount), 2) as avg_fare,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 2) as percentage
FROM nyc_taxi.yellow_taxi_trips
WHERE trip_distance > 0 AND trip_distance < 500
GROUP BY
CASE
WHEN trip_distance <= 1 THEN '0-1 miles'
WHEN trip_distance <= 3 THEN '1-3 miles'
WHEN trip_distance <= 5 THEN '3-5 miles'
WHEN trip_distance <= 10 THEN '5-10 miles'
WHEN trip_distance <= 20 THEN '10-20 miles'
ELSE '20+ miles'
END
ORDER BY MIN(trip_distance);
-- 6. Daily Trip Patterns
SELECT
DATE(tpep_pickup_datetime) as trip_date,
COUNT(*) as total_trips,
ROUND(AVG(total_amount), 2) as avg_fare,
ROUND(SUM(total_amount), 2) as daily_revenue,
COUNT(CASE WHEN payment_type = 1 THEN 1 END) as credit_card_trips,
COUNT(CASE WHEN payment_type = 2 THEN 1 END) as cash_trips
FROM nyc_taxi.yellow_taxi_trips
GROUP BY DATE(tpep_pickup_datetime)
ORDER BY trip_date;
-- 7. Rush Hour Analysis
SELECT
CASE
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 7 AND 9 THEN 'Morning Rush (7-9 AM)'
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 17 AND 19 THEN 'Evening Rush (5-7 PM)'
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 22 AND 23 OR
EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 0 AND 5 THEN 'Night (10 PM - 5 AM)'
ELSE 'Regular Hours'
END as time_period,
COUNT(*) as trip_count,
ROUND(AVG(trip_distance), 2) as avg_distance,
ROUND(AVG(total_amount), 2) as avg_fare,
ROUND(AVG(tip_amount), 2) as avg_tip
FROM nyc_taxi.yellow_taxi_trips
GROUP BY
CASE
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 7 AND 9 THEN 'Morning Rush (7-9 AM)'
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 17 AND 19 THEN 'Evening Rush (5-7 PM)'
WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 22 AND 23 OR
EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 0 AND 5 THEN 'Night (10 PM - 5 AM)'
ELSE 'Regular Hours'
END
ORDER BY trip_count DESC;
-- 8. Tip Analysis by Payment Type
SELECT
CASE payment_type
WHEN 1 THEN 'Credit Card'
WHEN 2 THEN 'Cash'
ELSE 'Other'
END as payment_method,
COUNT(*) as trip_count,
ROUND(AVG(tip_amount), 2) as avg_tip,
ROUND(AVG(fare_amount), 2) as avg_fare,
ROUND(AVG(tip_amount / NULLIF(fare_amount, 0) * 100), 2) as avg_tip_percentage,
COUNT(CASE WHEN tip_amount > 0 THEN 1 END) as trips_with_tips
FROM nyc_taxi.yellow_taxi_trips
WHERE fare_amount > 0 AND payment_type IN (1, 2)
GROUP BY payment_type
ORDER BY avg_tip DESC;
-- 9. Weekend vs Weekday Analysis
SELECT
CASE
WHEN EXTRACT(DOW FROM tpep_pickup_datetime) IN (0, 6) THEN 'Weekend'
ELSE 'Weekday'
END as day_type,
COUNT(*) as trip_count,
ROUND(AVG(trip_distance), 2) as avg_distance,
ROUND(AVG(total_amount), 2) as avg_fare,
ROUND(SUM(total_amount), 2) as total_revenue
FROM nyc_taxi.yellow_taxi_trips
GROUP BY
CASE
WHEN EXTRACT(DOW FROM tpep_pickup_datetime) IN (0, 6) THEN 'Weekend'
ELSE 'Weekday'
END
ORDER BY trip_count DESC;
-- 10. Long Distance Trips (Over 20 miles) - possible erroneous data inputs
SELECT
tpep_pickup_datetime,
trip_distance,
total_amount,
ROUND(total_amount / trip_distance, 2) as fare_per_mile,
pulocationid,
dolocationid,
passenger_count
FROM nyc_taxi.yellow_taxi_trips
WHERE trip_distance > 20
ORDER BY trip_distance DESC
LIMIT 4000;
```