TripTok is a data engineering + ML project built for the NYC TLC Trip Record data. The goal is to provide a clean, queryable, and regularly updated dataset that can be used by downstream ML components (fare prediction, ETA, routing, etc.) and LLM-powered interfaces.
At a high level:
- Raw monthly TLC parquet files are stored locally and in S3 (bronze layer).
- A cleaning + standardization pipeline produces service-specific “silver” parquet (yellow, green, fhv, fhvhv) in S3.
- AWS Athena exposes those silver files as external tables for easy SQL access.
- A unified `silver_all` table/view provides a common schema across all services for modeling and analytics.
Repo skeleton + configs only:

- `configs/` — storage & source placeholders
- `data/reference/` — TLC taxi zone lookup CSV + shapefiles
- `data/parquet/` — your local monthly parquet (ignored by Git)
- `src/triptok/` — package code will live here (next stages)
- `notebooks/` — scratch & exploration
```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip && pip install -e .
```
Bucket: `s3://triptok-nyctlc-data/`

```text
bronze/
  service=<service>/year=<YYYY>/month=<MM>/source.parquet
silver/
  service=<service>/year=<YYYY>/month=<MM>/data.parquet
reference/
  taxi_zone_lookup.csv
  taxi_zones/...
```
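The partitioned key layout above is easy to construct programmatically; a minimal sketch (the `s3_key` helper is illustrative, not part of the package):

```python
def s3_key(layer: str, service: str, year: int, month: int) -> str:
    """Build a partitioned S3 key following the TripTok bucket layout.

    layer: "bronze" or "silver" -- bronze stores the raw download as
    source.parquet, silver always writes data.parquet.
    """
    filename = "source.parquet" if layer == "bronze" else "data.parquet"
    return f"{layer}/service={service}/year={year:04d}/month={month:02d}/{filename}"
```

For example, `s3_key("silver", "yellow", 2025, 8)` yields `silver/service=yellow/year=2025/month=08/data.parquet`.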
The unified silver schema:

| Column | Type | Notes |
| --- | --- | --- |
| `service` | string | `yellow` \| `green` \| `fhv` \| `fhvhv` |
| `pickup_datetime` | timestamp | standardized to UTC or local |
| `dropoff_datetime` | timestamp | |
| `trip_duration_min` | float | minutes, may contain outliers |
| `pu_location_id` | int? | TLC zone ID (optional for fhv) |
| `do_location_id` | int? | |
| `PUZone` | string? | human-readable pickup zone |
| `DOZone` | string? | |
| `PUBorough` | string? | |
| `DOBorough` | string? | |
| `fare_amount` | float? | main fare (where defined) |
| `total_amount` | float? | full fare (tip, tolls, surcharges, etc.) |
| ... | | |
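Downstream code can sanity-check a file against this contract before modeling; a plain-Python sketch (nullability is read off the table above, and `missing_required` is a hypothetical helper, not part of the package):

```python
# Unified-schema columns; True means the column may be null/absent
# for some services (e.g. location IDs for fhv).
UNIFIED_COLUMNS = {
    "service": False,
    "pickup_datetime": False,
    "dropoff_datetime": False,
    "trip_duration_min": False,
    "pu_location_id": True,
    "do_location_id": True,
    "PUZone": True,
    "DOZone": True,
    "PUBorough": True,
    "DOBorough": True,
    "fare_amount": True,
    "total_amount": True,
}

def missing_required(columns: set) -> set:
    """Return the required unified-schema columns absent from `columns`."""
    return {c for c, nullable in UNIFIED_COLUMNS.items()
            if not nullable and c not in columns}
```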
## Architecture
TripTok follows a lightweight medallion-style layout:
### Bronze (raw)
- Location: `s3://triptok-nyctlc-data/bronze/...`
- Layout: `bronze/service=<service>/year=<YYYY>/month=<MM>/<raw_file>.parquet`
- Contents: Raw TLC trip parquet files as downloaded (column names/types intact).
- Purpose: Immutable source of truth; we never rewrite bronze.
### Silver (cleaned)
- Location: `s3://triptok-nyctlc-data/silver/...`
- Layout: `silver/service=<service>/year=<YYYY>/month=<MM>/data.parquet`
- Contents:
- Standardized column names across services:
- `pickup_datetime`, `dropoff_datetime`
- `pu_location_id`, `do_location_id`
- `trip_duration_min`, `trip_distance`
- fare-related columns (`fare_amount`, `total_amount`, `tips`, etc.)
- Enriched with TLC zone lookups:
- `PUBorough`, `PUZone`, `DOBorough`, `DOZone`
- Purpose: Single, consistent schema that downstream ML code and Athena can rely on.
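At its core, the standardization step is a per-service rename map plus a few derived columns. A simplified pandas sketch (the raw `tpep_`/`lpep_` column names are the real TLC ones for yellow/green; the `standardize` function itself is illustrative, not the package implementation):

```python
import pandas as pd

# Raw-to-standard rename maps (yellow/green shown; fhv/fhvhv differ).
RENAMES = {
    "yellow": {
        "tpep_pickup_datetime": "pickup_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "PULocationID": "pu_location_id",
        "DOLocationID": "do_location_id",
    },
    "green": {
        "lpep_pickup_datetime": "pickup_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "PULocationID": "pu_location_id",
        "DOLocationID": "do_location_id",
    },
}

def standardize(df: pd.DataFrame, service: str) -> pd.DataFrame:
    """Rename raw columns and derive service/duration fields."""
    out = df.rename(columns=RENAMES[service]).copy()
    out["service"] = service
    out["trip_duration_min"] = (
        out["dropoff_datetime"] - out["pickup_datetime"]
    ).dt.total_seconds() / 60.0
    return out
```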
### Query layer (Athena)
- Database: `triptok`
- External tables:
- `silver_yellow` – Yellow taxi trips
- `silver_green` – Green taxi trips
- `silver_fhv` – FHV trips
- `silver_fhvhv` – High-volume FHV (FHVHV) trips
- `silver_all` – unified schema across all four services, backed by the same S3 silver paths
The SQL used to create these tables is checked into version control under `sql/athena/`.
## Running the pipeline
### 1. Environment setup
```bash
# from the repo root
python -m venv .venv
source .venv/bin/activate   # fish shell: source .venv/bin/activate.fish
pip install -U pip
pip install -e .
```

### 2. Local silver test (optional)

```bash
triptok silver-local --service yellow --year 2025 --month 8
```

This writes cleaned files under:

```text
data/silver_local/service=yellow/year=2025/month=08/data.parquet
```

### 3. Ingesting data to S3 (bronze + silver)

```bash
triptok ingest-s3 --service yellow --year 2025 --month 8
```

This command:

1. Finds the local parquet for that month in `data/parquet/<year>/<service>/`.
2. Uploads it to the bronze path in S3 if it is not already present.
3. Runs the standardization + validation pipeline to produce silver.
4. Uploads the silver parquet to `s3://triptok-nyctlc-data/silver/...`.

### 4. Backfill

All services and months can be backfilled with:

```bash
for service in yellow green fhv fhvhv; do
  for year in 2019 2020 2021 2022 2023 2024 2025; do
    for month in $(seq -w 1 12); do
      triptok ingest-s3 --service "$service" --year "$year" --month "$month" \
        || echo "⚠️ skipped $service $year-$month"
    done
  done
done
```
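The same backfill plan can be generated from Python, e.g. to drive ingestion programmatically or skip months known to be missing (a hypothetical helper mirroring the CLI arguments, not part of the package):

```python
from itertools import product

SERVICES = ("yellow", "green", "fhv", "fhvhv")
YEARS = range(2019, 2026)   # 2019..2025 inclusive
MONTHS = range(1, 13)

def backfill_plan(services=SERVICES, years=YEARS, months=MONTHS):
    """Yield (service, year, month) tuples in the same order as the shell loop."""
    yield from product(services, years, months)

# Each tuple maps to one invocation of:
#   triptok ingest-s3 --service <service> --year <year> --month <month>
```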
## Querying data with AWS Athena
All cleaned (silver) data is stored as partitioned Parquet files in S3 and
queried via AWS Athena.
### Athena database and tables
We use the `triptok` database in Athena and define the following external tables:
- `silver_yellow`
- `silver_green`
- `silver_fhv`
- `silver_fhvhv`
- `silver_all`
The `CREATE TABLE` statements for these live in:
```text
sql/
  athena/
    silver_yellow.sql
    silver_green.sql
    silver_fhv.sql
    silver_fhvhv.sql
    silver_all.sql
```

### Creating / recreating tables

1. In the Athena console, select the `triptok` database.
2. Open the corresponding `.sql` file from `sql/athena/` and paste it into the query editor.
3. Run the `CREATE EXTERNAL TABLE ...` statement.
4. After creating a table, load its partitions:

```sql
MSCK REPAIR TABLE triptok.<table_name>;
```
# SQL Assets
This folder contains SQL used to interact with the TripTok data in AWS Athena.
Structure:
- `athena/`
- `silver_yellow.sql` – `CREATE EXTERNAL TABLE` for Yellow silver data
- `silver_green.sql` – Green silver table
- `silver_fhv.sql` – FHV silver table
- `silver_fhvhv.sql` – FHVHV silver table
- `silver_all.sql` – unified schema across all services
Usage:
1. Open the desired file in this folder.
2. Paste the SQL into the Athena query editor.
3. Run the statement against the `triptok` database.
4. After creating a table, run `MSCK REPAIR TABLE triptok.<table_name>;` to load partitions.
# Example queries

1. Trips per month for Yellow:

```sql
SELECT
  year,
  month,
  COUNT(*) AS trips
FROM triptok.silver_yellow
GROUP BY year, month
ORDER BY year, month;
```

2. Unified slice for modeling:

```sql
SELECT
  service,
  pickup_datetime,
  dropoff_datetime,
  pu_location_id,
  do_location_id,
  trip_distance,
  total_amount,
  passenger_count
FROM triptok.silver_all
WHERE year = 2024
  AND trip_distance > 0
  AND total_amount > 0;
```
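For quick local iteration, the same modeling slice can be reproduced with pandas on a downloaded silver file (a sketch mirroring the `WHERE` clause above; it assumes the `year` partition column is materialized in the frame):

```python
import pandas as pd

def modeling_slice(df: pd.DataFrame, year: int = 2024) -> pd.DataFrame:
    """Apply the same filters and projection as the silver_all example query."""
    mask = (
        (df["year"] == year)
        & (df["trip_distance"] > 0)
        & (df["total_amount"] > 0)
    )
    cols = [
        "service", "pickup_datetime", "dropoff_datetime",
        "pu_location_id", "do_location_id",
        "trip_distance", "total_amount", "passenger_count",
    ]
    return df.loc[mask, cols]
```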