TripTok is a data engineering + ML project built for the NYC TLC Trip Record data. The goal is to provide a clean, queryable, and regularly updated dataset that can be used by downstream ML components (fare prediction, ETA, routing, etc.) and LLM-powered interfaces.
At a high level:
- Raw monthly TLC parquet files are stored locally and in S3 (bronze layer).
- A cleaning + standardization pipeline produces service-specific “silver” parquet (yellow, green, fhv, fhvhv) in S3.
- AWS Athena exposes those silver files as external tables for easy SQL access.
- A unified `silver_all` table/view provides a common schema across all services for modeling and analytics.
Repo skeleton + configs only:

- `configs/` — storage & source placeholders
- `data/reference/` — TLC taxi zone lookup CSV + shapefiles
- `data/parquet/` — your local monthly parquet (ignored by Git)
- `src/triptok/` — package code will live here (next stages)
- `notebooks/` — scratch & exploration
```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip && pip install -e .
```
Bucket: `s3://triptok-nyctlc-data/`

```text
bronze/
  service=<service>/year=<YYYY>/month=<MM>/source.parquet
silver/
  service=<service>/year=<YYYY>/month=<MM>/data.parquet
reference/
  taxi_zone_lookup.csv
  taxi_zones/...
```
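The partitioned key layout above is easy to construct programmatically; a minimal sketch (the `s3_key` helper is illustrative, not part of the package):

```python
def s3_key(layer: str, service: str, year: int, month: int) -> str:
    """Build a partitioned S3 key following the TripTok bucket layout.

    layer: "bronze" or "silver" -- bronze stores the raw download as
    source.parquet, silver always writes data.parquet.
    """
    filename = "source.parquet" if layer == "bronze" else "data.parquet"
    return f"{layer}/service={service}/year={year:04d}/month={month:02d}/{filename}"
```

For example, `s3_key("silver", "yellow", 2025, 8)` yields `silver/service=yellow/year=2025/month=08/data.parquet`.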
The unified silver schema:

| Column | Type | Notes |
| --- | --- | --- |
| `service` | string | `yellow` \| `green` \| `fhv` \| `fhvhv` |
| `pickup_datetime` | timestamp | standardized to UTC or local |
| `dropoff_datetime` | timestamp | |
| `trip_duration_min` | float | minutes, may contain outliers |
| `pu_location_id` | int? | TLC zone ID (optional for fhv) |
| `do_location_id` | int? | |
| `PUZone` | string? | human-readable pickup zone |
| `DOZone` | string? | |
| `PUBorough` | string? | |
| `DOBorough` | string? | |
| `fare_amount` | float? | main fare (where defined) |
| `total_amount` | float? | full fare (tip, tolls, surcharges, etc.) |
| ... | | |
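Downstream code can sanity-check a file against this contract before modeling; a plain-Python sketch (nullability is read off the table above, and `missing_required` is a hypothetical helper, not part of the package):

```python
# Unified-schema columns; True means the column may be null/absent
# for some services (e.g. location IDs for fhv).
UNIFIED_COLUMNS = {
    "service": False,
    "pickup_datetime": False,
    "dropoff_datetime": False,
    "trip_duration_min": False,
    "pu_location_id": True,
    "do_location_id": True,
    "PUZone": True,
    "DOZone": True,
    "PUBorough": True,
    "DOBorough": True,
    "fare_amount": True,
    "total_amount": True,
}

def missing_required(columns: set) -> set:
    """Return the required unified-schema columns absent from `columns`."""
    return {c for c, nullable in UNIFIED_COLUMNS.items()
            if not nullable and c not in columns}
```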
## Architecture
TripTok follows a lightweight medallion-style layout:
### Bronze (raw)
- Location: `s3://triptok-nyctlc-data/bronze/...`
- Layout: `bronze/service=<service>/year=<YYYY>/month=<MM>/<raw_file>.parquet`
- Contents: Raw TLC trip parquet files as downloaded (column names/types intact).
- Purpose: Immutable source of truth; we never rewrite bronze.
### Silver (cleaned)
- Location: `s3://triptok-nyctlc-data/silver/...`
- Layout: `silver/service=<service>/year=<YYYY>/month=<MM>/data.parquet`
- Contents:
- Standardized column names across services:
- `pickup_datetime`, `dropoff_datetime`
- `pu_location_id`, `do_location_id`
- `trip_duration_min`, `trip_distance`
- fare-related columns (`fare_amount`, `total_amount`, `tips`, etc.)
- Enriched with TLC zone lookups:
- `PUBorough`, `PUZone`, `DOBorough`, `DOZone`
- Purpose: Single, consistent schema that downstream ML code and Athena can rely on.
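At its core, the standardization step is a per-service rename map plus a few derived columns. A simplified pandas sketch (the raw `tpep_`/`lpep_` column names are the real TLC ones for yellow/green; the `standardize` function itself is illustrative, not the package implementation):

```python
import pandas as pd

# Raw-to-standard rename maps (yellow/green shown; fhv/fhvhv differ).
RENAMES = {
    "yellow": {
        "tpep_pickup_datetime": "pickup_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "PULocationID": "pu_location_id",
        "DOLocationID": "do_location_id",
    },
    "green": {
        "lpep_pickup_datetime": "pickup_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "PULocationID": "pu_location_id",
        "DOLocationID": "do_location_id",
    },
}

def standardize(df: pd.DataFrame, service: str) -> pd.DataFrame:
    """Rename raw columns and derive service/duration fields."""
    out = df.rename(columns=RENAMES[service]).copy()
    out["service"] = service
    out["trip_duration_min"] = (
        out["dropoff_datetime"] - out["pickup_datetime"]
    ).dt.total_seconds() / 60.0
    return out
```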
### Query layer (Athena)
- Database: `triptok`
- External tables:
- `silver_yellow` – Yellow taxi trips
- `silver_green` – Green taxi trips
- `silver_fhv` – FHV trips
- `silver_fhvhv` – High-volume FHV (FHVHV) trips
- `silver_all` – unified schema across all four services, backed by the same S3 silver paths
The SQL used to create these tables is checked into version control under `sql/athena/`.
## Running the pipeline
### 1. Environment setup
```bash
# from the repo root
python -m venv .venv
source .venv/bin/activate   # fish shell: source .venv/bin/activate.fish
pip install -U pip
pip install -e .
```

### 2. Local silver test (optional)

```bash
triptok silver-local --service yellow --year 2025 --month 8
```

This writes cleaned files under:

```text
data/silver_local/service=yellow/year=2025/month=08/data.parquet
```

### 3. Ingesting data to S3 (bronze + silver)

```bash
triptok ingest-s3 --service yellow --year 2025 --month 8
```

This command:

1. Finds the local parquet for that month in `data/parquet/<year>/<service>/`.
2. Uploads it to the bronze path in S3 if it is not already present.
3. Runs the standardization + validation pipeline to produce silver.
4. Uploads the silver parquet to `s3://triptok-nyctlc-data/silver/...`.

### 4. Backfill

All services and months can be backfilled with:

```bash
for service in yellow green fhv fhvhv; do
  for year in 2019 2020 2021 2022 2023 2024 2025; do
    for month in $(seq -w 1 12); do
      triptok ingest-s3 --service "$service" --year "$year" --month "$month" \
        || echo "⚠️ skipped $service $year-$month"
    done
  done
done
```
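The same backfill plan can be generated from Python, e.g. to drive ingestion programmatically or skip months known to be missing (a hypothetical helper mirroring the CLI arguments, not part of the package):

```python
from itertools import product

SERVICES = ("yellow", "green", "fhv", "fhvhv")
YEARS = range(2019, 2026)   # 2019..2025 inclusive
MONTHS = range(1, 13)

def backfill_plan(services=SERVICES, years=YEARS, months=MONTHS):
    """Yield (service, year, month) tuples in the same order as the shell loop."""
    yield from product(services, years, months)

# Each tuple maps to one invocation of:
#   triptok ingest-s3 --service <service> --year <year> --month <month>
```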
## Querying data with AWS Athena
All cleaned (silver) data is stored as partitioned Parquet files in S3 and
queried via AWS Athena.
### Athena database and tables
We use the `triptok` database in Athena and define the following external tables:
- `silver_yellow`
- `silver_green`
- `silver_fhv`
- `silver_fhvhv`
- `silver_all`
The `CREATE TABLE` statements for these live in:
```text
sql/
  athena/
    silver_yellow.sql
    silver_green.sql
    silver_fhv.sql
    silver_fhvhv.sql
    silver_all.sql
```

### Creating / recreating tables

1. In the Athena console, select the `triptok` database.
2. Open the corresponding `.sql` file from `sql/athena/` and paste it into the query editor.
3. Run the `CREATE EXTERNAL TABLE ...` statement.
4. After creating a table, load its partitions:

```sql
MSCK REPAIR TABLE triptok.<table_name>;
```
# SQL Assets
This folder contains SQL used to interact with the TripTok data in AWS Athena.
Structure:
- `athena/`
- `silver_yellow.sql` – `CREATE EXTERNAL TABLE` for Yellow silver data
- `silver_green.sql` – Green silver table
- `silver_fhv.sql` – FHV silver table
- `silver_fhvhv.sql` – FHVHV silver table
- `silver_all.sql` – unified schema across all services
Usage:
1. Open the desired file in this folder.
2. Paste the SQL into the Athena query editor.
3. Run the statement against the `triptok` database.
4. After creating a table, run `MSCK REPAIR TABLE triptok.<table_name>;` to load partitions.
# Example queries

1. Trips per month for Yellow:

```sql
SELECT
  year,
  month,
  COUNT(*) AS trips
FROM triptok.silver_yellow
GROUP BY year, month
ORDER BY year, month;
```

2. Unified slice for modeling:

```sql
SELECT
  service,
  pickup_datetime,
  dropoff_datetime,
  pu_location_id,
  do_location_id,
  trip_distance,
  total_amount,
  passenger_count
FROM triptok.silver_all
WHERE year = 2024
  AND trip_distance > 0
  AND total_amount > 0;
```
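For quick local iteration, the same modeling slice can be reproduced with pandas on a downloaded silver file (a sketch mirroring the `WHERE` clause above; it assumes the `year` partition column is materialized in the frame):

```python
import pandas as pd

def modeling_slice(df: pd.DataFrame, year: int = 2024) -> pd.DataFrame:
    """Apply the same filters and projection as the silver_all example query."""
    mask = (
        (df["year"] == year)
        & (df["trip_distance"] > 0)
        & (df["total_amount"] > 0)
    )
    cols = [
        "service", "pickup_datetime", "dropoff_datetime",
        "pu_location_id", "do_location_id",
        "trip_distance", "total_amount", "passenger_count",
    ]
    return df.loc[mask, cols]
```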