A ground-up ML pipeline for predicting urban transit ridership demand using the ZTBus dataset from ETH Zurich.
Dataset: ZTBus -- Second-Resolution Trolleybus Dataset from Zurich (ETH Research Collection)
Course: IE 7275 Data Mining in Engineering (Northeastern University, Spring 2026)
All credit for the ZTBus dataset belongs to the original authors at ETH Zurich.
The ZTBus dataset contains 1,409 trolleybus missions recorded at 1-second resolution across 26 sensor channels in Zurich (2019--2022). Passenger counts are reported as sparse stop events (~1% of rows).
This repository builds a full classification pipeline:
- Task: 3-class ridership demand classification (low / medium / high)
- Key Insight: Forward-filling passenger counts between stops (physically valid -- count is constant between boarding events) transforms 0.9% usable data into 100%
- Models: Decision Tree, Random Forest
- Evaluation: Macro F1, balanced accuracy, per-class confusion matrices
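The forward-fill step above can be sketched with pandas. This is a minimal illustration on a hypothetical mini-frame; the column names (`mission_id`, `passenger_count`) are assumptions, not necessarily the repo's actual schema:

```python
import pandas as pd
import numpy as np

# Hypothetical telemetry: passenger counts are only logged at stop
# events (door openings); every other row is NaN.
df = pd.DataFrame({
    "mission_id": [1, 1, 1, 1, 2, 2, 2],
    "passenger_count": [12, np.nan, np.nan, 18, np.nan, 5, np.nan],
})

# Forward-fill within each mission only, so a count never leaks across
# mission boundaries. Physically valid: the on-board count is constant
# between consecutive boarding events.
df["passenger_count"] = df.groupby("mission_id")["passenger_count"].ffill()

vals = df["passenger_count"].tolist()
print(vals)
```

Rows before a mission's first stop event stay NaN (there is nothing to fill from) and would be dropped before training.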
Given second-resolution sensor telemetry from a trolleybus (GPS, speed, power demand, temperature, door state), classify the current ridership demand level:
| Class | Passenger Count | Distribution |
|---|---|---|
| Low | x <= q1 (~9) | ~37% |
| Medium | q1 < x <= q2 (~19) | ~32% |
| High | > q2 (~19) | ~32% |
Tercile boundaries computed on training data only. Mission-level train/test splits prevent data leakage.
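Both leakage safeguards can be sketched together: split by whole missions first, then derive the tercile cutoffs from the training portion only. The data and variable names below are illustrative assumptions, not the repo's actual code:

```python
import numpy as np
import pandas as pd

# Hypothetical forward-filled observations across four missions.
df = pd.DataFrame({
    "mission_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "passenger_count": [2, 5, 9, 11, 19, 22, 30, 41],
})

# Mission-level split: every mission lands entirely in train OR test,
# never both, so same-mission rows cannot leak across the split.
rng = np.random.default_rng(42)
missions = df["mission_id"].unique()
rng.shuffle(missions)
test_missions = set(missions[: len(missions) // 4])
train = df[~df["mission_id"].isin(test_missions)]
test = df[df["mission_id"].isin(test_missions)]

# Tercile boundaries come from the TRAINING distribution only.
q1, q2 = train["passenger_count"].quantile([1 / 3, 2 / 3])

def demand_class(x):
    # low <= q1 < medium <= q2 < high
    return "low" if x <= q1 else ("medium" if x <= q2 else "high")

train_y = train["passenger_count"].map(demand_class)
test_y = test["passenger_count"].map(demand_class)  # train-derived cutoffs
```

Applying the train-derived `q1`/`q2` to the test set is the point: the test distribution contributes nothing to the class boundaries.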
Scalable-Demand-Optimization/
├── src/
│ ├── config.py # Centralized paths, constants, hyperparameters
│ ├── data_loading.py # Metadata parsing, stratified sampling, CSV loading
│ ├── preprocessing.py # Unit conversions, temporal features, forward-fill
│ ├── feature_engineering.py # Categorical encoding, rolling windows, acceleration
│ ├── target.py # Tercile binning, demand class assignment
│ └── model_pipeline.py # Train/test split, model configs, evaluation
├── tests/ # 150 tests (TDD -- all written before implementation)
├── scripts/
│ ├── 01_eda.py # Exploratory data analysis (11 figures)
│ ├── 02_train.py # Full training pipeline (2 models: DT + RF)
│ └── train.sbatch # SLURM batch script for CPU cluster
├── Final-Project-Proposal-Markdown/ # 10-section project proposal
├── data/ # Dataset (gitignored)
├── figures/ # EDA and evaluation plots (gitignored)
├── results/ # Model metrics and summaries (gitignored)
├── logs/ # Training logs (gitignored)
├── ML-EXPERIMENT_DESIGN.md # Full experiment plan, model configs, EDA rationale
├── DATASET_README.txt # Authoritative column definitions from dataset authors
├── TEST_VALIDATION.md # TDD methodology and test coverage strategy
└── TASK.md # Reproducible execution guide
Features are extracted from the dense (forward-filled) telemetry stream:
- Temporal: hour, day of week, month, year, weekend flag, rush hour flag
- Sensor: altitude, power demand, traction force, brake pressure, door state
- Kinematic: speed (km/h), acceleration (m/s^2), rolling mean/std of speed (60s, 300s windows)
- Spatial: latitude/longitude (degrees), route (one-hot), stop name (top-20 + other bucket)
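The kinematic features can be sketched as follows, assuming 1 Hz sampling so a 60-row window spans 60 s. The series and variable names are illustrative:

```python
import pandas as pd

# Hypothetical 1 Hz speed trace for one mission, in m/s.
speed_ms = pd.Series([0.0, 2.0, 4.0, 6.0, 6.0, 5.0])
speed_kmh = speed_ms * 3.6

# Acceleration (m/s^2) as a finite difference at the 1 s interval.
accel = speed_ms.diff().fillna(0.0)

# Rolling mean/std of speed over the last 60 s (and, analogously, 300 s).
# min_periods=1 keeps the start of each mission usable.
roll_mean_60 = speed_kmh.rolling(window=60, min_periods=1).mean()
roll_std_60 = speed_kmh.rolling(window=60, min_periods=1).std().fillna(0.0)
```

In the real pipeline the rolling windows would be computed per mission (e.g. inside a `groupby`) so a window never straddles two missions.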
Training runs on the NEU Explorer cluster (16 CPUs, 128GB RAM). All stochastic operations seeded for reproducibility.
# Full pipeline (Decision Tree + Random Forest)
python scripts/02_train.py
# SLURM submission
sbatch scripts/train.sbatch
TDD methodology -- all 150 tests written before implementation. Mock data throughout, no dataset dependency.
python -m pytest tests/ -v
| Decision | Rationale |
|---|---|
| Forward-fill passenger counts | Physically valid: count constant between stops. Transforms 0.9% to 100% usable data |
| Mission-level train/test split | Prevents temporal leakage from same-mission observations appearing in both sets |
| Tercile boundaries from train only | Prevents information leakage from test distribution |
| No feature scaling needed | Both models are tree-based; scaling is irrelevant for split-based classifiers |
| Top-20 stop encoding | Avoids 147-column explosion; rare stops bucketed as "other" |
| Rolling windows (60s, 300s) | Captures short-term and medium-term kinematic context |
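The top-20 stop encoding can be sketched like this. Stop names and `top_k` below are hypothetical (the repo uses the 20 most frequent stops; `top_k=2` here for brevity):

```python
import pandas as pd

# Hypothetical stop-name column with a long tail of rare stops.
stops = pd.Series(["Bellevue", "Bellevue", "Central", "Central",
                   "Central", "Albisriederplatz"])
top_k = 2

# Keep the top_k most frequent stops; bucket everything else as "other".
top_stops = stops.value_counts().nlargest(top_k).index
bucketed = stops.where(stops.isin(top_stops), "other")

# One-hot encode the reduced categorical: top_k + 1 columns instead of
# one column per unique stop (~147 in the full dataset).
encoded = pd.get_dummies(bucketed, prefix="stop")
```

Tree-based models handle the resulting sparse dummies directly, consistent with the "no feature scaling needed" decision above.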
This project is for educational purposes. See the ETH Research Collection for dataset terms.