Python CLI for the Every Coffee multi-source cafe ingestion pipeline.
- Ingest cafe records from OSM PBF, Overture GeoParquet, and Foursquare CSV
- Normalize and deduplicate with geohash blocking + similarity scoring
- Parse OSM `opening_hours` into structured per-day rows
- Compute `specialty_score` from weighted specialty signals
- Track ingestion runs in Supabase/Postgres
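The geohash blocking and similarity scoring step can be sketched as follows. This is a minimal illustration of the general technique, not the pipeline's actual implementation: the record fields (`id`, `name`, `lat`, `lon`), the geohash precision of 7, the use of `difflib.SequenceMatcher` for name similarity, and the 0.85 threshold are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a coordinate as a standard base32 geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count, even = 0, 0, True  # even-numbered bits refine longitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_matches(records, precision=7, threshold=0.85):
    """Compare only records that share a geohash cell (the blocking step)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[geohash(rec["lat"], rec["lon"], precision)].append(rec)

    matches = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = name_similarity(left["name"], right["name"])
            if score >= threshold:
                matches.append((left["id"], right["id"], round(score, 3)))
    return matches
```

Blocking keeps the number of pairwise comparisons manageable on large inputs. One limitation of this sketch is that two records on opposite sides of a cell boundary are never compared; a fuller version would likely also check neighboring cells.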
python -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .env

Optional (only if you need libpostal-backed address parsing for dedupe normalization):

pip install -e ".[postal]"

everycoffee-ingest osm-import --pbf ./data/planet.osm.pbf --region global
everycoffee-ingest overture-import --parquet ./data/overture.parquet --region global
everycoffee-ingest foursquare-import --csv ./data/foursquare.csv --region global
everycoffee-ingest stockist-import --roaster-id <uuid> --url https://example.com/find-us
everycoffee-ingest dedupe --source osm --region global --dry-run
everycoffee-ingest dedupe --source osm --region global --apply
everycoffee-ingest enrich-hours --source osm --region global
everycoffee-ingest enrich-osm --region global
everycoffee-ingest enrich-specialty --recompute-all
everycoffee-ingest status

All commands return structured JSON so automation can parse outputs reliably.
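Because output is structured JSON, a wrapper script can invoke a command and branch on the result. The sketch below assumes each command prints a single JSON object on stdout and exits non-zero on failure; neither detail is confirmed here, and `run_ingest` and `parse_output` are hypothetical helper names.

```python
import json
import subprocess


def parse_output(stdout: str) -> dict:
    """Parse CLI stdout, assuming it holds exactly one JSON object."""
    payload = json.loads(stdout)
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    return payload


def run_ingest(*args: str) -> dict:
    """Run an everycoffee-ingest subcommand and return its parsed JSON output."""
    result = subprocess.run(
        ["everycoffee-ingest", *args],
        capture_output=True,
        text=True,
        check=True,  # assumption: failures surface as a non-zero exit code
    )
    return parse_output(result.stdout)
```

For example, `run_ingest("dedupe", "--source", "osm", "--region", "global", "--dry-run")` would return the dry-run summary as a dictionary, provided the assumptions above hold.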
See .env.example for required variables.
Use this staged path for first real-world validation. Start with small files, then expand.
- OSM: city-level `.osm.pbf` extract
- Overture: regional GeoParquet sample
- Foursquare: CSV subset (~500 to 5,000 rows)
mkdir -p ./trial-data
# put files into ./trial-data/osm.pbf ./trial-data/overture.parquet ./trial-data/foursquare.csv

everycoffee-ingest osm-import --pbf ./trial-data/osm.pbf --region trial
everycoffee-ingest overture-import --parquet ./trial-data/overture.parquet --region trial
everycoffee-ingest foursquare-import --csv ./trial-data/foursquare.csv --region trial

everycoffee-ingest dedupe --region trial --dry-run

Confirm output fields look sane before applying:
- `compared_pairs`
- `accepted_pairs`
- `clusters`

everycoffee-ingest dedupe --region trial --apply

Review:
- `persisted_matches`
- `merges_attempted`
- `merges_applied`
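The checks on the dry-run and apply counters can be automated. The sketch below assumes each field is a non-negative integer at the top level of the command's JSON output, that accepted pairs cannot exceed compared pairs, and that applied merges cannot exceed attempted merges. These invariants are plausible readings of the field names, not documented guarantees, and the 0.5 acceptance-rate ceiling is an arbitrary placeholder.

```python
def _bad_counters(summary: dict, keys: tuple) -> list[str]:
    """Report any expected counter that is missing or not a non-negative int."""
    problems = []
    for key in keys:
        value = summary.get(key)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{key} should be a non-negative integer, got {value!r}")
    return problems


def check_dry_run(summary: dict, max_accept_rate: float = 0.5) -> list[str]:
    """Return human-readable problems found in a dedupe --dry-run summary."""
    problems = _bad_counters(summary, ("compared_pairs", "accepted_pairs", "clusters"))
    if problems:
        return problems
    compared, accepted = summary["compared_pairs"], summary["accepted_pairs"]
    if accepted > compared:
        problems.append("accepted_pairs exceeds compared_pairs")
    elif compared and accepted / compared > max_accept_rate:
        # Assumption: most compared pairs in a block are not true duplicates.
        problems.append(f"acceptance rate {accepted / compared:.0%} looks implausibly high")
    return problems


def check_apply(summary: dict) -> list[str]:
    """Return human-readable problems found in a dedupe --apply summary."""
    problems = _bad_counters(
        summary, ("persisted_matches", "merges_attempted", "merges_applied")
    )
    if problems:
        return problems
    if summary["merges_applied"] > summary["merges_attempted"]:
        problems.append("merges_applied exceeds merges_attempted")
    return problems
```

An empty list from either function means no problem was detected under these assumptions; it does not prove that the matches themselves are correct.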
everycoffee-ingest enrich-hours --source osm --region trial
everycoffee-ingest enrich-osm --region trial
everycoffee-ingest enrich-specialty --recompute-all

everycoffee-ingest status --limit 20

Use trial metrics to guide fixes:
- Retry recovery: transient errors should recover without aborting a full run
- Bad-row isolation: malformed source rows should increment failure/skip counts, not abort the command
- Merge behavior: repeated `--apply` should remain idempotent
- Data quality: accepted match volume should be plausible for dataset size
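The first two criteria, retry recovery and bad-row isolation, can be illustrated together. This is a generic sketch of the pattern rather than the pipeline's code: `TransientError`, `with_retry`, `import_rows`, and the counter names are hypothetical, and the backoff parameters are arbitrary.

```python
import time


class TransientError(Exception):
    """Stand-in for a recoverable failure such as a dropped connection."""


def with_retry(func, *args, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))


def import_rows(rows, write_row, **retry_kwargs):
    """Write each row independently so one malformed row cannot abort the run."""
    stats = {"imported": 0, "failed": 0}
    for row in rows:
        try:
            with_retry(write_row, row, **retry_kwargs)
            stats["imported"] += 1
        except (KeyError, ValueError, TypeError):
            stats["failed"] += 1  # malformed row: count it and move on
    return stats
```

In this sketch a transient error that survives every retry still propagates, on the reasoning that a persistent outage should stop the run, while malformed rows are only counted. Whether the real commands draw the line in the same place is something the trial run should confirm.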