allang/everycoffee-ingest

everycoffee-ingest

Python CLI for the Every Coffee multi-source cafe ingestion pipeline.

Features

  • Ingest cafe records from OSM PBF, Overture GeoParquet, and Foursquare CSV
  • Normalize and deduplicate with geohash blocking + similarity scoring
  • Parse OSM opening_hours into structured per-day rows
  • Compute specialty_score from weighted specialty signals
  • Track ingestion runs in Supabase/Postgres
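As a rough illustration of what "geohash blocking + similarity scoring" can look like, here is a minimal sketch. It is not the project's implementation: the record shape (`id`, `name`, `lat`, `lon`), the precision of 6, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate as a standard base-32 geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, even = 0, 0, True  # even-numbered bits refine longitude
    chars = []
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def candidate_pairs(records, precision: int = 6):
    """Yield only pairs that share a geohash cell, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[geohash(rec["lat"], rec["lon"], precision)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def accepted_pairs(records, threshold: float = 0.85):
    """Return id pairs whose normalized names score at or above the threshold."""
    accepted = []
    for a, b in candidate_pairs(records):
        score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
        if score >= threshold:
            accepted.append((a["id"], b["id"]))
    return accepted
```

Blocking on a single cell misses duplicates that sit on opposite sides of a cell boundary, so a fuller version would likely also compare against neighboring cells.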

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .env

Optional (only if you need libpostal-backed address parsing for dedupe normalization):

pip install -e ".[postal]"

CLI Commands

everycoffee-ingest osm-import --pbf ./data/planet.osm.pbf --region global
everycoffee-ingest overture-import --parquet ./data/overture.parquet --region global
everycoffee-ingest foursquare-import --csv ./data/foursquare.csv --region global
everycoffee-ingest stockist-import --roaster-id <uuid> --url https://example.com/find-us
everycoffee-ingest dedupe --source osm --region global --dry-run
everycoffee-ingest dedupe --source osm --region global --apply
everycoffee-ingest enrich-hours --source osm --region global
everycoffee-ingest enrich-osm --region global
everycoffee-ingest enrich-specialty --recompute-all
everycoffee-ingest status

All commands return structured JSON so automation can parse outputs reliably.
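For automation, a thin wrapper along these lines may be enough. It is a sketch that assumes each command writes exactly one JSON object to stdout and exits non-zero on failure; the README states neither, so verify both against real output. The helper names `parse_cli_json` and `run_ingest` are made up for this example.

```python
import json
import subprocess


def parse_cli_json(stdout: str) -> dict:
    """Parse a command's stdout as a single JSON object (assumed output shape)."""
    payload = json.loads(stdout)
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    return payload


def run_ingest(*args: str) -> dict:
    """Run everycoffee-ingest with the given arguments and return its parsed output."""
    result = subprocess.run(
        ["everycoffee-ingest", *args],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return parse_cli_json(result.stdout)
```

With the package installed, `run_ingest("status")` would then return the status payload as a dictionary.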

Environment

See .env.example for required variables.

Trial And Error Runbook

Use this staged path for first real-world validation. Start with small files, then expand.

1) Prepare Small Trial Inputs

  • OSM: city-level .osm.pbf extract
  • Overture: regional GeoParquet sample
  • Foursquare: CSV subset (~500 to 5,000 rows)

mkdir -p ./trial-data
# Place the inputs at ./trial-data/osm.pbf, ./trial-data/overture.parquet, and ./trial-data/foursquare.csv

2) Run Ingestion Sources

everycoffee-ingest osm-import --pbf ./trial-data/osm.pbf --region trial
everycoffee-ingest overture-import --parquet ./trial-data/overture.parquet --region trial
everycoffee-ingest foursquare-import --csv ./trial-data/foursquare.csv --region trial

3) Dedupe In Safe Mode First

everycoffee-ingest dedupe --region trial --dry-run

Confirm output fields look sane before applying:

  • compared_pairs
  • accepted_pairs
  • clusters
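One way to make "look sane" concrete is a small check over the parsed dry-run output. The sketch below assumes the three fields are top-level integer counts, which the README does not confirm, and the 50% acceptance ceiling is an arbitrary placeholder to tune per dataset.

```python
def check_dry_run(report: dict, max_accept_rate: float = 0.5) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing looked off."""
    problems = []
    for key in ("compared_pairs", "accepted_pairs", "clusters"):
        value = report.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"{key} should be a non-negative integer, got {value!r}")
    if problems:
        return problems

    compared = report["compared_pairs"]
    accepted = report["accepted_pairs"]
    if accepted > compared:
        problems.append("accepted_pairs exceeds compared_pairs")
    elif compared and accepted / compared > max_accept_rate:
        # Hypothetical threshold: a very high acceptance rate often means the
        # similarity cutoff is too loose for this dataset.
        problems.append(f"acceptance rate {accepted / compared:.0%} looks too high")
    return problems
```

For example, `check_dry_run({"compared_pairs": 1000, "accepted_pairs": 40, "clusters": 35})` returns an empty list.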

4) Apply Dedupe Merges

everycoffee-ingest dedupe --region trial --apply

Review:

  • persisted_matches
  • merges_attempted
  • merges_applied

5) Run Enrichment

everycoffee-ingest enrich-hours --source osm --region trial
everycoffee-ingest enrich-osm --region trial
everycoffee-ingest enrich-specialty --recompute-all
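To give a sense of what enrich-hours produces, the sketch below turns a very small subset of the OSM `opening_hours` syntax into per-day rows. It only handles rules such as `Mo-Fr 08:00-18:00; Sa 09:00-14:00` and raises `ValueError` on anything else (`24/7`, `off`, public holidays, month ranges, and so on). The row keys `day`, `opens`, and `closes` are assumptions, not the project's schema.

```python
import re

DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
_INTERVAL = re.compile(r"(\d{2}:\d{2})-(\d{2}:\d{2})")


def expand_days(spec: str) -> list[str]:
    """Expand 'Mo-We,Sa' into ['Mo', 'Tu', 'We', 'Sa']."""
    days = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            i, j = DAYS.index(start), DAYS.index(end)
            # Ranges such as Sa-Mo wrap around the end of the week.
            days.extend(DAYS[i:j + 1] if i <= j else DAYS[i:] + DAYS[:j + 1])
        else:
            DAYS.index(part)  # raises ValueError for unknown day tokens
            days.append(part)
    return days


def parse_opening_hours(value: str) -> list[dict]:
    """Return one row per (day, interval) for the supported subset of the syntax."""
    rows = []
    for rule in value.split(";"):
        rule = rule.strip()
        if not rule:
            continue
        day_spec, _, time_spec = rule.partition(" ")
        for day in expand_days(day_spec):
            for interval in time_spec.split(","):
                match = _INTERVAL.fullmatch(interval.strip())
                if match is None:
                    raise ValueError(f"unsupported time interval: {interval!r}")
                rows.append({"day": day, "opens": match[1], "closes": match[2]})
    return rows
```

In the full syntax a later rule can override an earlier one for the same day; this sketch simply appends rows and does not model that.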

6) Verify Run Health

everycoffee-ingest status --limit 20

7) Iterate Until Stable

Use trial metrics to guide fixes:

  • Retry recovery: transient errors should recover without aborting a full run
  • Bad-row isolation: malformed source rows should increment failure/skip counts, not abort the command
  • Merge behavior: repeated --apply runs should be idempotent
  • Data quality: accepted match volume should be plausible for dataset size
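The first two criteria can be pictured with a sketch like the one below: transient errors are retried with exponential backoff, while malformed rows are counted and skipped. Everything here is hypothetical (`TransientError`, `with_retries`, `ingest_rows`, the counter names, and the choice of which exceptions mean "bad row"); it shows the intended behavior, not the project's code.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for recoverable failures such as a dropped connection."""


def with_retries(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn, retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))


def ingest_rows(rows, write_row, sleep=time.sleep) -> dict:
    """Write each row, isolating malformed rows instead of aborting the run."""
    counts = {"inserted": 0, "failed": 0}
    for row in rows:
        try:
            with_retries(lambda: write_row(row), sleep=sleep)
            counts["inserted"] += 1
        except (KeyError, ValueError, TypeError):
            # Assumed to indicate a bad source row: count it and move on.
            counts["failed"] += 1
    return counts
```

In this sketch a `TransientError` that survives all retries still propagates and stops the run, on the assumption that a persistent outage should not be recorded as thousands of bad rows.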
