Centralized, searchable schedules for San Francisco public swimming pools. This app scrapes and downloads official pool schedule PDFs, uses an LLM to extract structured schedules, and provides a clean UI to browse by program, day, time, and pool.
## Tech Stack

- Next.js (App Router), React 19, TypeScript
- Tailwind CSS v4 (via `@tailwindcss/postcss` and `@import "tailwindcss"`)
- Vercel AI SDK (`ai`) with Google Generative AI provider (`@ai-sdk/google`)
- Zod for strict schema validation
## Prerequisites

- Node.js 24.4.1+
- npm
- A Google Generative AI API key
## Getting Started

1. Install dependencies:

   ```bash
   npm install
   ```

2. Create `.env.local` in the project root and add:

   ```bash
   GOOGLE_GENERATIVE_AI_API_KEY=your_key_here
   # optional: override autodiscovery of MLK pool PDF
   MLK_PDF_URL=https://sfrecpark.org/DocumentCenter/View/25795
   ```

3. Run the dev server:

   ```bash
   npm run dev
   ```

4. Generate schedules (in another terminal):

   ```bash
   curl -X POST http://localhost:3000/api/extract-schedule
   ```

   This writes `public/data/all_schedules.json`, which you can view at `/schedules`.
## Data Notes

- PDFs are not committed. Place any local PDFs under `data/pdfs/` for testing.
- Extracted data is written to `public/data/all_schedules.json` for the UI, plus a per-PDF cache under `data/extracted/` to avoid re-prompting the LLM.
- The pipeline writes these pool-level fields when known:
  - `poolName` (raw, as found), `poolNameTitle` (title case), `poolShortName` (from the `public/data/pools.json` mapping)
  - `address`, `sfRecParkUrl`, `pdfScheduleUrl`
  - `scheduleLastUpdated`, `scheduleSeason`, `scheduleStartDate`, `scheduleEndDate`
  - `lanes` (pool-wide context when available)
- Program entries include (typed in the sketch after this list):
  - `programName` (canonicalized for consistent filtering)
  - `programNameOriginal` (original text from the PDF)
  - `programNameCanonical` (same as `programName` for now)
  - `dayOfWeek`, `startTime`, `endTime`, `notes`, `lanes` (per-program lanes if listed)
- Multi-program time blocks in a single box (e.g., "Senior Lap Swim (6)" stacked above "Lap Swim (4)") are split into separate program entries, each with its own `lanes` value.
- Writing `public/data/all_schedules.json` at runtime is fine locally; on Vercel the filesystem is ephemeral. Durable storage can be added later.
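For reference, a minimal TypeScript sketch of that shape, assuming one record per pool. Field names come from the lists above; optionality, the exact types, and the `programs` array name are assumptions, not the project's actual schema:

```ts
// Sketch of the all_schedules.json shape; types and optionality are assumed.
interface ProgramEntry {
  programName: string;          // canonicalized for consistent filtering
  programNameOriginal: string;  // original text from the PDF
  programNameCanonical: string; // same as programName for now
  dayOfWeek: string;            // e.g. "Monday"
  startTime: string;            // h:mm[a|p], e.g. "9:00a"
  endTime: string;              // e.g. "2:15p"
  notes?: string;
  lanes?: number;               // per-program lanes if listed
}

interface PoolSchedule {
  poolName: string;             // raw, as found in the PDF
  poolNameTitle?: string;       // title case
  poolShortName?: string;       // from public/data/pools.json
  address?: string;
  sfRecParkUrl?: string;
  pdfScheduleUrl?: string;
  scheduleLastUpdated?: string;
  scheduleSeason?: string;
  scheduleStartDate?: string;
  scheduleEndDate?: string;
  lanes?: number;               // pool-wide context when available
  programs: ProgramEntry[];     // field name assumed
}
```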
## Scripts

- `npm run dev` — start the Next.js dev server
- `npm run build` — build for production
- `npm run start` — start the production build
- `npm run lint` — run ESLint
- `npm run scrape` — scrape pool pages to discover schedule PDF URLs; writes `public/data/discovered_pool_schedules.json`
- `npm run download-pdfs` — download schedule PDFs into `data/pdfs/`
- `npm run process-all-pdfs` — extract schedules from PDFs (uses per-PDF cache); writes `public/data/all_schedules.json`
- `npm run build-schedules` — runs `scrape` → `download-pdfs` → `process-all-pdfs`
- `npm run analyze-programs` — analyze raw vs canonical program names across the dataset
## Pipeline

- Scrape: collect pool metadata and schedule PDF URLs from the SF Rec & Park site.
- Download: fetch PDFs into `data/pdfs/`.
- Extract: for each PDF, send content to the LLM with a strict schema (Zod; sketched below). The system prompt instructs the model to:
  - Use precise time formats like `h:mm[a|p]` (e.g., `9:00a`, `2:15p`).
  - Split multi-program blocks into separate entries, capturing per-program lane counts.
  - Keep original program text for provenance (`programNameOriginal`).
- Normalize: the pipeline maps `programName` to a canonical label (taxonomy rules) while preserving the original. It also writes `poolNameTitle` and `poolShortName` using `public/data/pools.json`.
- Render: the UI provides filters by program, pool, day, and time. The schedules page shows day-by-day program blocks with time ranges and per-program lane counts.
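As an illustration of the extract step (a sketch, not the project's actual code: the model id, prompt wording, and text-based PDF input are assumptions), the AI SDK's `generateObject` can enforce a Zod schema like this:

```ts
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// Times like "9:00a" or "2:15p" (h:mm followed by a or p).
const time = z.string().regex(/^\d{1,2}:\d{2}[ap]$/);

const programEntry = z.object({
  programNameOriginal: z.string(),
  dayOfWeek: z.string(),
  startTime: time,
  endTime: time,
  lanes: z.number().int().optional(),
  notes: z.string().optional(),
});

const poolSchedule = z.object({
  poolName: z.string(),
  programs: z.array(programEntry),
});

// pdfText: text already extracted from the PDF (how the project actually
// feeds PDF content to the model is not shown in this README).
export async function extractSchedule(pdfText: string) {
  const { object } = await generateObject({
    model: google("gemini-1.5-flash"), // model id is an assumption
    schema: poolSchedule,
    system:
      "Extract the pool schedule. Use h:mm[a|p] times (e.g. 9:00a, 2:15p). " +
      "Split stacked multi-program blocks into separate entries with " +
      "per-program lane counts, and keep the original program text.",
    prompt: pdfText,
  });
  return object; // validated against poolSchedule
}
```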
## Pool Name Mapping

`public/data/pools.json` maps canonical pool titles to the short names used in the UI. Example:

```json
{
  "Mission Aquatic Center": { "shortName": "Mission" }
}
```

- During processing, the pipeline computes `poolNameTitle` from the PDF name (title case) and looks up `poolShortName` from this mapping.
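A minimal sketch of that computation, assuming the mapping shape shown above; the function names are illustrative:

```ts
// Requires "resolveJsonModule" in tsconfig; path is an assumption.
import pools from "../public/data/pools.json";

// e.g. "MARTIN LUTHER KING JR POOL" -> "Martin Luther King Jr Pool"
function toTitleCase(name: string): string {
  return name.toLowerCase().replace(/\b\w/g, (c) => c.toUpperCase());
}

// Look up the UI short name; returns undefined for unmapped pools.
function poolShortName(poolNameTitle: string): string | undefined {
  return (pools as Record<string, { shortName: string }>)[poolNameTitle]
    ?.shortName;
}
```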
## Extraction Cache

- Raw structured extraction for each PDF is cached in `data/extracted/<pdf-base>.json`.
- By default, the pipeline prefers the cache to avoid re-prompting the LLM.
- Force a refresh with:

  ```bash
  REFRESH_EXTRACT=1 npm run process-all-pdfs
  ```
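The cache amounts to a read-through check keyed by the PDF's base name. A sketch under those assumptions, with `extract` standing in for the LLM call:

```ts
import { existsSync } from "node:fs";
import { readFile, writeFile } from "node:fs/promises";
import path from "node:path";

// Read-through cache in data/extracted/; REFRESH_EXTRACT=1 bypasses it.
async function extractWithCache(
  pdfPath: string,
  extract: (p: string) => Promise<unknown>
) {
  const base = path.basename(pdfPath, ".pdf");
  const cachePath = path.join("data", "extracted", `${base}.json`);
  if (process.env.REFRESH_EXTRACT !== "1" && existsSync(cachePath)) {
    return JSON.parse(await readFile(cachePath, "utf8"));
  }
  const result = await extract(pdfPath);
  await writeFile(cachePath, JSON.stringify(result, null, 2));
  return result;
}
```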
## API

- `POST /api/extract-schedule` — runs a one-off extraction (primarily for development) and writes to `public/data/all_schedules.json`.
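For orientation, a minimal App Router handler of that shape; `runExtraction` is hypothetical, and the real handler's internals are not documented here:

```ts
// app/api/extract-schedule/route.ts — shape only, not the actual handler
import { NextResponse } from "next/server";
import { runExtraction } from "@/lib/extract"; // hypothetical helper

export async function POST() {
  // Runs the one-off extraction and writes public/data/all_schedules.json.
  const schedules = await runExtraction();
  return NextResponse.json({ ok: true, pools: schedules.length });
}
```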
## Environment Variables

`.env.local` values:

```bash
GOOGLE_GENERATIVE_AI_API_KEY=your_key_here
# override MLK autodiscovery (optional)
MLK_PDF_URL=https://sfrecpark.org/DocumentCenter/View/25795
```

Optional at runtime:

```bash
# when set to 1, re-extract PDFs even if a cache exists
REFRESH_EXTRACT=1
```
## License

MIT