36 changes: 36 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,36 @@
name: Tests

on:
push:
branches: ['**']
pull_request:
branches: [main]
workflow_dispatch:

jobs:
tests:
name: Run Tests
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: 'pip'

- name: Install dependencies
run: pip install -r requirements-dev.txt

- name: Run tests with coverage
run: pytest --cov=. --cov-report=term-missing --cov-report=xml -v

- name: Upload coverage report
if: always()
uses: actions/upload-artifact@v4
with:
name: coverage-report
path: coverage.xml
retention-days: 7
126 changes: 126 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,126 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Sync2Cal Events API — a Python FastAPI application that converts data from various websites into calendar events. Serves JSON and ICS feeds compatible with Google Calendar, Apple Calendar, Outlook, etc. Backend for sync2cal.com.

## Commands

```bash
# Development
pip install -r requirements.txt # Install dependencies
uvicorn main:app --reload # Dev server (localhost:8000)
uvicorn main:app --host 0.0.0.0 # Production server

# Documentation
# Interactive API docs at http://localhost:8000/docs (auto-generated by FastAPI)
```

## Architecture

### Tech Stack
- **Python 3.11+** with **FastAPI** + **Uvicorn**
- **Requests** for HTTP calls, **BeautifulSoup4** + **lxml** for web scraping
- **gspread** + **google-auth** for Google Sheets integration
- **python-dotenv** for environment variable management

### Key Directories
- `base/` — Core framework: `Event` dataclass, `CalendarBase`, `IntegrationBase`, `mount_integration_routes()`
- `integrations/` — Individual source integrations (11 total, listed below)
- `docs/` — API endpoint documentation
- `.cursor/rules/` — Cursor IDE rules (project conventions)

### Plugin Architecture
Every integration follows the same pattern:
1. A `CalendarBase` subclass with `fetch_events(*args, **kwargs) -> List[Event]`
2. An `IntegrationBase` subclass registered in `main.py`
3. `mount_integration_routes()` auto-creates `GET /<id>/events` endpoint from `fetch_events` signature
4. The `ics` query param (default `true`) toggles ICS vs JSON response
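The pattern above can be sketched with stand-in classes (the real `base/` signatures are not reproduced here, so field defaults and the method shape are assumptions for illustration):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List

# Stand-in for base/models.py's Event dataclass (same fields as the
# Data Model section; defaults here are assumptions).
@dataclass
class Event:
    uid: str
    title: str
    start: datetime
    end: datetime
    all_day: bool = False
    description: str = ""
    location: str = ""
    extra: Dict = field(default_factory=dict)

class ExampleCalendar:
    """A CalendarBase-style class: one fetch_events() returning Events."""
    def fetch_events(self, limit: int = 5) -> List[Event]:
        # A real integration would call an API or scrape a site here.
        start = datetime(2026, 2, 1, 20, 0)
        return [
            Event(
                uid=f"example-{i}",
                title=f"Example event {i}",
                start=start + timedelta(days=i),
                end=start + timedelta(days=i, hours=1),
            )
            for i in range(limit)
        ]
```

Because `fetch_events` takes only primitives with defaults, `mount_integration_routes()` can expose `limit` directly as a query parameter.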

### Integrations
| ID | Source | Method | Credentials Required |
|----|--------|--------|---------------------|
| twitch | Twitch | API | Yes (TWITCH_CLIENT_ID/SECRET) |
| google-sheets | Google Sheets | API | Yes (service account JSON) |
| thetvdb | TheTVDB | API | Yes (API key + bearer token) |
| sportsdb | TheSportsDB | API | Yes (SPORTSDB_API_KEY) |
| daily-weather-forecast | OpenWeatherMap | API | Yes (OPENWEATHERMAP_API_KEY) |
| investing | Investing.com | Scraping | No |
| imdb | IMDb | Scraping | No |
| moviedb | TheMovieDB | Scraping | No |
| wwe | WWE | Scraping | No |
| shows | TVInsider | Scraping | No |
| releases | Releases.com | Scraping | No |

### Data Model
```python
@dataclass
class Event:
uid: str # Unique identifier
title: str # Event name
start: datetime # Start datetime
end: datetime # End datetime
all_day: bool # All-day event flag
description: str # Event description
location: str # Event location
extra: Dict # Provider-specific metadata
```

### API Pattern
All endpoints follow: `GET /<integration-id>/events?<params>&ics=true|false`
- Integration ID uses hyphens (e.g., `google_sheets` becomes `/google-sheets/events`)
- `fetch_events` parameters become query params automatically
- `ics=true` (default): returns `text/plain` ICS content
- `ics=false`: returns JSON list of Event objects

## Patterns & Conventions

### Adding New Integrations
1. Create `integrations/<name>.py` with `<Name>Calendar(CalendarBase)` and `<Name>Integration(IntegrationBase)`
2. Implement `fetch_events()` — all params must be JSON-serializable primitives with defaults
3. Register in `main.py`: create integration instance, mount routes via loop
4. Add any required env vars to `env.template`

### HTTP Requests
- Always set explicit timeouts (10-20s) on `requests.get/post`
- Set `User-Agent` header when scraping websites
- Use `response.raise_for_status()` for error detection
- Wrap in try/except, raise `HTTPException` with appropriate status codes (400, 401, 429, 500, 502)
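A minimal helper following these conventions might look like this (a sketch, not the project's actual code; the `HTTPException` shim and the injectable `getter` exist only so the snippet is self-contained and testable without network access):

```python
try:
    from fastapi import HTTPException
except ImportError:  # shim so the sketch runs without FastAPI installed
    class HTTPException(Exception):
        def __init__(self, status_code: int, detail: str = ""):
            self.status_code, self.detail = status_code, detail

def fetch_json(url: str, timeout: float = 15, getter=None) -> dict:
    """Explicit timeout, User-Agent, raise_for_status, HTTPException on error."""
    if getter is None:
        import requests  # lazy import; real code would import at module top
        getter = requests.get
    headers = {"User-Agent": "Mozilla/5.0 (compatible; Sync2Cal/1.0)"}
    try:
        resp = getter(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except HTTPException:
        raise
    except Exception as exc:
        # Upstream failure surfaces as 502 Bad Gateway to our own clients.
        raise HTTPException(status_code=502, detail=f"Upstream error: {exc}")
```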

### Scraping
- Use BeautifulSoup with `lxml` parser
- Skip individual items that fail to parse (don't crash the whole request)
- Construct deterministic UIDs (e.g., `tmdb-{title}-{date}`)

### All-Day Events
- Set `all_day=True` and `end = start + timedelta(days=1)`
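Concretely (illustrative values; the next-day end matches RFC 5545's exclusive `DTEND` for all-day events):

```python
from datetime import datetime, timedelta

start = datetime(2026, 3, 14)      # midnight on the event's date
end = start + timedelta(days=1)    # exclusive end: start of the next day
```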

### ICS Generation
- `utils.generate_ics()` handles RFC 5545 compliance: line folding, text escaping, VTIMEZONE
- `utils.make_slug()` for URL-friendly text conversion

## Critical Files
- `main.py` — App setup, CORS config, integration registration loop
- `base/routes.py` — `mount_integration_routes()` — the glue that turns `fetch_events` into API endpoints
- `base/models.py` — `Event` dataclass
- `base/calendar.py` — `CalendarBase` abstract class
- `base/integration.py` — `IntegrationBase` abstract class
- `utils.py` — `generate_ics()` and `make_slug()` utilities
- `env.template` — Required environment variables reference

## Environment Variables
See `env.template` for the full list. Key ones:
- `TWITCH_CLIENT_ID` / `TWITCH_CLIENT_SECRET` — Twitch API
- `GOOGLE_SHEETS_SERVICE_ACCOUNT_FILE` — Path to service account JSON
- `THE_TVDB_API_KEY` / `THE_TVDB_BEARER_TOKEN` — TheTVDB API
- `SPORTSDB_API_KEY` — TheSportsDB API
- `OPENWEATHERMAP_API_KEY` — Weather integration
- `CORS_ORIGINS` — Comma-separated allowed origins (defaults to sync2cal.com)

## Gotchas
- **No deployment config**: No Dockerfile, Procfile, or railway.json exists yet.
- **CORS with credentials**: `allow_credentials=True` means `allow_origins=["*"]` is not allowed. Must specify exact origins via `CORS_ORIGINS` env var.
- **Scraping fragility**: IMDb, TMDB, and other scraped sources may break if the site changes its HTML structure.
- **`multi_calendar`**: Only Twitch uses `multi_calendar=True`. The `master_csv()` method on IntegrationBase is a TODO stub.
145 changes: 145 additions & 0 deletions docs/plans/2026-02-01-scraper-consolidation-design.md
@@ -0,0 +1,145 @@
# Scraper Consolidation Design

**Goal:** Retire `sync2cal-custom-scraper` and consolidate all ICS feed generation into `S2C-events-api`, eliminating a redundant Railway service.

**Date:** 2026-02-01

---

## Background

Two services produce ICS calendar feeds from external sources:

| Service | Repo | Domain | Status |
|---------|------|--------|--------|
| custom-scraper | sync2cal-custom-scraper | `sync2cal-scraper.up.railway.app` | Private, no tests, no CI |
| events-api | S2C-events-api | `api.sync2cal.com` | Public, 215 tests, 92% coverage, CI |

All 10 shared integrations are **identical code** (copy-pasted). Events-api additionally has a weather integration, a better architecture (loop-based routing, CORS), and a contributor guide.

**7,959 categories** in the database have SOURCE URLs pointing to custom-scraper. Zero point to events-api.

## Problem: URL Patterns Don't Match

A simple domain swap won't work. The two services use different URL structures:

| Integration | Count | custom-scraper URL | events-api URL |
|---|---|---|---|
| TheTVDB | 7,849 | `/thetvdb/series/{id}/episodes.ics` | `/thetvdb/events?series_id={id}` |
| TV Shows | 41 | `/tv/platform/{slug}.ics` or `/tv/genre/{slug}.ics` | `/shows/events?mode=platform&slug={slug}` or `?mode=genre&slug={slug}` |
| SportsDB | 20 | `/sportsdb/league/{id}.ics` or `/sportsdb/team/{id}.ics` | `/sportsdb/events?mode=league&id={id}` or `?mode=team&id={id}` |
| Google Sheets | 15 | `/sheets/events.ics?sheet_url=...` | `/google-sheets/events?sheet_url=...` |
| Investing | 10 | `/investing/earnings.ics` or `/investing/ipo.ics` | `/investing/events?kind=earnings` or `?kind=ipo` |
| Yahoo Finance | 8 | `/yahoo/generate_earnings_ics?k=100&ticker=NVDA` | **Does not exist** (broken anyway — expired cookies) |
| Releases | 7 | `/releases/generate_game_ics` | `/releases/events?kind=games` |
| Twitch | 5 | `/twitch/{name}/schedule.ics` | `/twitch/events?streamer_name={name}` |
| IMDb | 4 | `/imdb/movies.ics?genre=...&actor=...&country=...` | `/imdb/events?genre=...&actor=...&country=...` |

## Approach: Database Migration Script

Write a Python migration script that:

1. Connects to the production PostgreSQL database
2. Reads all categories with SOURCE URLs containing `sync2cal-scraper.up.railway.app`
3. Rewrites each URL to the equivalent `api.sync2cal.com` endpoint
4. Updates the database in a single transaction (atomic — all or nothing)
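The four steps above can be sketched against any DB-API connection. The table and column names (`categories`, `source`) are assumptions, since the real schema is not shown; placeholders use `?` as in `sqlite3`, whereas `psycopg2` would use `%s`:

```python
import re

OLD_HOST = "sync2cal-scraper.up.railway.app"

def migrate(conn, rules) -> int:
    """Rewrite matching SOURCE URLs in a single transaction; return row count."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, source FROM categories WHERE source LIKE ?",
        (f"%{OLD_HOST}%",),
    )
    updated = 0
    try:
        for row_id, url in cur.fetchall():
            new_url = url
            for pattern, replacement in rules:
                new_url = re.sub(pattern, replacement, new_url)
            if new_url != url:
                print(f"{url} -> {new_url}")  # log every change for rollback
                cur.execute(
                    "UPDATE categories SET source = ? WHERE id = ?",
                    (new_url, row_id),
                )
                updated += 1
        conn.commit()  # single commit: all rows or none
    except Exception:
        conn.rollback()
        raise
    return updated
```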

### URL Rewrite Rules

```python
REWRITE_RULES = [
# TheTVDB: /thetvdb/series/{id}/episodes.ics -> /thetvdb/events?series_id={id}
(r'sync2cal-scraper\.up\.railway\.app/thetvdb/series/(\d+)/episodes\.ics',
r'api.sync2cal.com/thetvdb/events?series_id=\1'),

# TV Shows platform: /tv/platform/{slug}.ics -> /shows/events?mode=platform&slug={slug}
(r'sync2cal-scraper\.up\.railway\.app/tv/platform/([^/.]+)\.ics',
r'api.sync2cal.com/shows/events?mode=platform&slug=\1'),

# TV Shows genre: /tv/genre/{slug}.ics -> /shows/events?mode=genre&slug={slug}
(r'sync2cal-scraper\.up\.railway\.app/tv/genre/([^/.]+)\.ics',
r'api.sync2cal.com/shows/events?mode=genre&slug=\1'),

# SportsDB league: /sportsdb/league/{id}.ics -> /sportsdb/events?mode=league&id={id}
(r'sync2cal-scraper\.up\.railway\.app/sportsdb/league/(\d+)\.ics',
r'api.sync2cal.com/sportsdb/events?mode=league&id=\1'),

# SportsDB team: /sportsdb/team/{id}.ics -> /sportsdb/events?mode=team&id={id}
(r'sync2cal-scraper\.up\.railway\.app/sportsdb/team/(\d+)\.ics',
r'api.sync2cal.com/sportsdb/events?mode=team&id=\1'),

# Google Sheets: /sheets/events.ics?sheet_url=... -> /google-sheets/events?sheet_url=...
(r'sync2cal-scraper\.up\.railway\.app/sheets/events\.ics\?',
r'api.sync2cal.com/google-sheets/events?'),

# Investing: /investing/{kind}.ics -> /investing/events?kind={kind}
(r'sync2cal-scraper\.up\.railway\.app/investing/earnings\.ics',
r'api.sync2cal.com/investing/events?kind=earnings'),
(r'sync2cal-scraper\.up\.railway\.app/investing/ipo\.ics',
r'api.sync2cal.com/investing/events?kind=ipo'),

# Releases: /releases/generate_game_ics -> /releases/events?kind=games
(r'sync2cal-scraper\.up\.railway\.app/releases/generate_game_ics',
r'api.sync2cal.com/releases/events?kind=games'),

# Twitch: /twitch/{name}/schedule.ics -> /twitch/events?streamer_name={name}
(r'sync2cal-scraper\.up\.railway\.app/twitch/([^/]+)/schedule\.ics',
r'api.sync2cal.com/twitch/events?streamer_name=\1'),

# IMDb: /imdb/movies.ics?... -> /imdb/events?...
(r'sync2cal-scraper\.up\.railway\.app/imdb/movies\.ics\?',
r'api.sync2cal.com/imdb/events?'),
]
```
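As a sanity check, applying the first rule with `re.sub` to a sample URL (the series ID is made up):

```python
import re

pattern = r'sync2cal-scraper\.up\.railway\.app/thetvdb/series/(\d+)/episodes\.ics'
replacement = r'api.sync2cal.com/thetvdb/events?series_id=\1'

old = "https://sync2cal-scraper.up.railway.app/thetvdb/series/81189/episodes.ics"
new = re.sub(pattern, replacement, old)
# new == "https://api.sync2cal.com/thetvdb/events?series_id=81189"
```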

### Yahoo Finance (8 categories)

The Yahoo Finance integration is broken (hardcoded expired cookies). Options:
1. **Remove the SOURCE** from these 8 categories so the scheduled job skips them
2. **Replace with Investing.com earnings** if that integration supports per-ticker queries
3. **Leave broken** — the scheduled job already handles download failures gracefully

Recommendation: Remove the SOURCE field. These can be re-enabled if/when a working earnings integration is built.

## Migration Steps

### Pre-migration (verify)

1. Confirm events-api endpoints work for each integration type by testing one URL per integration against `api.sync2cal.com`
2. Compare ICS output between old and new endpoints to verify identical calendar data

### Execute migration

3. Run the migration script against production database (during the 14-hour gap between scheduled job runs)
4. Verify row count matches: 7,959 rows updated

### Post-migration (validate)

5. Wait for next scheduled job run
6. Check Discord report for success/failure counts — should match pre-migration baseline
7. Spot-check a few categories (TheTVDB, SportsDB, Sheets) to confirm events populated

### Retire custom-scraper

8. Keep custom-scraper running for 1 week as a safety net (in case a rollback is needed)
9. After 1 week with no issues, remove custom-scraper from Railway
10. Archive the sync2cal-custom-scraper repo on GitHub

## Rollback Plan

If the migration fails or the scheduled job reports increased failures:

```sql
-- Reverse the migration (swap api.sync2cal.com back to sync2cal-scraper.up.railway.app)
-- Keep the reverse rewrite rules in the migration script
```

The migration script should log all changes (old URL -> new URL) to enable reversal.

## References to Update

After migration, update these files that reference `sync2cal-scraper.up.railway.app`:
- `sync2cal-ics-version/CLAUDE.md` — Railway architecture table
- `new-baklava/app/admin/scraper/page.tsx` — embedded iframe to scraper docs (change to events-api docs)
- `S2C-events-api/BACKEND_STAGING_SETUP_PLAN.md` — production URL reference
- `S2C-frontend/server/meta/categoryMap.json` — regenerated automatically by prebuild script
10 changes: 8 additions & 2 deletions integrations/google_sheets.py
```diff
@@ -1,4 +1,5 @@
 from fastapi import HTTPException
+import json
 import os
 from base import CalendarBase, Event, IntegrationBase
 from typing import List
@@ -33,8 +34,13 @@ def fetch_events(
         """
         try:
             try:
-                sa_path = os.getenv("GOOGLE_SHEETS_SERVICE_ACCOUNT_FILE", "service_account.json")
-                gc = gspread.service_account(filename=sa_path)
+                sa_json = os.getenv("GOOGLE_SHEETS_SERVICE_ACCOUNT_JSON")
+                if sa_json:
+                    creds = json.loads(sa_json)
+                    gc = gspread.service_account_from_dict(creds)
+                else:
+                    sa_path = os.getenv("GOOGLE_SHEETS_SERVICE_ACCOUNT_FILE", "service_account.json")
+                    gc = gspread.service_account(filename=sa_path)
             except Exception as auth_error:
                 raise HTTPException(
                     status_code=500,
```