Friction Log for setting up Osprey

I used Claude Code to run Osprey. Here's a friction log of where I got stuck and what changes I made to get things working:


# Osprey Setup Friction Log

This document records all issues encountered while setting up the Osprey project from scratch on December 12, 2025.

## Environment
- **OS**: macOS (Darwin 24.5.0)
- **Platform**: Apple Silicon (ARM64)
- **Python**: 3.11
- **Date**: December 12, 2025

---

## Issue 1: Postgres 18 Volume Mount Incompatibility

### Problem
When running `docker compose up -d`, the Postgres container immediately crashed with exit code 1.

### Root Cause
Postgres 18 changed its data storage structure to use major-version-specific directory names compatible with `pg_ctlcluster`. The docker-compose.yaml was configured with the old mount point `/var/lib/postgresql/data`, and there was existing data from a previous Postgres version in the volume.

### Error Message
```
Error: in 18+, these Docker images are configured to store database data in a
       format which is compatible with "pg_ctlcluster" (specifically, using
       major-version-specific directory names).

       Counter to that, there appears to be PostgreSQL data in:
         /var/lib/postgresql/data (unused mount/volume)
```

### Solution
1. Updated `docker-compose.yaml` line 261:
   ```yaml
   # Before
   - metadata_data:/var/lib/postgresql/data

   # After
   - metadata_data:/var/lib/postgresql
   ```

2. Removed old volumes with `docker compose down -v`
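
To catch this before starting the stack, a compose file can be scanned for the legacy mount target. The following is a minimal sketch, not part of Osprey: the function name `find_legacy_postgres_mounts` is hypothetical, and the regex only handles unquoted short-syntax volume entries like the one shown above.

```python
import re

# Mount target used by the Postgres Docker images before version 18.
LEGACY_TARGET = "/var/lib/postgresql/data"

# Matches unquoted short-syntax entries such as
# "- metadata_data:/var/lib/postgresql/data" (optionally followed by ":ro").
_VOLUME_ENTRY = re.compile(r"^\s*-\s*([^\s:#]+):(/[^\s:#]*)(?::\w+)?\s*(?:#.*)?$")


def find_legacy_postgres_mounts(compose_text: str) -> list[tuple[int, str]]:
    """Return (line_number, volume_source) for each entry targeting the legacy path."""
    hits = []
    for number, line in enumerate(compose_text.splitlines(), start=1):
        match = _VOLUME_ENTRY.match(line)
        if match and match.group(2).rstrip("/") == LEGACY_TARGET:
            hits.append((number, match.group(1)))
    return hits


# Illustrative input only; a real check would read docker-compose.yaml from disk.
sample = (
    "services:\n"
    "  postgres:\n"
    "    volumes:\n"
    "      - metadata_data:/var/lib/postgresql/data\n"
)
for line_number, source in find_legacy_postgres_mounts(sample):
    print(f"line {line_number}: volume '{source}' still targets {LEGACY_TARGET}")
```

The check is purely textual, so it reports any service that mounts that path, not only Postgres 18 containers.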

### Files Modified
- `docker-compose.yaml:261`

---

## Issue 2: osprey-ui-api Container Startup Failures

### Problem
The `osprey-ui-api` container kept crashing on startup with database connection errors.

### Root Cause
The UI API service had a basic `depends_on` configuration that didn't wait for Postgres to be healthy before starting. It attempted to connect to Postgres before the database was ready to accept connections.

### Error Message
```
psycopg2.OperationalError: could not translate host name "postgres" to address: Name or service not known
```

### Solution
Updated `docker-compose.yaml` lines 144-156 to use health check conditions:
```yaml
# Before
depends_on:
  - osprey-worker
  - druid-broker
  - postgres
  - snowflake-id-worker
  - bigtable
  - bigtable-initializer

# After
depends_on:
  osprey-worker:
    condition: service_started
  druid-broker:
    condition: service_started
  postgres:
    condition: service_healthy
  snowflake-id-worker:
    condition: service_healthy
  bigtable:
    condition: service_healthy
  bigtable-initializer:
    condition: service_completed_successfully
```
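
As a complement to the compose-level conditions, a service can also retry the connection itself at startup. This is a minimal sketch under my own assumptions, not code from Osprey: `wait_for_port` is a hypothetical helper, and a successful TCP connection only shows that the port is open, which is a weaker signal than a real database health check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # OSError covers refused connections, timeouts, and name-resolution
            # failures (socket.gaierror), the last being the failure class seen above.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would run something like `wait_for_port("postgres", 5432)` before opening the first database connection; the host name and port here are assumed from the error message and the Postgres default, not read from the compose file.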

### Files Modified
- `docker-compose.yaml:144-156`

---

## Issue 3: UI "No data for selected features"

### Problem
Events were appearing in the Event Stream panel on the right side of the UI, but each event showed "No data for selected features" instead of the actual field values.

### Root Cause
**UX Issue**: The Osprey UI requires users to explicitly select which fields they want to display in the event stream. This is not immediately obvious to new users.

### Solution
Click the **"Select Summary Features"** button (top right of Event Stream panel) and select fields to display:
- ContainsHello
- PostText
- UserId
- EventType

### Recommendation for Improvement
Consider either:
1. Pre-selecting common fields by default
2. Adding a tooltip/hint when events show "No data for selected features"
3. Auto-selecting all available fields on first use

---

## Issue 4: Timeseries Chart "No data available"

### Problem
The Timeseries Chart consistently showed "No data available" even though the Event Stream showed matching events.

### Root Cause
**Time Range Mismatch**: The query time range ended at 1:44pm EST, but data with the `ContainsHello` field only started being ingested at 1:50pm EST (after we fixed some conflicts between local rule experiments and the restored example data). The query window therefore closed 6 minutes before any matching data existed.

### Additional Factor
**Browser Caching**: Earlier queries that returned empty results were cached, so even after data became available, the cached empty response was being shown.

### Solution
1. Adjust the end time in the date range selector to be after 1:50pm EST (current time)
2. Hard refresh the page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows)
3. Resubmit the query

### Verification
Direct Druid query confirmed data existed:
```json
{
  "total": 189,
  "with_hello": 60
}
```
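
For anyone who wants to repeat this kind of check, here is a rough sketch of a count query against Druid's SQL HTTP endpoint (`/druid/v2/sql`). It is not the exact query I ran. The broker address, the datasource name `execution_results` (guessed from the spec file name), and the "field is not null" condition are all assumptions to adjust for your setup.

```python
import json
import urllib.request

# Assumptions: broker address and datasource name are placeholders, not taken from the log.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"
DATASOURCE = "execution_results"


def build_count_query(datasource: str, field: str, start_iso: str, end_iso: str) -> dict:
    """Build a Druid SQL payload counting all rows and rows where `field` is populated."""
    # Values are interpolated directly, so use this only with trusted, local inputs.
    sql = (
        f'SELECT COUNT(*) AS "total", '
        f'COUNT(*) FILTER (WHERE "{field}" IS NOT NULL) AS "with_field" '
        f'FROM "{datasource}" '
        f"WHERE __time >= TIME_PARSE('{start_iso}') AND __time < TIME_PARSE('{end_iso}')"
    )
    return {"query": sql}


def run_query(payload: dict, url: str = DRUID_SQL_URL) -> list:
    """POST the payload to the Druid SQL endpoint and return the decoded JSON rows."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative window in UTC; 1:50pm EST corresponds to 18:50 UTC.
payload = build_count_query(DATASOURCE, "ContainsHello", "2025-12-12T18:00:00Z", "2025-12-12T19:00:00Z")
print(payload["query"])
```

Passing `payload` to `run_query` requires a reachable broker. Comparing the counts for a window ending before and after 1:50pm EST would make the time range mismatch visible directly.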

### Recommendation for Improvement
1. Default to "now" for end time instead of a fixed historical timestamp
2. Add a "Last hour" or "Last 30 minutes" quick select option
3. Show a more helpful message when no data is found (e.g., "No data in selected time range")

---

## Issue 5: Platform Architecture Warnings

### Problem
Multiple Docker containers showed platform mismatch warnings during startup:

```
The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)
```

### Affected Images
- `gcr.io/google.com/cloudsdktool/cloud-sdk:latest` (bigtable)
- `apache/druid:34.0.0` (all druid containers)
- `ghcr.io/ayubun/snowflake-id-worker:0`

### Impact
**Non-blocking**: The containers run successfully under emulation, but there may be performance implications.

### Recommendation for Improvement
Consider providing ARM64 native images or documenting the expected platform in the README.

---

## Summary of Changes Required

### Files Modified
1. `docker-compose.yaml`
   - Line 261: Changed Postgres volume mount
   - Lines 144-156: Added health check conditions for osprey-ui-api dependencies

2. `druid/specs/execution_results.json`
   - Line 8: Changed offset reset from "latest" to "earliest"

### Files Restored (from git)
- `example_rules/main.sml`
- `example_plugins/src/register_plugins.py`
- `example_rules/config/labels.yaml`


---

## Time to Resolution

| Issue | Discovery | Resolution | Time Spent |
|-------|-----------|------------|------------|
| Postgres 18 mount | Immediate | 5 mins | 5 mins |
| UI API crashes | 2 mins | 3 mins | 5 mins |
| Wrong rules loaded | 10 mins | 5 mins | 15 mins |
| Druid schema | 5 mins | 10 mins | 15 mins |
| UI features | 2 mins | 1 min | 3 mins |
| Time range | 5 mins | 2 mins | 7 mins |
| **Total** | | | **~50 mins** |

---

## Positive Notes

### What Went Well
1. **Dependencies installed cleanly** - `uv sync` worked perfectly on first try
2. **Pre-commit hooks** - Installed without issues
3. **Test data generator** - Started and worked immediately once rules were fixed
4. **Documentation** - README.md was clear and accurate for the basic setup
5. **Error messages** - Most errors (especially Postgres) had clear, actionable error messages

### Infrastructure Highlights
- Docker Compose setup is well-structured
- Health checks are properly configured (once we used them)
- Druid ingestion works reliably with schema discovery
- Worker processes events correctly once rules are loaded

---

## Recommendations

### For New Users
1. **Always run `git status`** before assuming the repository is clean
2. **Check docker logs** (`docker compose logs <service>`) when services fail
3. **Allow 30-60 seconds** after restarting services for Druid to ingest new schema
4. **Use "now" or recent timestamps** for query end times, not historical dates

### For the Project
1. **Add a quickstart troubleshooting section** to README covering:
   - Postgres volume issues on upgrade
   - How to reset Druid schema
   - Time range configuration in UI

2. **Improve onboarding UX**:
   - Pre-select common fields in Event Stream
   - Default time ranges to "last hour" instead of fixed dates
   - Add inline help for "No data for selected features"

3. **Add a setup verification script** that:
   - Checks if services are healthy
   - Verifies data is flowing through the pipeline
   - Confirms Druid has ingested recent data

4. **Document platform compatibility** for ARM64/M-series Macs

5. **Warn about uncommitted changes** in the README, since local rule edits are mounted into the containers. Note that `.dockerignore` patterns only filter the image build context, so they would not keep bind-mounted files out of a running container.
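
As a starting point for the verification script in item 3, here is a minimal sketch that covers only the first bullet (service health). It assumes `docker compose ps --format json` reports `Service`, `State`, `Health`, and `ExitCode` fields, which may vary by Compose version. The function names are hypothetical, and the checks for data flow and Druid ingestion are not covered.

```python
import json
import subprocess


def parse_compose_ps(output: str) -> list[dict]:
    """Parse `docker compose ps --format json` output.

    Depending on the Compose version the output is either one JSON array
    or one JSON object per line, so both shapes are handled.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(containers: list[dict]) -> list[str]:
    """Return services that are not running or report a health status other than healthy."""
    problems = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")
        # One-shot containers (e.g. initializers) that exited with code 0 are fine.
        if state == "exited" and container.get("ExitCode") == 0:
            continue
        if state != "running" or health not in ("", "healthy"):
            problems.append(container.get("Service", container.get("Name", "?")))
    return problems


def check_stack() -> list[str]:
    """Run docker compose in the current directory and list services needing attention."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--all", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return unhealthy_services(parse_compose_ps(result.stdout))
```

Calling `check_stack()` from the repository root returns an empty list when every container is running (and healthy, where a health check is defined) or has exited cleanly.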


