
Smart Pre-Processing Pipeline: Validation, Triage, and Corrections #110

@deucebucket

Description

Summary

A comprehensive pre-processing system that validates files, triages folders, and pushes corrections - all while preserving full automation.

Core philosophy: Don't trust garbage folders, DO trust audio. And if we screw up, fix it.


Part 1: File Validation (Local, Fast)

Before wasting Skaldleita's time, verify files are actually valid audiobooks.

What we check

import json
import subprocess

def validate_audio_file(path: str) -> tuple[bool, str]:
    """Quick validation using ffprobe/ffmpeg"""
    try:
        result = subprocess.run([
            'ffprobe', '-v', 'error',
            '-show_entries', 'format=duration,size:stream=codec_type',
            '-of', 'json', path
        ], capture_output=True, timeout=30)
        
        if result.returncode != 0:
            return False, "corrupt_or_unreadable"
        
        data = json.loads(result.stdout)
        duration = float(data['format'].get('duration', 0))
        size = int(data['format'].get('size', 0))
        streams = data.get('streams', [])
        
        # No audio stream: mislabeled video or an empty container
        if not any(s.get('codec_type') == 'audio' for s in streams):
            return False, "no_audio_stream"
        if duration == 0:
            return False, "no_duration_truncated"
        if duration < 600:  # < 10 minutes
            return False, "too_short_not_audiobook"
        if size < 1_000_000:  # < 1MB
            return False, "too_small"
            
        # Try decoding the last 10 seconds (catches truncated downloads)
        seek_result = subprocess.run([
            'ffmpeg', '-v', 'error', '-sseof', '-10',
            '-i', path, '-f', 'null', '-'
        ], capture_output=True, timeout=30)
        
        if seek_result.returncode != 0:
            return False, "truncated_cant_seek_end"
            
        return True, "valid"
    except Exception as e:
        return False, f"validation_error_{type(e).__name__}"

What this catches

| Check | Problem Detected |
| --- | --- |
| ffprobe fails | Completely corrupt, not audio, wrong format |
| Duration = 0 | Truncated/incomplete download |
| Duration < 10 min | Sample, trailer, not a full audiobook |
| Size < 1 MB | Empty or stub file |
| Can't seek to end | Download interrupted mid-file |
| No audio stream | Video mislabeled, container without audio |

Result

  • Valid → Continue to triage
  • Invalid → Quarantine with reason, don't process
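The invalid branch needs somewhere to put files. A minimal sketch of a quarantine step - the `quarantine` helper and the one-folder-per-reason layout are assumptions, not something the issue specifies:

```python
import shutil
from pathlib import Path

def quarantine(path: str, reason: str, quarantine_root: str = "quarantine") -> str:
    """Move an invalid file into quarantine/<reason>/ and return its new path."""
    dest_dir = Path(quarantine_root) / reason  # one folder per failure reason
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(path).name
    shutil.move(str(path), str(dest))
    return str(dest)
```

Keeping the failure reason in the path means the quarantine folder doubles as a report: a glance shows how many files were truncated vs. too short.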

Part 2: Folder Triage (Local, Fast)

Categorize folders by "cleanliness" to decide processing strategy.

Detection patterns

import os
import re

MESSY_PATTERNS = [
    # Scene release tags
    r'\{[a-z]+\}',           # {mb}, {cbt}
    r'\[[A-Z0-9]+\]',        # [FLAC], [MP3]
    r'\([A-Za-z]+\)',        # (Thorne), (narrator)
    
    # Embedded dates/numbers
    r'^\d{4}\s*-',           # 2023 -
    r'\d{2}\.\d{2}\.\d{2}',  # 01.10.42
    
    # Size/quality markers
    r'\d+k\b',               # 62k, 128k
    r'\d+kbps',              # 64kbps
    r'\bHQ\b|\bLQ\b',        # Quality markers
    
    # Torrent/scene indicators
    r'-[A-Z]{2,4}$',         # -TEAM suffix
    r'\.com\b',              # Website in name
]

GARBAGE_PATTERNS = [
    r'^[a-f0-9]{12,}$',      # Hash-only names
    r'^[\d\s\-\.]+$',        # Numbers only
    r'^(New Folder|tmp|downloads?|torrents?)$',
]

def triage_folder(path: str) -> str:
    folder_name = os.path.basename(path)
    
    for pattern in GARBAGE_PATTERNS:
        if re.match(pattern, folder_name, re.I):
            return "garbage"  # 🔴
    
    for pattern in MESSY_PATTERNS:
        if re.search(pattern, folder_name):
            return "messy"    # 🟡
    
    return "clean"            # 🟢

Processing strategy by category

| Category | Path Parsing | Skaldleita | Confidence Modifier |
| --- | --- | --- | --- |
| 🟢 Clean | Use as hints | Normal | None |
| 🟡 Messy | SKIP | Audio-only, no folder hints | None |
| 🔴 Garbage | SKIP | Audio-only, no folder hints | -10% (expect harder match) |

Key insight: Messy folders don't stop automation - they just change the strategy to "trust audio only."
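One way to wire the triage category into the downstream call, sketched with illustrative field names (`use_path_hints`, `confidence_modifier` - the real Skaldleita request shape isn't defined in this issue):

```python
def build_request(path: str, category: str) -> dict:
    """Translate a triage category into a processing request (field names assumed)."""
    request = {"path": path, "use_path_hints": False, "confidence_modifier": 0.0}
    if category == "clean":
        request["use_path_hints"] = True        # folder name is trustworthy
    elif category == "garbage":
        request["confidence_modifier"] = -0.10  # expect a harder match
    # "messy" keeps the defaults: audio-only, no modifier
    return request
```

The point is that every category still produces a request - nothing halts for human input.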


Part 3: Push Corrections (Distributed Fix)

When Skaldleita discovers it gave wrong info, push corrections to opted-in users.

Why this matters

  1. Skaldleita tells User A: "Book X narrated by Ray Porter"
  2. User A's library gets this metadata
  3. Later, voice fingerprinting reveals: Ray Porter doesn't narrate Book X
  4. User A now has wrong data we gave them

Skaldleita side

-- Log what we told users (for correction matching)
CREATE TABLE identification_log (
    id SERIAL PRIMARY KEY,
    audio_hash TEXT,              -- Hash of the audio sample we analyzed
    skaldleita_id TEXT,           -- The SL_ID we assigned
    result_json JSONB,            -- Full result we returned
    created_at TIMESTAMP DEFAULT NOW()
);

-- Store corrections when we discover mistakes
CREATE TABLE corrections (
    id SERIAL PRIMARY KEY,
    affected_sl_ids TEXT[],       -- Which SL_IDs are affected
    affected_audio_hashes TEXT[], -- Which audio hashes are affected
    original_result JSONB,        -- What we said (wrong)
    corrected_result JSONB,       -- What it should be
    reason TEXT,                  -- "narrator_mismatch", "wrong_book", "merged_duplicate"
    created_at TIMESTAMP DEFAULT NOW()
);

-- API endpoint
GET /api/corrections?since=2026-02-01T00:00:00Z
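For concreteness, here is what one correction record from that endpoint might look like - the field names mirror the columns of the corrections table, but the exact JSON shape (and the narrator values) are illustrative:

```python
# Illustrative correction record; keys mirror the corrections table columns
sample_correction = {
    "id": 42,
    "affected_sl_ids": ["SL_000123"],
    "affected_audio_hashes": ["9f2dab44c1e0"],
    "original_result": {"narrator": "Ray Porter"},
    "corrected_result": {"narrator": "R.C. Bray"},
    "reason": "narrator_mismatch",
    "created_at": "2026-02-02T12:00:00Z",
}
```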

Library Manager side

# Settings
{
    "receive_corrections": true,      # Opt-in
    "auto_apply_corrections": false,  # Auto-fix or queue for review
    "last_correction_check": "2026-02-03T00:00:00Z"
}

from datetime import datetime

# Periodic check (on startup + every 24h)
async def check_for_corrections():
    if not config.get('receive_corrections'):
        return
    
    corrections = await skaldleita.get_corrections(
        since=config['last_correction_check']
    )
    
    for correction in corrections:
        # Find local books that match
        affected = db.query("""
            SELECT * FROM books 
            WHERE skaldleita_id = ANY(%s) 
               OR audio_hash = ANY(%s)
        """, [correction['affected_sl_ids'], correction['affected_audio_hashes']])
        
        for book in affected:
            if config.get('auto_apply_corrections'):
                apply_correction(book, correction)
                log_activity(f"Auto-corrected: {book.title}")
            else:
                queue_for_review(book, correction)
                log_activity(f"Correction available: {book.title}")
    
    config['last_correction_check'] = datetime.utcnow().isoformat()
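A sketch of the `apply_correction` helper used above - merging the corrected fields over the stored metadata and keeping a record of what changed is an assumption about how it should behave:

```python
def apply_correction(metadata: dict, correction: dict) -> dict:
    """Overlay corrected fields onto book metadata, logging each change."""
    corrected = dict(metadata)  # shallow copy; leave the original untouched
    changes = {}
    for key, new_value in correction["corrected_result"].items():
        if corrected.get(key) != new_value:
            changes[key] = {"from": corrected.get(key), "to": new_value}
            corrected[key] = new_value
    corrected["_correction_log"] = changes  # audit trail for the review queue
    return corrected
```

Returning a new dict rather than mutating in place keeps the "queue for review" path cheap: the user can diff the two versions before committing.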

Narrator ID integration

This becomes powerful with voice fingerprinting:

def verify_narrator_match(identification):
    """Cross-check: does this narrator actually narrate this book?"""
    narrator_id = identification.get('narrator_voice_id')
    book_asin = identification.get('asin')
    
    if not narrator_id or not book_asin:
        return True  # Can't verify
    
    narrator_catalog = get_narrator_catalog(narrator_id)
    
    if book_asin not in narrator_catalog:
        # Voice matches narrator X, but narrator X doesn't narrate this book
        create_correction(
            sl_id=identification['sl_id'],
            reason="narrator_catalog_mismatch",
            details=f"Voice={narrator_id}, but ASIN {book_asin} not in their catalog"
        )
        return False
    
    return True
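Stripped to its core, the cross-check is a set-membership test. Here `CATALOGS` is a stand-in for the real narrator database, with made-up voice IDs and ASINs:

```python
# Stub catalog: narrator voice ID -> set of ASINs they narrate (illustrative data)
CATALOGS = {"voice_ray_porter": {"ASIN_A", "ASIN_B"}}

def narrator_narrates(narrator_id: str, asin: str) -> bool:
    """True if the ASIN appears in the narrator's known catalog."""
    return asin in CATALOGS.get(narrator_id, set())
```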

Full Pipeline Flow

┌─────────────────────────────────────────────────────────────────────┐
│  INCOMING BOOK                                                      │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 1: FILE VALIDATION (local, fast)                      │   │
│  │  • ffprobe check                                            │   │
│  │  • Duration/size sanity                                     │   │
│  │  • Can seek to end?                                         │   │
│  │  Result: VALID → continue | INVALID → quarantine            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 2: FOLDER TRIAGE (local, fast)                        │   │
│  │  • Pattern matching on folder name                          │   │
│  │  • 🟢 Clean → use path hints                                │   │
│  │  • 🟡 Messy → skip path, audio-only                         │   │
│  │  • 🔴 Garbage → skip path, audio-only, expect difficulty    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 3: SKALDLEITA PROCESSING                              │   │
│  │  • GPU Whisper transcription                                │   │
│  │  • Database matching                                        │   │
│  │  • Log identification (for future corrections)              │   │
│  │  • Return result with confidence                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 4: PRECOG CONSENSUS (if multiple sources)             │   │
│  │  • Gather votes from all sources                            │   │
│  │  • Weight by reliability                                    │   │
│  │  • Flag disagreements for review                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 5: APPLY TO LIBRARY                                   │   │
│  │  • Update metadata                                          │   │
│  │  • Embed tags (if enabled)                                  │   │
│  │  • Rename/reorganize (if enabled)                           │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

                    LATER (async, periodic)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  CORRECTION CHECK (if opted in)                                     │
│  • Fetch corrections from Skaldleita                                │
│  • Match against local books by SL_ID or audio hash                 │
│  • Auto-apply or queue for review                                   │
└─────────────────────────────────────────────────────────────────────┘

User-Facing Features

Dashboard indicators

Scan Results:
├── 500 files found
├── 485 valid audiobooks
│   ├── 400 clean folders (normal processing)
│   ├── 70 messy folders (audio-only processing)
│   └── 15 garbage folders (audio-only, flagged)
├── 10 quarantined (corrupt/incomplete)
└── 5 skipped (too short, likely samples)

Corrections:
└── 3 corrections available (click to review)
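The counts in that tree can be produced by folding per-file outcomes into a counter. A sketch, with the `(valid, reason, category)` tuple shape assumed rather than taken from the issue:

```python
from collections import Counter

def summarize_scan(results):
    """Fold (valid, reason, category) tuples into dashboard counts."""
    summary = Counter()
    for valid, reason, category in results:
        summary["found"] += 1
        if valid:
            summary["valid"] += 1
            summary[category] += 1   # clean / messy / garbage
        elif reason == "too_short_not_audiobook":
            summary["skipped"] += 1  # likely a sample, not quarantined
        else:
            summary["quarantined"] += 1
    return summary
```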

Settings

| Setting | Description | Default |
| --- | --- | --- |
| validate_files | Check files before processing | true |
| skip_messy_path_parsing | Don't trust messy folder names | true |
| receive_corrections | Get corrections from Skaldleita | true |
| auto_apply_corrections | Apply corrections automatically | false |

Implementation Order

  1. File validation - Quick win, prevents garbage from entering pipeline
  2. Folder triage - Adjusts strategy per-entry, preserves automation
  3. Corrections infrastructure - Skaldleita side first, then LM integration
  4. Narrator ID integration - Once voice fingerprinting is solid

Related

Suggested by: @Merijeek (triage idea), @deucebucket (corrections concept)
