
Smart Pre-Processing Pipeline: Validation, Triage, and Corrections #110

@deucebucket

Description

Summary

A comprehensive pre-processing system that validates files, triages folders, and pushes corrections - all while preserving full automation.

Core philosophy: Don't trust garbage folders, DO trust audio. And if we screw up, fix it.


Part 1: File Validation (Local, Fast)

Before wasting Skaldleita's time, verify files are actually valid audiobooks.

What we check

import json
import subprocess

def validate_audio_file(path: str) -> tuple[bool, str]:
    """Quick validation using ffprobe/ffmpeg"""
    try:
        result = subprocess.run([
            'ffprobe', '-v', 'error',
            '-show_entries', 'format=duration,size:stream=codec_type',
            '-of', 'json', path
        ], capture_output=True, timeout=30)
        
        if result.returncode != 0:
            return False, "corrupt_or_unreadable"
        
        data = json.loads(result.stdout)
        duration = float(data['format'].get('duration', 0))
        size = int(data['format'].get('size', 0))
        streams = data.get('streams', [])
        
        # No audio stream: mislabeled video or an empty container
        if not any(s.get('codec_type') == 'audio' for s in streams):
            return False, "no_audio_stream"
        if duration == 0:
            return False, "no_duration_truncated"
        if duration < 600:  # < 10 minutes
            return False, "too_short_not_audiobook"
        if size < 1_000_000:  # < 1MB
            return False, "too_small"
            
        # Try decoding the last 10 seconds (catches truncated downloads)
        seek_result = subprocess.run([
            'ffmpeg', '-v', 'error', '-sseof', '-10',
            '-i', path, '-f', 'null', '-'
        ], capture_output=True, timeout=30)
        
        if seek_result.returncode != 0:
            return False, "truncated_cant_seek_end"
            
        return True, "valid"
    except Exception as e:
        return False, f"validation_error_{type(e).__name__}"

What this catches

| Check | Problem Detected |
| --- | --- |
| ffprobe fails | Completely corrupt, not audio, wrong format |
| Duration = 0 | Truncated/incomplete download |
| Duration < 10 min | Sample, trailer, not a full audiobook |
| Size < 1 MB | Empty or stub file |
| Can't seek to end | Download interrupted mid-file |
| No audio stream | Video mislabeled, container without audio |

Result

  • Valid → Continue to triage
  • Invalid → Quarantine with reason, don't process
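The invalid branch needs somewhere to put files. A minimal sketch of a quarantine step - the `quarantine` helper and the one-folder-per-reason layout are assumptions, not something the issue specifies:

```python
import shutil
from pathlib import Path

def quarantine(path: str, reason: str, quarantine_root: str = "quarantine") -> str:
    """Move an invalid file into quarantine/<reason>/ and return its new path."""
    dest_dir = Path(quarantine_root) / reason  # one folder per failure reason
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(path).name
    shutil.move(str(path), str(dest))
    return str(dest)
```

Keeping the failure reason in the path means the quarantine folder doubles as a report: a glance shows how many files were truncated vs. too short.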

Part 2: Folder Triage (Local, Fast)

Categorize folders by "cleanliness" to decide processing strategy.

Detection patterns

import os
import re

MESSY_PATTERNS = [
    # Scene release tags
    r'\{[a-z]+\}',           # {mb}, {cbt}
    r'\[[A-Z0-9]+\]',        # [FLAC], [MP3]
    r'\([A-Za-z]+\)',        # (Thorne), (narrator)
    
    # Embedded dates/numbers
    r'^\d{4}\s*-',           # 2023 -
    r'\d{2}\.\d{2}\.\d{2}',  # 01.10.42
    
    # Size/quality markers
    r'\d+k\b',               # 62k, 128k
    r'\d+kbps',              # 64kbps
    r'\bHQ\b|\bLQ\b',        # Quality markers
    
    # Torrent/scene indicators
    r'-[A-Z]{2,4}$',         # -TEAM suffix
    r'\.com\b',              # Website in name
]

GARBAGE_PATTERNS = [
    r'^[a-f0-9]{12,}$',      # Hash-only names
    r'^[\d\s\-\.]+$',        # Numbers only
    r'^(New Folder|tmp|downloads?|torrents?)$',
]

def triage_folder(path: str) -> str:
    folder_name = os.path.basename(path)
    
    for pattern in GARBAGE_PATTERNS:
        if re.match(pattern, folder_name, re.I):
            return "garbage"  # 🔴
    
    for pattern in MESSY_PATTERNS:
        if re.search(pattern, folder_name):
            return "messy"    # 🟡
    
    return "clean"            # 🟢

Processing strategy by category

| Category | Path Parsing | Skaldleita | Confidence Modifier |
| --- | --- | --- | --- |
| 🟢 Clean | Use as hints | Normal | None |
| 🟡 Messy | SKIP | Audio-only, no folder hints | None |
| 🔴 Garbage | SKIP | Audio-only, no folder hints | -10% (expect harder match) |

Key insight: Messy folders don't stop automation - they just change the strategy to "trust audio only."
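One way to wire the triage category into the downstream call, sketched with illustrative field names (`use_path_hints`, `confidence_modifier` - the real Skaldleita request shape isn't defined in this issue):

```python
def build_request(path: str, category: str) -> dict:
    """Translate a triage category into a processing request (field names assumed)."""
    request = {"path": path, "use_path_hints": False, "confidence_modifier": 0.0}
    if category == "clean":
        request["use_path_hints"] = True        # folder name is trustworthy
    elif category == "garbage":
        request["confidence_modifier"] = -0.10  # expect a harder match
    # "messy" keeps the defaults: audio-only, no modifier
    return request
```

The point is that every category still produces a request - nothing halts for human input.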


Part 3: Push Corrections (Distributed Fix)

When Skaldleita discovers it gave wrong info, push corrections to opted-in users.

Why this matters

  1. Skaldleita tells User A: "Book X narrated by Ray Porter"
  2. User A's library gets this metadata
  3. Later, voice fingerprinting reveals: Ray Porter doesn't narrate Book X
  4. User A now has wrong data we gave them

Skaldleita side

-- Log what we told users (for correction matching)
CREATE TABLE identification_log (
    id SERIAL PRIMARY KEY,
    audio_hash TEXT,              -- Hash of the audio sample we analyzed
    skaldleita_id TEXT,           -- The SL_ID we assigned
    result_json JSONB,            -- Full result we returned
    created_at TIMESTAMP DEFAULT NOW()
);

-- Store corrections when we discover mistakes
CREATE TABLE corrections (
    id SERIAL PRIMARY KEY,
    affected_sl_ids TEXT[],       -- Which SL_IDs are affected
    affected_audio_hashes TEXT[], -- Which audio hashes are affected
    original_result JSONB,        -- What we said (wrong)
    corrected_result JSONB,       -- What it should be
    reason TEXT,                  -- "narrator_mismatch", "wrong_book", "merged_duplicate"
    created_at TIMESTAMP DEFAULT NOW()
);

-- API endpoint
GET /api/corrections?since=2026-02-01T00:00:00Z
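For concreteness, here is what one correction record from that endpoint might look like - the field names mirror the columns of the corrections table, but the exact JSON shape (and the narrator values) are illustrative:

```python
# Illustrative correction record; keys mirror the corrections table columns
sample_correction = {
    "id": 42,
    "affected_sl_ids": ["SL_000123"],
    "affected_audio_hashes": ["9f2dab44c1e0"],
    "original_result": {"narrator": "Ray Porter"},
    "corrected_result": {"narrator": "R.C. Bray"},
    "reason": "narrator_mismatch",
    "created_at": "2026-02-02T12:00:00Z",
}
```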

Library Manager side

# Settings
{
    "receive_corrections": true,      # Opt-in
    "auto_apply_corrections": false,  # Auto-fix or queue for review
    "last_correction_check": "2026-02-03T00:00:00Z"
}

from datetime import datetime

# Periodic check (on startup + every 24h)
async def check_for_corrections():
    if not config.get('receive_corrections'):
        return
    
    corrections = await skaldleita.get_corrections(
        since=config['last_correction_check']
    )
    
    for correction in corrections:
        # Find local books that match
        affected = db.query("""
            SELECT * FROM books 
            WHERE skaldleita_id = ANY(%s) 
               OR audio_hash = ANY(%s)
        """, [correction['affected_sl_ids'], correction['affected_audio_hashes']])
        
        for book in affected:
            if config.get('auto_apply_corrections'):
                apply_correction(book, correction)
                log_activity(f"Auto-corrected: {book.title}")
            else:
                queue_for_review(book, correction)
                log_activity(f"Correction available: {book.title}")
    
    config['last_correction_check'] = datetime.utcnow().isoformat()
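A sketch of the `apply_correction` helper used above - merging the corrected fields over the stored metadata and keeping a record of what changed is an assumption about how it should behave:

```python
def apply_correction(metadata: dict, correction: dict) -> dict:
    """Overlay corrected fields onto book metadata, logging each change."""
    corrected = dict(metadata)  # shallow copy; leave the original untouched
    changes = {}
    for key, new_value in correction["corrected_result"].items():
        if corrected.get(key) != new_value:
            changes[key] = {"from": corrected.get(key), "to": new_value}
            corrected[key] = new_value
    corrected["_correction_log"] = changes  # audit trail for the review queue
    return corrected
```

Returning a new dict rather than mutating in place keeps the "queue for review" path cheap: the user can diff the two versions before committing.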

Narrator ID integration

This becomes powerful with voice fingerprinting:

def verify_narrator_match(identification):
    """Cross-check: does this narrator actually narrate this book?"""
    narrator_id = identification.get('narrator_voice_id')
    book_asin = identification.get('asin')
    
    if not narrator_id or not book_asin:
        return True  # Can't verify
    
    narrator_catalog = get_narrator_catalog(narrator_id)
    
    if book_asin not in narrator_catalog:
        # Voice matches narrator X, but narrator X doesn't narrate this book
        create_correction(
            sl_id=identification['sl_id'],
            reason="narrator_catalog_mismatch",
            details=f"Voice={narrator_id}, but ASIN {book_asin} not in their catalog"
        )
        return False
    
    return True
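Stripped to its core, the cross-check is a set-membership test. Here `CATALOGS` is a stand-in for the real narrator database, with made-up voice IDs and ASINs:

```python
# Stub catalog: narrator voice ID -> set of ASINs they narrate (illustrative data)
CATALOGS = {"voice_ray_porter": {"ASIN_A", "ASIN_B"}}

def narrator_narrates(narrator_id: str, asin: str) -> bool:
    """True if the ASIN appears in the narrator's known catalog."""
    return asin in CATALOGS.get(narrator_id, set())
```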

Full Pipeline Flow

┌─────────────────────────────────────────────────────────────────────┐
│  INCOMING BOOK                                                      │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 1: FILE VALIDATION (local, fast)                      │   │
│  │  • ffprobe check                                            │   │
│  │  • Duration/size sanity                                     │   │
│  │  • Can seek to end?                                         │   │
│  │  Result: VALID → continue | INVALID → quarantine            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 2: FOLDER TRIAGE (local, fast)                        │   │
│  │  • Pattern matching on folder name                          │   │
│  │  • 🟢 Clean → use path hints                                │   │
│  │  • 🟡 Messy → skip path, audio-only                         │   │
│  │  • 🔴 Garbage → skip path, audio-only, expect difficulty    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 3: SKALDLEITA PROCESSING                              │   │
│  │  • GPU Whisper transcription                                │   │
│  │  • Database matching                                        │   │
│  │  • Log identification (for future corrections)              │   │
│  │  • Return result with confidence                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 4: PRECOG CONSENSUS (if multiple sources)             │   │
│  │  • Gather votes from all sources                            │   │
│  │  • Weight by reliability                                    │   │
│  │  • Flag disagreements for review                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ PHASE 5: APPLY TO LIBRARY                                   │   │
│  │  • Update metadata                                          │   │
│  │  • Embed tags (if enabled)                                  │   │
│  │  • Rename/reorganize (if enabled)                           │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

                    LATER (async, periodic)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  CORRECTION CHECK (if opted in)                                     │
│  • Fetch corrections from Skaldleita                                │
│  • Match against local books by SL_ID or audio hash                 │
│  • Auto-apply or queue for review                                   │
└─────────────────────────────────────────────────────────────────────┘

User-Facing Features

Dashboard indicators

Scan Results:
├── 500 files found
├── 485 valid audiobooks
│   ├── 400 clean folders (normal processing)
│   ├── 70 messy folders (audio-only processing)
│   └── 15 garbage folders (audio-only, flagged)
├── 10 quarantined (corrupt/incomplete)
└── 5 skipped (too short, likely samples)

Corrections:
└── 3 corrections available (click to review)
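The counts in that tree can be produced by folding per-file outcomes into a counter. A sketch, with the `(valid, reason, category)` tuple shape assumed rather than taken from the issue:

```python
from collections import Counter

def summarize_scan(results):
    """Fold (valid, reason, category) tuples into dashboard counts."""
    summary = Counter()
    for valid, reason, category in results:
        summary["found"] += 1
        if valid:
            summary["valid"] += 1
            summary[category] += 1   # clean / messy / garbage
        elif reason == "too_short_not_audiobook":
            summary["skipped"] += 1  # likely a sample, not quarantined
        else:
            summary["quarantined"] += 1
    return summary
```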

Settings

| Setting | Description | Default |
| --- | --- | --- |
| validate_files | Check files before processing | true |
| skip_messy_path_parsing | Don't trust messy folder names | true |
| receive_corrections | Get corrections from Skaldleita | true |
| auto_apply_corrections | Apply corrections automatically | false |

Implementation Order

  1. File validation - Quick win, prevents garbage from entering pipeline
  2. Folder triage - Adjusts strategy per-entry, preserves automation
  3. Corrections infrastructure - Skaldleita side first, then LM integration
  4. Narrator ID integration - Once voice fingerprinting is solid

Related

Suggested by: @Merijeek (triage idea), @deucebucket (corrections concept)
