Labels: enhancement (New feature or request)
Summary
A comprehensive pre-processing system that validates files, triages folders, and pushes corrections, all while preserving full automation.
Core philosophy: Don't trust garbage folders, DO trust audio. And if we screw up, fix it.
Part 1: File Validation (Local, Fast)
Before wasting Skaldleita's time, verify files are actually valid audiobooks.
What we check
```python
import json
import subprocess

def validate_audio_file(path: str) -> tuple[bool, str]:
    """Quick validation using ffprobe/ffmpeg."""
    try:
        result = subprocess.run([
            'ffprobe', '-v', 'error',
            '-select_streams', 'a',
            '-show_entries', 'stream=codec_type:format=duration,size',
            '-of', 'json', path
        ], capture_output=True, timeout=30)
        if result.returncode != 0:
            return False, "corrupt_or_unreadable"
        data = json.loads(result.stdout)
        if not data.get('streams'):
            return False, "no_audio_stream"  # video-only or empty container
        duration = float(data['format'].get('duration', 0))
        size = int(data['format'].get('size', 0))
        if duration == 0:
            return False, "no_duration_truncated"
        if duration < 600:  # < 10 minutes
            return False, "too_short_not_audiobook"
        if size < 1_000_000:  # < 1MB
            return False, "too_small"
        # Try reading the end of the file (catches truncated downloads)
        seek_result = subprocess.run([
            'ffmpeg', '-v', 'error', '-sseof', '-10',
            '-i', path, '-f', 'null', '-'
        ], capture_output=True, timeout=30)
        if seek_result.returncode != 0:
            return False, "truncated_cant_seek_end"
        return True, "valid"
    except Exception as e:
        return False, f"validation_error_{e}"
```
What this catches
| Check | Problem Detected |
|---|---|
| ffprobe fails | Completely corrupt, not audio, wrong format |
| Duration = 0 | Truncated/incomplete download |
| Duration < 10 min | Sample, trailer, not full audiobook |
| Size < 1MB | Empty or stub file |
| Can't seek to end | Download interrupted mid-file |
| No audio stream | Video mislabeled, container without audio |
Result
- Valid → Continue to triage
- Invalid → Quarantine with reason, don't process
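The quarantine step can be sketched as follows; `quarantine_file` and the `QUARANTINE_DIR` layout are illustrative assumptions, not part of this proposal:

```python
import os
import shutil

# Hypothetical quarantine location; the real path would come from LM config.
QUARANTINE_DIR = "/library/.quarantine"

def quarantine_file(path: str, reason: str) -> str:
    """Move an invalid file into a per-reason quarantine subfolder.

    `reason` is the second element returned by validate_audio_file(),
    e.g. "too_small" or "truncated_cant_seek_end".
    """
    dest_dir = os.path.join(QUARANTINE_DIR, reason)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(path))
    shutil.move(path, dest)
    return dest
```

A scan loop would run `validate_audio_file()` first and route any `(False, reason)` result here; grouping by reason also gives the dashboard its quarantine breakdown for free.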
Part 2: Folder Triage (Local, Fast)
Categorize folders by "cleanliness" to decide processing strategy.
Detection patterns
```python
import os
import re

MESSY_PATTERNS = [
    # Scene release tags
    r'\{[a-z]+\}',            # {mb}, {cbt}
    r'\[[A-Z0-9]+\]',         # [FLAC], [MP3]
    r'\([A-Za-z]+\)',         # (Thorne), (narrator)
    # Embedded dates/numbers
    r'^\d{4}\s*-',            # 2023 -
    r'\d{2}\.\d{2}\.\d{2}',   # 01.10.42
    # Size/quality markers
    r'\d+k\b',                # 62k, 128k
    r'\d+kbps',               # 64kbps
    r'\bHQ\b|\bLQ\b',         # Quality markers
    # Torrent/scene indicators
    r'-[A-Z]{2,4}$',          # -TEAM suffix
    r'\.com\b',               # Website in name
]

GARBAGE_PATTERNS = [
    r'^[a-f0-9]{12,}$',       # Hash-only names
    r'^[\d\s\-\.]+$',         # Numbers only
    r'^(New Folder|tmp|downloads?|torrents?)$',
]

def triage_folder(path: str) -> str:
    folder_name = os.path.basename(path)
    for pattern in GARBAGE_PATTERNS:
        if re.match(pattern, folder_name, re.I):
            return "garbage"  # 🔴
    for pattern in MESSY_PATTERNS:
        if re.search(pattern, folder_name):
            return "messy"    # 🟡
    return "clean"            # 🟢
```
Processing strategy by category
| Category | Path Parsing | Skaldleita | Confidence Modifier |
|---|---|---|---|
| 🟢 Clean | Use as hints | Normal | None |
| 🟡 Messy | SKIP | Audio-only, no folder hints | None |
| 🔴 Garbage | SKIP | Audio-only, no folder hints | -10% (expect harder match) |
Key insight: Messy folders don't stop automation - they just change the strategy to "trust audio only."
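Concretely, the strategy table could map onto the identification request like this; `IdentifyRequest` and its field names are hypothetical illustrations, not Skaldleita's actual API:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class IdentifyRequest:
    audio_path: str
    path_hints: Optional[str]   # folder-name hints, only when trusted
    confidence_modifier: float  # applied to the match confidence

def build_request(path: str, category: str) -> IdentifyRequest:
    """Translate a triage category into a processing strategy."""
    if category == "clean":
        # Trust the folder name as a hint
        return IdentifyRequest(path, os.path.basename(path), 0.0)
    if category == "messy":
        # Skip path parsing; audio-only identification
        return IdentifyRequest(path, None, 0.0)
    # garbage: audio-only, and expect a harder match (-10%)
    return IdentifyRequest(path, None, -0.10)
```

The point is that every category still produces a request: nothing is dropped, only the hint and confidence inputs change.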
Part 3: Push Corrections (Distributed Fix)
When Skaldleita discovers it gave wrong info, push corrections to opted-in users.
Why this matters
- Skaldleita tells User A: "Book X narrated by Ray Porter"
- User A's library gets this metadata
- Later, voice fingerprinting reveals: Ray Porter doesn't narrate Book X
- User A now has wrong data we gave them
Skaldleita side
```sql
-- Log what we told users (for correction matching)
CREATE TABLE identification_log (
    id SERIAL PRIMARY KEY,
    audio_hash TEXT,              -- Hash of the audio sample we analyzed
    skaldleita_id TEXT,           -- The SL_ID we assigned
    result_json JSONB,            -- Full result we returned
    created_at TIMESTAMP DEFAULT NOW()
);

-- Store corrections when we discover mistakes
CREATE TABLE corrections (
    id SERIAL PRIMARY KEY,
    affected_sl_ids TEXT[],       -- Which SL_IDs are affected
    affected_audio_hashes TEXT[], -- Which audio hashes are affected
    original_result JSONB,        -- What we said (wrong)
    corrected_result JSONB,       -- What it should be
    reason TEXT,                  -- "narrator_mismatch", "wrong_book", "merged_duplicate"
    created_at TIMESTAMP DEFAULT NOW()
);
```
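For illustration, a serialized `corrections` row might look like this; the field names follow the table above, but the wire format is an assumption and all values are made up:

```json
{
  "id": 4182,
  "affected_sl_ids": ["SL_0001"],
  "affected_audio_hashes": ["sha256:example"],
  "original_result": { "title": "Book X", "narrator": "Ray Porter" },
  "corrected_result": { "title": "Book X", "narrator": "Actual Narrator" },
  "reason": "narrator_mismatch",
  "created_at": "2026-02-02T12:00:00Z"
}
```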
API endpoint:

```
GET /api/corrections?since=2026-02-01T00:00:00Z
```
Library Manager side
```python
# Settings
config = {
    "receive_corrections": True,        # Opt-in
    "auto_apply_corrections": False,    # Auto-fix, or queue for review
    "last_correction_check": "2026-02-03T00:00:00Z",
}
```
```python
from datetime import datetime

# Periodic check (on startup + every 24h)
async def check_for_corrections():
    if not config.get('receive_corrections'):
        return
    corrections = await skaldleita.get_corrections(
        since=config['last_correction_check']
    )
    for correction in corrections:
        # Find local books that match
        affected = db.query("""
            SELECT * FROM books
            WHERE skaldleita_id = ANY(%s)
               OR audio_hash = ANY(%s)
        """, [correction['affected_sl_ids'], correction['affected_audio_hashes']])
        for book in affected:
            if config.get('auto_apply_corrections'):
                apply_correction(book, correction)
                log_activity(f"Auto-corrected: {book.title}")
            else:
                queue_for_review(book, correction)
                log_activity(f"Correction available: {book.title}")
    config['last_correction_check'] = datetime.utcnow().isoformat()
```
Narrator ID integration
This becomes powerful with voice fingerprinting:
```python
def verify_narrator_match(identification):
    """Cross-check: does this narrator actually narrate this book?"""
    narrator_id = identification.get('narrator_voice_id')
    book_asin = identification.get('asin')
    if not narrator_id or not book_asin:
        return True  # Can't verify
    narrator_catalog = get_narrator_catalog(narrator_id)
    if book_asin not in narrator_catalog:
        # Voice matches narrator X, but narrator X doesn't narrate this book
        create_correction(
            sl_id=identification['sl_id'],
            reason="narrator_catalog_mismatch",
            details=f"Voice={narrator_id}, but ASIN {book_asin} not in their catalog"
        )
        return False
    return True
```
Full Pipeline Flow
```
INCOMING BOOK
  │
  ▼
PHASE 1: FILE VALIDATION (local, fast)
  • ffprobe check
  • Duration/size sanity
  • Can seek to end?
  Result: VALID → continue | INVALID → quarantine
  │
  ▼
PHASE 2: FOLDER TRIAGE (local, fast)
  • Pattern matching on folder name
  • 🟢 Clean → use path hints
  • 🟡 Messy → skip path, audio-only
  • 🔴 Garbage → skip path, audio-only, expect difficulty
  │
  ▼
PHASE 3: SKALDLEITA PROCESSING
  • GPU Whisper transcription
  • Database matching
  • Log identification (for future corrections)
  • Return result with confidence
  │
  ▼
PHASE 4: PRECOG CONSENSUS (if multiple sources)
  • Gather votes from all sources
  • Weight by reliability
  • Flag disagreements for review
  │
  ▼
PHASE 5: APPLY TO LIBRARY
  • Update metadata
  • Embed tags (if enabled)
  • Rename/reorganize (if enabled)

LATER (async, periodic)
  │
  ▼
CORRECTION CHECK (if opted in)
  • Fetch corrections from Skaldleita
  • Match against local books by SL_ID or audio hash
  • Auto-apply or queue for review
```
User-Facing Features
Dashboard indicators
```
Scan Results:
├── 500 files found
├── 485 valid audiobooks
│   ├── 400 clean folders (normal processing)
│   ├── 70 messy folders (audio-only processing)
│   └── 15 garbage folders (audio-only, flagged)
├── 10 quarantined (corrupt/incomplete)
└── 5 skipped (too short, likely samples)

Corrections:
└── 3 corrections available (click to review)
```
Settings
| Setting | Description | Default |
|---|---|---|
| `validate_files` | Check files before processing | `true` |
| `skip_messy_path_parsing` | Don't trust messy folder names | `true` |
| `receive_corrections` | Get corrections from Skaldleita | `true` |
| `auto_apply_corrections` | Apply corrections automatically | `false` |
Implementation Order
1. File validation: quick win, prevents garbage from entering the pipeline
2. Folder triage: adjusts strategy per entry, preserves automation
3. Corrections infrastructure: Skaldleita side first, then LM integration
4. Narrator ID integration: once voice fingerprinting is solid
Related
- Precog consensus voting: #102 (Implement consensus voting across data sources, Minority Report model), implemented in PR #107
- Original discussion: #79 ([BUG] 3 last books stuck in the queue; Merijeek's folder name observations)
Suggested by: @Merijeek (triage idea), @deucebucket (corrections concept)