
feat: Folder triage for smart pre-processing (#110)#149

Merged
deucebucket merged 3 commits into develop from
feature/issue-110-folder-triage
Feb 11, 2026

@deucebucket
Owner

Summary

  • Adds folder name triage system (clean/messy/garbage) to categorize book folders before processing
  • Messy folders (scene tags, torrent markers, codec labels) skip path-derived hints
  • Garbage folders (hash names, numbers-only, generic placeholders) also get a confidence penalty
  • Integrated across scanning, AI identification, Whisper transcription hints, and queue processing
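
A minimal sketch of how such a three-way triage could look. The function names `triage_folder` and `should_use_path_hints` and the `_MESSY_COMPILED`/`_GARBAGE_COMPILED` lists are mentioned in this PR; the concrete patterns, the handling of empty names, and `confidence_modifier` with its penalty value are assumptions made for illustration only.

```python
import re

# Hypothetical patterns; the real lists in folder_triage.py are not shown here.
MESSY_PATTERNS = [
    r"\b\d{2,3}\s?kbps\b",                 # bitrate labels
    r"\[[^\]]*\]",                         # bracketed scene or torrent tags
    r"\b(?:mp3|m4b|flac)\b",               # codec labels
]
GARBAGE_PATTERNS = [
    r"^[0-9a-f]{16,}$",                    # hash-like names
    r"^\d+$",                              # numbers only
    r"^(?:new folder|untitled|unknown)$",  # generic placeholders
]

# Compile once at import time so each folder check only runs the matches.
_MESSY_COMPILED = [re.compile(p, re.IGNORECASE) for p in MESSY_PATTERNS]
_GARBAGE_COMPILED = [re.compile(p, re.IGNORECASE) for p in GARBAGE_PATTERNS]


def triage_folder(folder_name: str) -> str:
    """Return 'clean', 'messy', or 'garbage' for a folder name."""
    if not folder_name or not folder_name.strip():
        return "garbage"  # assumption: an empty name carries no usable hint
    name = folder_name.strip()
    if any(p.search(name) for p in _GARBAGE_COMPILED):
        return "garbage"
    if any(p.search(name) for p in _MESSY_COMPILED):
        return "messy"
    return "clean"


def should_use_path_hints(triage_result: str) -> bool:
    """Only clean folders contribute path-derived hints."""
    return triage_result == "clean"


def confidence_modifier(triage_result: str) -> int:
    """Hypothetical penalty: garbage folders lose confidence points."""
    return -10 if triage_result == "garbage" else 0
```

Garbage patterns are checked first so that a name matching both lists falls into the stricter category.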

Changes

  • New: library_manager/folder_triage.py — regex pattern matching with compiled patterns
  • Modified: app.py — scanning stores triage result, AI skips unreliable folder names, Whisper skips bad hints
  • Modified: library_manager/database.py — folder_triage column migration
  • Modified: library_manager/pipeline/layer_ai_queue.py — triage-aware prompt building
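
Triage-aware prompt building might look roughly like the sketch below. The function name, its parameters, and the prompt wording are hypothetical; the only behavior taken from this PR is that unreliable folder names are kept out of the AI prompt.

```python
from typing import List


def build_identification_prompt(
    folder_name: str, filenames: List[str], folder_triage: str
) -> str:
    """Build an AI prompt, including the folder name only when it is trusted."""
    lines = ["Identify this audiobook from the evidence below."]
    if folder_triage == "clean":
        lines.append(f"Folder name: {folder_name}")
    else:
        # Messy or garbage: do not let the folder name bias the model.
        lines.append("Folder name: (omitted, judged unreliable)")
    lines.append("Files: " + ", ".join(filenames))
    return "\n".join(lines)
```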

Test plan

  • 281 naming tests pass
  • ruff F821 clean (only deprecated files flagged)
  • folder_triage module smoke tests pass
  • Sandbox testing with chaos library (messy/garbage folders)

Closes #110 (Part 2)

Categorize folder names as clean/messy/garbage to adjust processing
strategy. Messy/garbage folders skip path-derived hints and rely on
audio metadata and AI instead.

- New module: library_manager/folder_triage.py
- Regex patterns for scene tags, torrent markers, hash names, etc.
- Integrated into scanning, AI identification, Whisper hints, and queue
- Stored per-book in database, exposed via API
@bucket-agent

bucket-agent bot commented Feb 11, 2026

🔍 Vibe Check Review

Context

PR #149 implements folder triage (Issue #110 Part 2) - a new system to categorize folder names as clean/messy/garbage and adjust processing strategy accordingly. Clean folders use path hints normally, messy folders skip path parsing, and garbage folders get confidence penalties.

Codebase Patterns I Verified

Error Handling: Database migrations use bare except: pass with inline comments like # Column already exists - this is the established pattern for schema changes (checked database.py lines 41-134). This is acceptable for ALTER TABLE operations where the exception is expected and documented.

Type Hints: Mixed usage - newer modules like file_validation.py and worker.py use full type hints, but app.py and older code do not. The new folder_triage.py lacks type hints, which is inconsistent with newer code standards.

Logging: Uses module-level logger = logging.getLogger(__name__) consistently across the codebase. The new module follows this pattern correctly.

Database Changes: Schema updates follow established pattern - add column with try/except, default value, inline comment explaining purpose. The folder_triage column addition matches this exactly.
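
The migration pattern described here can be sketched as follows, assuming SQLite (suggested by the `?` placeholders mentioned elsewhere in this thread) and a hypothetical `books` table. The review describes a bare `except: pass`; this sketch swaps in the narrower `sqlite3.OperationalError`, which is what SQLite raises for a duplicate column.

```python
import sqlite3


def add_folder_triage_column(conn: sqlite3.Connection) -> None:
    """Idempotently add the folder_triage column (table name is assumed)."""
    try:
        conn.execute(
            "ALTER TABLE books ADD COLUMN folder_triage TEXT DEFAULT 'clean'"
        )
    except sqlite3.OperationalError:
        pass  # Column already exists


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, path TEXT)")
add_folder_triage_column(conn)
add_folder_triage_column(conn)  # second call is a no-op
```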

✅ Good

  • Comprehensive integration: Folder triage is properly integrated into 5 key areas: deep_scan, AI identification, Whisper transcription, queue processing, and dashboard display
  • Performance optimization: Regex patterns pre-compiled at module import time (_MESSY_COMPILED, _GARBAGE_COMPILED)
  • Database migration safety: Uses try/except pattern with clear default value and issue reference
  • CHANGELOG properly updated: Clear description of functionality with issue reference
  • Defensive coding: Empty string checks before processing (if not folder_name or not folder_name.strip())
  • Backwards compatibility: Database column has DEFAULT 'clean' and code uses or 'clean' fallback
  • Clear documentation: Module docstring explains the three categories and their processing strategies

🚨 Issues Found

| Severity | Location | Issue | Fix |
| --- | --- | --- | --- |
| MEDIUM | library_manager/folder_triage.py:48-98 | Missing type hints - All functions lack type hints, inconsistent with project's newer code (file_validation.py, worker.py, signing.py all use full type hints) | Add return type annotations: def triage_folder(folder_name: str) -> str:, def should_use_path_hints(triage_result: str) -> bool:, etc. |
| LOW | library_manager/folder_triage.py:21 | Regex may match legitimate names - r'\([A-Za-z]+\)' marks as messy any folder with parentheses like "Book Title (2023)" or "Series Name (Book 1)". These are common, legitimate naming patterns for audiobooks. | Consider narrowing to r'\(narrator:?\s*[A-Za-z]+\)' or only flagging if combined with other markers |
| LOW | app.py:9177 | Unnecessary fallback - row['folder_triage'] or 'clean' but DB column has DEFAULT 'clean' so this should never be NULL unless explicitly set | Remove or 'clean' since DB guarantees non-null value (line 132: TEXT DEFAULT 'clean') |
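
A quick check of how the quoted pattern `r'\([A-Za-z]+\)'` behaves on a few made-up folder names. As written, the letters-only character class matches single-word parentheticals; parentheticals containing digits or spaces, such as "(2023)" or "(Book 1)", do not match, so the false-positive risk sits with names like "Foundation (Omnibus)".

```python
import re

# The broad pattern as quoted in the review table.
broad = re.compile(r"\([A-Za-z]+\)")

# Sample folder names are invented for illustration.
samples = [
    "Dune (Unabridged)",
    "Foundation (Omnibus)",
    "Foundation (Book 1)",
    "Book Title (2023)",
]

# Keep only the names the broad pattern would mark as messy.
flagged = [name for name in samples if broad.search(name)]
print(flagged)
```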

📋 Scope Verification

| Issue | Problem | Addressed? | Notes |
| --- | --- | --- | --- |
| #110 | Smart pre-processing pipeline with folder triage | ⚠️ PARTIAL | Part 2 (folder triage) completed. Part 1 (file validation) already merged in prior PR. Part 3 (push corrections with LM integration) remains pending as documented in issue. |

Scope Status: SCOPE_PARTIAL - This PR correctly implements Part 2 of the 3-part feature as described in the issue.

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated - Clear entry for version 0.9.0-beta.123 explaining the folder triage feature
  • README.md: ⚠️ Consider updating - README Recent Changes section still shows beta.114 as latest. Since this adds user-visible behavior (folders categorized, path hints selectively applied, triage logged), the README should document this feature.

🎯 Verdict

REQUEST_CHANGES

Required fixes:

  1. Add type hints to folder_triage.py - This is a new module and should follow the type-hinted pattern established by file_validation.py, worker.py, and signing.py
  2. Review parentheses regex - Current pattern \([A-Za-z]+\) will flag legitimate book titles like "Foundation (Book 1)" or "The Lord of the Rings (2001)" as messy

Recommended (not blocking):

  3. Consider updating README.md Recent Changes section to document the folder triage feature for users
  4. Remove redundant or 'clean' fallback on line 9177 since DB guarantees non-null

Why REQUEST_CHANGES: The type hints issue is significant because this is new code that will establish patterns for future maintenance. The parentheses regex could cause false positives on common audiobook naming conventions, potentially degrading the quality of AI hints for legitimately clean folders.

- Add type hints to all folder_triage.py functions
- Narrow parentheses regex to avoid false positives on legitimate
  names like "Foundation (Book 1)" - now only matches known messy
  keywords (narrator, unabridged, rip, scene, kbps)
- Remove redundant 'or clean' fallback since DB DEFAULT handles it
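
One way the narrowed parentheses check could be written, using the keywords listed in this commit message. The exact pattern in the commit is not shown here, so the regex and the helper name `has_messy_parenthetical` are assumptions.

```python
import re

# Assumed form: flag a parenthetical only if it contains a known messy keyword.
# Word boundaries keep "rip" from matching inside words like "Script".
_MESSY_PAREN = re.compile(
    r"\([^)]*\b(?:narrator|unabridged|rip|scene|\d*\s?kbps)\b[^)]*\)",
    re.IGNORECASE,
)


def has_messy_parenthetical(folder_name: str) -> bool:
    """Hypothetical helper: True if a parenthetical holds a messy keyword."""
    return _MESSY_PAREN.search(folder_name) is not None
```

With this form, "Dune (Unabridged)" and "Dune (64kbps)" are flagged, while "Foundation (Book 1)" and "Foundation (Omnibus)" are left alone.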
@bucket-agent

bucket-agent bot commented Feb 11, 2026

🔍 Vibe Check Review

Context

PR #149 implements Issue #110 Part 2: Folder triage system that categorizes folder names as clean/messy/garbage to control whether path-derived hints should be trusted during audiobook identification.

Codebase Patterns I Verified

  1. Logging pattern: Uses logger = logging.getLogger(__name__) consistently (verified in folder_triage.py:16, app.py, and other modules)
  2. Error handling: Project uses except: for SQL migration rollback (database.py pattern seen at lines 41, 47, 54, etc.)
  3. Null coalescing: Codebase extensively uses row['field'] or 'default' pattern (app.py:8825, 8885, 9224, library_manager/pipeline/layer_ai_queue.py:211)
  4. Type hints: Used on public functions in modules (folder_triage.py uses List[str], Tuple[str, str], and -> str consistently)
  5. Database migrations: Uses try/except to safely add columns (database.py:114-134 pattern)
  6. Issue tracking: Comments reference issue numbers throughout (standard practice in this codebase)

✅ Good

  • Excellent defensive coding: Empty/None folder name handled (folder_triage.py:58-59)
  • Performance optimization: Regex patterns compiled once at import time (folder_triage.py:45-46)
  • Clear separation of concerns: New module is focused and standalone
  • Comprehensive integration: Triage consistently applied across all processing paths (transcription, AI prompts, queue processing, scanning)
  • Good documentation: Inline comments explain WHY triage is applied at each integration point
  • Safe database migration: Uses try/except pattern matching existing codebase style
  • Proper fallback handling: Uses row.get('folder_triage') or 'clean' matching existing null-coalescing patterns
  • Logging coverage: Important triage decisions logged for debugging (non-clean folders, skipped hints)
  • Stats tracking: Triage counts logged at scan completion for visibility
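
A small sketch of why the null-coalescing fallback can still matter alongside a column default, assuming SQLite and a hypothetical `books` table. A `DEFAULT` without `NOT NULL` fills in omitted values but does not stop an explicit NULL from being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, path TEXT)")

# A row that exists before the migration picks up the default afterwards.
conn.execute("INSERT INTO books (path) VALUES (?)", ("/lib/old",))
conn.execute("ALTER TABLE books ADD COLUMN folder_triage TEXT DEFAULT 'clean'")

# An explicit NULL bypasses the default because the column is nullable.
conn.execute(
    "INSERT INTO books (path, folder_triage) VALUES (?, ?)", ("/lib/null", None)
)

rows = conn.execute("SELECT folder_triage FROM books ORDER BY id").fetchall()
raw = [row["folder_triage"] for row in rows]
safe = [row["folder_triage"] or "clean" for row in rows]
```

Here `raw` contains a `None` for the second row, while `safe` reads as `'clean'` throughout.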

🚨 Issues Found

| Severity | Location | Issue | Fix |
| --- | --- | --- | --- |
| None | N/A | No security, error handling, or defensive programming issues found | N/A |

📋 Scope Verification

| Issue | Problem | Addressed? | Notes |
| --- | --- | --- | --- |
| #110 | Smart Pre-Processing Pipeline (Part 2: Folder triage) | ✅ FULLY | Implements folder categorization as clean/messy/garbage with integration into all processing layers. Part 1 (file validation) completed in previous PR. Part 3 (corrections with LM) remains for future work. |

Scope Status: SCOPE_OK

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated with comprehensive description of folder triage feature
  • README.md: N/A - Internal processing enhancement, not user-facing UI feature

🎯 Verdict

APPROVE

This is clean, well-integrated code that matches existing codebase patterns:

  1. Security: No hardcoded secrets, no unsafe operations, regex patterns are safe
  2. Error Handling: Defensive checks for None/empty, database migration uses standard try/except pattern
  3. Code Quality: Type hints present, clear function names, no duplication, no magic numbers
  4. Integration: Consistently applied across transcription, AI prompts, queue processing, and scanning
  5. Scope: Precisely addresses Issue #110 (Smart Pre-Processing Pipeline: Validation, Triage, and Corrections) Part 2 as described
  6. Pattern Matching: Follows codebase conventions for logging, null-coalescing, database migrations

No changes required. Ship it.

@deucebucket deucebucket merged commit 8304b12 into develop Feb 11, 2026
@deucebucket deucebucket deleted the feature/issue-110-folder-triage branch February 11, 2026 11:12
@bucket-agent

bucket-agent bot commented Feb 11, 2026

🔍 Vibe Check Review

Context

PR #149 implements Part 2 of Issue #110 - adds folder triage system that categorizes folder names as clean/messy/garbage and uses this to decide whether to trust path-derived hints during book identification.

Codebase Patterns I Verified

  • Logging: Uses logger = logging.getLogger(__name__) pattern ✅
  • Type hints: Consistently uses type hints on public functions (verified in file_validation.py, precog.py, signing.py)
  • Error handling: Bare except: used ONLY for database migrations (ALTER TABLE) - this is the established pattern (verified lines 91-127 in database.py)
  • Database migrations: All use try/except/pass pattern for idempotent schema updates ✅
  • Constants: Module-level constants are uppercase (MESSY_PATTERNS, GARBAGE_PATTERNS) ✅
  • SQL parameterization: All queries use ? placeholders, not string formatting ✅
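
The parameterization pattern noted above can be illustrated with a short sketch. It assumes SQLite and a hypothetical `books` table keyed by `path`; the helper name is invented for this example.

```python
import sqlite3


def set_triage_by_path(conn: sqlite3.Connection, path: str, triage: str) -> int:
    """Store a triage result using ? placeholders; returns rows updated."""
    cur = conn.execute(
        "UPDATE books SET folder_triage = ? WHERE path = ?", (triage, path)
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (path TEXT, folder_triage TEXT DEFAULT 'clean')")

# A path with quote characters is passed as data, never spliced into the SQL.
tricky_path = "/lib/O'Brien'; DROP TABLE books;--"
conn.execute("INSERT INTO books (path) VALUES (?)", (tricky_path,))
updated = set_triage_by_path(conn, tricky_path, "messy")
```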

✅ What's Good

  • Smart pre-compilation: Regex patterns compiled once at module load (_MESSY_COMPILED, _GARBAGE_COMPILED) for performance
  • Type safety: Full type hints on all new functions (triage_folder(folder_name: str) -> str)
  • Defensive coding: Checks for None and strips whitespace before processing (if not folder_name or not folder_name.strip())
  • Backward compatible: DB migration adds column with DEFAULT value, old code won't break
  • Security: All SQL queries properly parameterized - no injection risk
  • Proper integration: Updates triage info during scans with backfill for existing books
  • Documentation: CHANGELOG clearly explains the feature and integration points
  • Scope alignment: Directly implements Part 2 of Issue #110 (Smart Pre-Processing Pipeline: Validation, Triage, and Corrections) as specified

🚨 Issues Found

NONE - This PR is clean and ready to ship.

📋 Scope Verification

| Issue | Problem | Addressed? | Notes |
| --- | --- | --- | --- |
| #110 Part 2 | Implement folder triage to categorize folder names and control hint usage | ✅ YES | Fully implemented with clean/messy/garbage categories, integrated into all relevant code paths |

Original Problem: "Implement a smart pre-processing pipeline that validates audiobook files, triages folders, and applies corrections automatically."

Part 2 (Folder Triage): COMPLETE

  • New folder_triage.py module with pattern-based categorization
  • Three triage categories: clean, messy, garbage
  • Integrated into: calculate_input_quality, transcribe_audio_intro, identify_book_with_ai, deep_scan_library, layer_ai_queue processing
  • Database column added with proper migration
  • Triage results logged and counted during scans
  • Confidence modifiers applied for garbage folders
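
Logging and counting triage results during a scan could be sketched like this. `count_triage` and `toy_triage` are hypothetical names; `toy_triage` is a stand-in classifier so the example runs on its own and is not the project's actual logic.

```python
import logging
from collections import Counter
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def toy_triage(name: str) -> str:
    """Stand-in classifier for the demo only."""
    if name.isdigit():
        return "garbage"
    return "messy" if "[" in name else "clean"


def count_triage(
    folder_names: Iterable[str], triage: Callable[[str], str]
) -> Dict[str, int]:
    """Classify each folder, log non-clean results, and return the totals."""
    counts = Counter({"clean": 0, "messy": 0, "garbage": 0})
    for name in folder_names:
        result = triage(name)
        counts[result] += 1
        if result != "clean":
            logger.info("Folder triage: %r is %s, skipping path hints", name, result)
    logger.info(
        "Scan complete: %d clean, %d messy, %d garbage",
        counts["clean"], counts["messy"], counts["garbage"],
    )
    return dict(counts)
```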

Part 3 (Push corrections): ⚠️ Not in scope for this PR - Part 3 remains pending per issue description

Scope Status: SCOPE_OK - PR fully addresses Part 2 as specified

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated with comprehensive description of the feature
  • README.md: N/A - Internal processing feature, not user-facing

🎯 Verdict

APPROVE

This is excellent work. The implementation is:

  • ✅ Secure (no security vulnerabilities)
  • ✅ Robust (proper error handling, defensive coding)
  • ✅ Well-integrated (touches all the right places)
  • ✅ Backward compatible (DB migration with defaults)
  • ✅ Well-documented (CHANGELOG entry is clear)
  • ✅ Follows codebase patterns (logging, type hints, SQL parameterization)
  • ✅ Performance-conscious (pre-compiled regex patterns)
  • ✅ Scope-aligned (implements exactly Part 2 of Issue #110, Smart Pre-Processing Pipeline: Validation, Triage, and Corrections)

No changes requested. Ship it! 🚀
