Skip to content

Comments

feat: Use path info to complete partial Skaldleita results (#127)#157

Merged
deucebucket merged 5 commits intodevelopfrom
feature/issue-127-path-completion
Feb 16, 2026
Merged

feat: Use path info to complete partial Skaldleita results (#127)#157
deucebucket merged 5 commits intodevelopfrom
feature/issue-127-path-completion

Conversation

@deucebucket
Copy link
Owner

Summary

  • Adds _complete_result_from_path() in layer_audio_id.py to fill in truncated author names, titles, and missing series info using the file path
  • Only completes when the path version starts with the SL version (never replaces longer results with shorter fragments)
  • Small confidence boost (+5%) when path corroborates SL results
  • Integrated at three points: SL requeue, SL full ID, and AI fallback

Examples

  • SL returns "James S. A" + path has "James S. A. Corey" → uses full path version
  • SL returns no series + path has "The Expanse 01" → extracts series info
  • SL returns "The Way of Kings" + path has "Way of" → keeps longer SL version (safety check)

Test plan

  • Test with chaos library books that have truncated SL results
  • Verify author/title completion only fires on starts-with match
  • Verify series extraction from path patterns
  • Confirm confidence scoring adjustments are reasonable
  • Run test-naming-issues.py (281/281 pass)
  • ruff check --select=F821 clean

When Skaldleita returns a truncated name (e.g., "James S. A" instead of
"James S. A. Corey"), the file path often contains the full name. This
adds _complete_result_from_path() which:

- Completes truncated author names when path has a longer version that
  starts with the SL result
- Completes truncated titles using the same starts-with logic
- Extracts series info from path components when SL returned none
- Gives a small confidence boost when path corroborates SL results
- Never replaces a longer SL result with a shorter path fragment
- Requires minimum 4 characters to avoid false matches on trivial prefixes

Applied at three points in the pipeline:
1. SL requeue with partial ID
2. SL full identification after sanity check
3. AI fallback after sanity check
@bucket-agent
Copy link

bucket-agent bot commented Feb 16, 2026

🔍 Vibe Check Review

Context

This PR adds a new function _complete_result_from_path() that completes truncated author/title names and extracts missing series information from filesystem paths when Skaldleita returns partial results.

Codebase Patterns I Verified

Pattern checks performed:

  • Type hints: ✅ Existing function _validate_ai_result_against_path() uses -> Dict - new function matches
  • Import style: ✅ Moving import re to top-level imports is CORRECT - 18 other files do this, nested imports are not the pattern
  • Error handling:except (ValueError, TypeError): is used in 8 places across 4 files - this is the established pattern
  • Logging: ✅ Uses logger.info() pattern consistent with the file (36 instances in this file)
  • Return early pattern: ✅ Lines 43-44 match existing defensive patterns

✅ What's Good

  1. Excellent defensive programming - Checks for None/empty at function entry (lines 43-44)
  2. Safe confidence boost - Caps confidence at 0.95/95 to prevent overconfidence (lines 139, 141)
  3. Proper import cleanup - Moving import re to top is correct (nested import at line 149 was the anti-pattern)
  4. Thorough docstring - Clear explanation of purpose, args, and behavior
  5. Minimal scope - Only completes truncated results, never replaces longer values (line 64 check)
  6. Logging visibility - All completions are logged for debugging (lines 73, 81, 118)
  7. Integration points - Called at 3 appropriate locations (after SL results, after validation, after AI)

🚨 Issues Found

Severity Location Exact Code Quote Issue Fix
MEDIUM layer_audio_id.py:100-123 for part in parts: Iterates ALL path parts instead of filtering to relevant ones Change to for part in parts[-4:-1]: to check only the 3 parent dirs above the book folder
LOW layer_audio_id.py:115-117 if (len(potential_series) >= 3 Multi-line conditional hard to read Minor - acceptable as-is, but could be clearer with intermediate variable

Detailed explanation of MEDIUM issue:

The series extraction logic at line 100 iterates through ALL path components:

parts = path_obj.parts
# Look at path components between author and book for series folders
for part in parts:

This will check /, home, library, Author, Series, Book - most of which are irrelevant. The typical structure is library/Author/SeriesOrTitle/BookTitle, so only the last 3-4 components matter. Checking system paths like /home/library wastes cycles and could theoretically match spurious patterns.

Suggested fix:

# Check only the relevant parent directories (skip filesystem root and deep ancestors)
# Typical: [..., 'library', 'Author', 'Series', 'BookTitle']
for part in parts[-4:-1]:  # Check up to 3 parent dirs above book folder

📋 Scope Verification

Issue Problem Addressed? Notes
#127 Use path info to complete truncated author names Lines 71-76 handle author completion
#127 Complete truncated titles Lines 79-84 handle title completion
#127 Extract missing series info from path Lines 86-123 extract series from path patterns

Scope Status: SCOPE_OK

All requirements from Issue #127 are addressed. The PR completes truncated names and extracts series info from paths exactly as requested.

📝 Documentation Check

  • CHANGELOG.md: ❌ Missing - This is a feat: PR and should document the new path completion feature
  • README.md: N/A - Internal pipeline improvement, no user-facing UI changes

Suggested CHANGELOG entry:

## [Version] - Date

### Added

- **Issue #127: Path-based completion for partial results** - When Skaldleita returns truncated names (e.g., "James S. A" instead of "James S. A. Corey"), the system now uses folder path information to complete the full name. Also extracts series information from path structure when missing from audio identification.

🎯 Verdict

REQUEST_CHANGES - The code is SAFE and FUNCTIONAL, but needs:

  1. REQUIRED: Update CHANGELOG.md (feat: PRs must document changes)
  2. RECOMMENDED: Optimize series extraction to check only relevant path components (line 100)

The MEDIUM issue (inefficient path iteration) is a performance concern, not a correctness issue. The code will work correctly but will waste cycles checking irrelevant path components. Given this is a pipeline optimization feature, the performance inefficiency should be addressed.

Strengths:

  • Excellent defensive programming
  • Safe confidence handling
  • Clear logging
  • Well-integrated at appropriate call points
  • Matches all codebase patterns

The code quality is high. Fix the CHANGELOG and consider the path optimization, then this is ready to ship.

@bucket-agent
Copy link

bucket-agent bot commented Feb 16, 2026

🔍 Vibe Check Review

Context

This PR adds functionality to complete truncated Skaldleita results using folder path information (e.g., "James S. A" → "James S. A. Corey"), extract missing series info from path patterns, and boost confidence when path corroborates API results.

Codebase Patterns I Verified

  • Logging: Uses logger = logging.getLogger(__name__) pattern (line 42)
  • Import placement: re module imported at top-level (line 17), removed from nested function (line 31 deletion in diff)
  • Type hints: Used on public functions with -> Dict return types (e.g., line 126, 191)
  • Error handling: Uses specific exception types with try/except (ValueError, TypeError) patterns (seen in existing code at line 464-466)
  • Dict safety: Uses .get() with defaults like result.get('author', '') or '' pattern (line 93)
  • Regex patterns: Already used extensively in codebase (imported in multiple modules)

✅ Good

  • Excellent defensive programming: Checks for None/empty values before processing (if not result or not folder_hint: return result)
  • Safe prefix matching: 4-character minimum prevents false matches on trivial prefixes (line 86)
  • Conservative approach: Only completes truncated names, never replaces longer SL results with shorter path fragments
  • Type hint consistency: Uses -> Dict return type matching codebase patterns
  • Proper import placement: Moved import re to top-level (line 17) instead of nested import (removed line 31 in diff)
  • Integration points: Called at three strategic locations (SL requeue, bookdb result, AI result) without disrupting existing flow
  • Confidence boost logic: Handles both string ('low'/'medium'/'high') and numeric confidence values safely with try/except
  • Comprehensive path parsing: Checks multiple parent directories for series patterns, handles various formats
  • Safe regex: Patterns are well-formed, use re.IGNORECASE for flexibility

🚨 Issues Found

Severity File:Line Exact Code Quote Issue Fix
MEDIUM layer_audio_id.py:149-151 if isinstance(raw_conf, str): if raw_conf == 'low': result['confidence'] = 'medium' Confidence boost logic may not properly handle all SL confidence values. Existing code at line 463 shows SL can return "float, string, or garbage" but this only checks for string 'low'/'medium'/'high' - no handling for numeric strings like "0.7" Add fallback: convert string numbers to float first, or handle ValueError more broadly
LOW layer_audio_id.py:136-138 if (len(potential_series) >= 3 and potential_series.lower() != sl_title.lower() and potential_series.lower() != (result.get('author') or '').lower()): Magic number 3 for minimum series name length has no explanation Add comment: # Minimum length to avoid false matches like "The"

📋 Scope Verification

Issue Problem Addressed? Notes
#127 Use folder path to complete truncated author/title names from Skaldleita Lines 92-106 handle author/title completion with 4-char minimum prefix match
#127 Extract series info from path when missing from audio ID Lines 110-144 extract series from path patterns like "Series Name 01"
#127 Handle truncated titles Lines 100-106 handle title completion using same logic as author

Scope Status: SCOPE_OK - All requirements from Issue #127 are fully addressed.

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated (lines 9-18, clear description with examples)
  • README.md: N/A (Internal pipeline enhancement, not user-facing config change)

🎯 Verdict

REQUEST_CHANGES - Fix the confidence handling to prevent potential issues:

  1. MEDIUM issue: The confidence boost logic at lines 149-165 needs to handle numeric strings that SL might return. The existing code at line 463 already documents that SL can return "float, string, or garbage" but the string handling only checks for literal 'low'/'medium'/'high'. Add handling for numeric strings like "0.7" or add better fallback.

Suggested fix for line 149-165:

try:
    if isinstance(raw_conf, str):
        # Try to parse as numeric first
        try:
            numeric_conf = float(raw_conf)
            result['confidence'] = min(0.95, numeric_conf + 0.05) if numeric_conf <= 1 else min(95, numeric_conf + 5)
        except ValueError:
            # String confidence levels - bump up one tier
            if raw_conf == 'low':
                result['confidence'] = 'medium'
            elif raw_conf == 'medium':
                result['confidence'] = 'high'
            # 'high' stays high
    elif isinstance(raw_conf, (int, float)):
        # Numeric confidence - small boost (5%) for path agreement, cap at 0.95
        if raw_conf <= 1:
            result['confidence'] = min(0.95, raw_conf + 0.05)
        else:
            result['confidence'] = min(95, raw_conf + 5)
except (ValueError, TypeError):
    pass  # Leave confidence unchanged if we can't parse it
  1. LOW issue: Add explanatory comment for magic number at line 136.

@bucket-agent
Copy link

bucket-agent bot commented Feb 16, 2026

🔍 Vibe Check Review

Context

PR #157 adds path-based completion for truncated Skaldleita results. When Skaldleita returns partial names like "James S. A", the system uses folder path information to complete them to "James S. A. Corey". Also extracts series info from path structure when missing.

Codebase Patterns I Verified

  • Logging: Uses logger = logging.getLogger(__name__) pattern (line 42) ✅
  • Error handling: Project uses try/except Exception with logging, not bare except ✅
  • Type hints: Used on function signatures (-> Dict, -> bool) ✅
  • Imports: Module-level imports at top, with local import re previously in nested function (being fixed) ✅
  • Path handling: Uses pathlib.Path consistently throughout the file ✅
  • Confidence handling: Existing code handles both string ('low'/'medium'/'high') and numeric (0-100 or 0-1.0) confidence formats ✅

✅ Good

  1. Safety-first approach: 4-character minimum prefix match prevents false positives on trivial matches
  2. Conservative logic: Only completes (never replaces longer results with shorter path fragments)
  3. Defensive null handling: Proper checks for None and empty strings throughout (or '' pattern)
  4. Maintains existing confidence formats: Handles both string and numeric confidence values with try/except
  5. Good documentation: Clear docstrings explain the function purpose and edge cases
  6. Issue tracking: All integration points reference Issue [FEATURE] Use path info to complete partial Skaldleita results #127 for traceability
  7. Import cleanup: Moves import re from nested function to top-level (proper Python style)
  8. Path safety: Uses pathlib.Path which handles platform differences automatically
  9. Selective application: Only called after results exist, checks sanity_failed flag before applying

🚨 Issues Found

Severity File:Line Exact Code Quote Issue Fix
MEDIUM layer_audio_id.py:147 for part in parts[-4:-1]: Hardcoded path depth could miss series in deeper/shallower structures. No validation that parts has enough elements. Add length check: if len(parts) >= 4: for part in parts[-4:-1]: or use safer slicing parts[max(0, len(parts)-4):-1]
LOW layer_audio_id.py:162-164 if (len(potential_series) >= 3 and potential_series.lower() != sl_title.lower() and potential_series.lower() != (result.get('author') or '').lower()) Magic number 3 for minimum series name length is undocumented Add comment explaining why 3 chars, or define constant MIN_SERIES_NAME_LENGTH = 3

📋 Scope Verification

Issue Problem Addressed? Notes
#127 Use folder path info to complete truncated Skaldleita results (e.g., "James S. A" → "James S. A. Corey") Fully implemented with _is_truncated_version() checking 4-char minimum prefix match
#127 Extract series info from path when missing Extracts from parent directory patterns like "Series Name 01" or "Series Name - Book 01"

Scope Status: SCOPE_OK - PR fully addresses the original problem and scope expansions.

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated with version 0.9.0-beta.127 and clear description
  • README.md: ✅ Updated version badge to 0.9.0-beta.127

🎯 Verdict

APPROVE with minor suggestions

The code is safe to ship. The two issues flagged are:

  1. MEDIUM: Path depth check - While unlikely to crash (Python's negative indexing is forgiving), adding an explicit length check is defensive programming best practice
  2. LOW: Magic number documentation - Nice-to-have for maintainability

Neither issue is blocking. The core logic is sound:

  • Safe prefix matching with 4-char minimum
  • Conservative "only complete, never replace" approach
  • Proper null/empty handling
  • Confidence boost logic handles existing string/numeric formats
  • Integration points are well-placed (after Skaldleita, after AI, checking sanity_failed)

Recommendation: Merge as-is. The MEDIUM issue can be addressed in a follow-up if needed, but the negative slicing is Python-safe and unlikely to cause issues in practice.

@deucebucket deucebucket merged commit 78065aa into develop Feb 16, 2026
3 checks passed
@deucebucket deucebucket deleted the feature/issue-127-path-completion branch February 16, 2026 07:54
@bucket-agent
Copy link

bucket-agent bot commented Feb 16, 2026

🔍 Vibe Check Review

Context

PR #157 adds path-based completion to handle truncated Skaldleita results by using folder path information to complete partial author names, titles, and extract missing series information.

Codebase Patterns I Verified

Error Handling:

  • Existing code uses try/except (ValueError, TypeError) for type conversions (line 466 of layer_audio_id.py)
  • Silent failures with pass are acceptable for non-critical operations (e.g., language detection, path validation)
  • Functions return early on None/empty inputs rather than raising exceptions

Import Patterns:

  • Top-level imports at module start (lines 14-20)
  • The diff correctly moves import re from nested function (line 149) to top-level (line 17)
  • This follows the existing pattern and improves performance

Validation Patterns:

  • Minimum length checks (4 chars for authors in validation.py line 264)
  • Case-insensitive string comparisons with .lower()
  • Null/empty guards at function entry

Logging:

  • Uses logger.info() for successful operations
  • Uses logger.warning() for validation failures
  • Consistent [TAG] prefixes for categorization

✅ Good

  1. Safe prefix matching - Requires minimum 4-char match to avoid false positives like "The" matching "Therapy"
  2. Never replaces longer with shorter - Only completes truncated data, never overwrites complete data
  3. Defensive null checks - Guards against None/empty inputs at function entry
  4. Import optimization - Moves import re from nested function to module-level (performance improvement)
  5. Comprehensive documentation - Clear docstrings explaining the purpose and edge cases
  6. Confidence boost logic - Handles both string ('low', 'medium', 'high') and numeric (0-1 or 0-100) confidence formats
  7. Proper error handling - try/except (ValueError, TypeError) for confidence parsing with silent fallback
  8. Series extraction safeguards - Multiple validation checks:
    • Minimum 3-char series name to avoid "The", "No"
    • Doesn't match title or author
    • Checks last 3-4 path components only (not filesystem root)
  9. Integration points - Applied at three key locations:
    • After Skaldleita requeue results
    • After successful Skaldleita identification
    • After AI fallback (only if sanity check passed)
  10. CHANGELOG updated - Comprehensive entry describing the feature

🚨 Issues Found

Severity Location Issue Fix
MEDIUM layer_audio_id.py:116-130 No validation that path_author or path_title aren't garbage before using them to complete results. If folder is named "Unknown - Audiobook", this would replace valid SL data with garbage. Add validation: if path_author and not is_placeholder_author(path_author) and _is_truncated_version(...)
LOW layer_audio_id.py:139 Path(book_path) could raise exception if book_path is invalid. While wrapped in if book_path, malformed paths could still crash. Wrap in try/except or add path validation
LOW layer_audio_id.py:145-168 Complex nested logic (two regex patterns, multi-level conditionals) makes the series extraction fragile and hard to test. Consider extracting series logic into a separate _extract_series_from_path() helper function

📋 Scope Verification

Issue Problem Addressed? Notes
#127 Use folder path to complete truncated author names from Skaldleita Lines 116-122 handle author completion with 4-char minimum prefix match
#127 Complete truncated titles Lines 124-130 handle title completion with same safety checks
#127 Extract series info from path when missing Lines 134-168 extract series from parent directory patterns like "Series Name 01"

Scope Status: SCOPE_OK

All aspects of issue #127 are addressed. The implementation is conservative (only completes, never replaces) and includes multiple safety checks.

📝 Documentation Check

  • CHANGELOG.md: ✅ Updated (lines 10-17) - Comprehensive description of feature
  • README.md: ✅ Updated (version bump to 0.9.0-beta.127)

🎯 Verdict

APPROVE (with recommendations)

The implementation is solid and addresses issue #127 comprehensively. The code follows existing patterns and includes appropriate safety checks.

Recommendations for Future Improvement:

  1. Add garbage validation to prevent edge cases where placeholder folder names could pollute results:
# Complete author if truncated AND path_author is valid
sl_author = result.get('author', '') or ''
if (sl_author and path_author
    and not is_placeholder_author(path_author)  # <- Add this check
    and _is_truncated_version(sl_author, path_author)):
  1. Wrap Path() in try/except to handle malformed paths gracefully:
try:
    path_obj = Path(book_path) if book_path else None
except (ValueError, OSError):
    path_obj = None
  1. Consider refactoring series extraction into separate function for testability and clarity.

These are quality improvements, not blockers. The current implementation is safe because:

  • The 4-char minimum prevents most garbage matches
  • It only completes (never replaces longer data with shorter)
  • Integration at 3 points ensures all audio identification paths are covered
  • The import re is correctly moved to module level (performance improvement)

Merge Status: ✅ READY TO MERGE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant