fix: Author folder dedup and series-as-author rejection (#142, #143) #144

deucebucket · 2026-02-10T12:51:35Z

Summary

Author folder deduplication ([BUG] Duplicate author folders from name variants — no deduplication at folder level #142): New find_existing_author_folder() uses normalized comparison, standardized initials matching, and difflib.SequenceMatcher (>= 0.85 threshold) to reuse existing folders instead of creating duplicates like "James S.A. Corey" alongside "James S. A. Corey"
Series-as-author rejection ([BUG] Book title or narrator used as author folder name #143): Defensive filter in BookDB provider discards results where author_name equals series_name (corrupt Skaldleita data per skaldleita#90). Defense-in-depth check in BookProfile.finalize() catches this from any source and falls back to next-best author candidate
Enable standardize_author_initials by default: Normalizes "James S A Corey" → "James S. A. Corey" out of the box, reducing folder fragmentation

Closes #142
Closes #143

Test plan

Code checks pass (test-naming-issues.py — 281/281, ruff — clean)
Sandbox: scan library with known duplicate-author books, verify folders consolidate
Sandbox: test with series that has corrupt author data (e.g., Laundry Files), verify author is discarded and falls back correctly
Verify author_lf/title naming format also deduplicates
Verify existing libraries with correct single folders are unaffected (exact match returns early)

- Deduplicate author folders using fuzzy matching (difflib.SequenceMatcher >= 0.85) and standardized initials comparison to prevent duplicate folders like "James S.A. Corey" vs "James S. A. Corey" - Reject BookDB results where author equals series name (Skaldleita #90 corrupt data, e.g. author "Laundry Files" instead of "Charles Stross") - Defense-in-depth in BookProfile.finalize() to catch series-as-author from any source, with fallback to next-best author candidate - Enable standardize_author_initials by default to normalize initials at the config level

bucket-agent · 2026-02-10T12:54:24Z

🔍 Vibe Check Review

Context

PR #144 addresses duplicate author folders from name variants (#142) and series-name-as-author data corruption (#143).

Codebase Patterns I Verified

Error handling: Project uses specific exceptions (OSError) with try/except blocks, returns None on failure
Type hints: Public functions use type hints, internal helpers sometimes omit them (especially returns)
Logging: Uses logger = logging.getLogger(__name__) pattern consistently
Path operations: Uses pathlib.Path with defensive OSError catching
Confidence scoring: SOURCE_WEIGHTS dict defines provider trust levels (user:100, audio:85, bookdb:65, etc.)

✅ Good

Defensive programming: Series-as-author check occurs in finalize() as defense-in-depth, catching corrupt data from any source
Smart fallback: _find_alternative_author() reuses existing confidence calculation logic to find next-best candidate
Three-tier matching: Exact → Initials → Fuzzy matching provides robust deduplication without being too aggressive
Safe failure modes: Returns None on errors, doesn't crash
Proper OSError handling: Catches filesystem errors during directory listing
Config default change: Enabling standardize_author_initials by default helps the dedup logic work better

🚨 Issues Found

Severity	File:Line	Exact Code Quote	Issue	Fix
LOW	path_safety.py:138 (new)	`def find_existing_author_folder(lib_path, target_author):`	Missing return type hint	Add `-> Optional[str]:`
LOW	book_profile.py:342 (new)	`def _find_alternative_author(self, bad_value: str):`	Missing return type hint	Add `-> Optional[Tuple[str, int]]:`
MEDIUM	book_profile.py:322-336	`# Defense-in-depth: reject author that equals series name` block	No logging when series-as-author is detected	Add `logger.warning(f"[PROFILE] Series-as-author detected: '{bad_author}' == '{self.series.value}', switching to alternative")` before line 328

Notes on severity:

Type hints are LOW because the project has mixed usage (public functions have them, internal helpers don't always)
Logging is MEDIUM because this is a significant data quality issue that operators should be aware of in logs

📋 Scope Verification

Issue	Problem	Addressed?	Notes
#142	Duplicate author folders from name variants	✅	`find_existing_author_folder()` implements 3-tier matching (exact, initials, fuzzy >= 0.85). Applied to both standard and LF formats.
#143	Series name used as author folder name	✅	Two-layer defense: 1) BookDB provider rejects at source, 2) BookProfile.finalize() catches from any provider. Uses `_find_alternative_author()` to find next-best candidate.
#90	(Context) Skaldleita bug causing series-as-author	✅	Referenced correctly; defense is provider-agnostic

Scope Status: SCOPE_OK

Both original problems are fully addressed with defense-in-depth approach.

📝 Documentation Check

PR Title Format: fix: prefix ✅ (addresses bugs, not feat/security)

CHANGELOG.md: ❌ Missing - Should add entry for beta.119 documenting both fixes
README.md: N/A (no user-facing feature changes)

Recommended CHANGELOG entry:

## [0.9.0-beta.119] - 2026-02-10

### Fixed

- **Issue #142: Duplicate author folders from name variants** - Library Manager now 
  deduplicates author folders using fuzzy matching during path building. Prevents 
  separate folders like "James S.A. Corey" vs "James S. A. Corey" or "Alistair MacLean" 
  vs "Alistair Maclean". Uses 3-tier matching: exact normalized, standardized initials, 
  and SequenceMatcher fuzzy match (≥85% similarity).
- **Issue #143: Series name used as author folder** - Added defense-in-depth validation 
  to reject corrupt data where series name equals author name (e.g., "Laundry Files" as 
  author instead of "Charles Stross"). Validation occurs at both BookDB provider level 
  and BookProfile finalization, with automatic fallback to next-best author candidate.
- **Config default change**: `standardize_author_initials` now defaults to `True` to 
  improve author folder deduplication effectiveness.

🎯 Verdict

REQUEST_CHANGES

Fix these before merge:

REQUIRED: Add CHANGELOG.md entry documenting the fixes (see recommended entry above)
Recommended: Add type hints to new functions (-> Optional[str] and -> Optional[Tuple[str, int]])
Recommended: Add logging when series-as-author corruption is detected in BookProfile.finalize()

The core logic is solid and addresses both issues correctly with good defensive patterns. The missing documentation is the blocker.

- Add CHANGELOG entry for beta.121 documenting both fixes - Bump APP_VERSION to 0.9.0-beta.121 - Add return type hints to find_existing_author_folder() and _find_alternative_author() - Add logger + warning when series-as-author detected in finalize()

bucket-agent · 2026-02-10T12:58:24Z

🔍 Vibe Check Review

Context

PR #144 addresses duplicate author folders (Issue #142) via fuzzy matching deduplication and prevents series names from being used as author folders (Issue #143) via defensive filtering.

Codebase Patterns I Verified

Error Handling: This codebase uses broad except Exception as e: consistently across all providers and utilities. No specific exception types are caught - logging and returning None/fallback values is the standard pattern.

Type Hints: Minimal - only used for return types in some functions (e.g., Optional[str], Tuple[str, int]). Parameter types are rarely annotated.

Logging: Standard Python logging with module-level loggers. Info for normal operations, warning for data issues, debug for verbose details.

Path Operations: All path operations use pathlib.Path. The codebase has strong path sanitization via sanitize_path_component() which is CRITICAL safety infrastructure.

Config Access: Uses dict .get() with defaults consistently, no assumptions about config completeness.

String Normalization: Extensive use of .lower().strip() for comparisons, with helper functions like _normalize_author_for_matching() following established patterns from naming.py.

✅ Good

Excellent fuzzy matching strategy - 3-tier approach (exact → standardized initials → difflib) is well-reasoned and progressive
Smart deduplication placement - Applied at path-building time catches ALL naming formats (author/title, author_lf/title, custom templates)
Defense-in-depth for series-as-author - Filtered at BookDB provider AND in BookProfile.finalize() catches corrupt data from ANY source
Proper alternative author selection - _find_alternative_author() replicates the confidence calculation logic correctly
Config default change justified - standardize_author_initials: True reduces fragmentation, documented in CHANGELOG
Safety preserved - All new code paths use sanitize_path_component() before path construction
CHANGELOG is exemplary - Clear problem statements, technical details, references to Skaldleita issues

🚨 Issues Found

Severity	Location	Issue	Fix
MEDIUM	`path_safety.py:219`	`os.listdir()` used instead of `pathlib` - inconsistent with codebase patterns	Use `list(lib.iterdir())` with `.is_dir()` filter instead
MEDIUM	`path_safety.py:220`	OSError caught but not logged - silent failure violates observability	Add `logger.debug(f"Error listing library: {e}")` before `return None`
LOW	`path_safety.py:268`	Function missing from `__all__` export list initially (fixed at line 289)	Already corrected in diff - OK
LOW	`book_profile.py:83-84`	String comparison uses `.lower().strip()` on both sides - inefficient	Pre-compute `bad_normalized = str(self.author.value).lower().strip()` once
INFO	`bookdb.py:156-158`	Series-as-author check duplicates logic from BookProfile - intentional defense-in-depth	Consider adding comment: "# Intentional duplication - defense at provider level"

📋 Scope Verification

Issue	Problem	Addressed?	Notes
#142	Duplicate author folders from name variants	✅ FULLY	`find_existing_author_folder()` implements 3-tier fuzzy matching (exact, standardized initials, SequenceMatcher ≥85%). Applied to both standard and `author_lf/title` formats.
#143	Series name used as author folder	✅ FULLY	Defensive filter in BookDB provider (line 156-158) + defense-in-depth in BookProfile.finalize() (line 80-96) with automatic fallback to next-best author candidate.

Scope Status: SCOPE_OK - Both issues fully addressed with defense-in-depth strategy.

📝 Documentation Check

CHANGELOG.md: ✅ Updated - Excellent detail with technical explanation, references to Skaldleita issue Feature #67: ISBN Lookup for Ebooks #90, examples of prevented duplicates
README.md: N/A - Internal bug fix, no user-visible feature changes requiring README update

Code Quality Notes

String Normalization Inconsistency (Non-blocking):
Lines 83-84 in book_profile.py:

if (self.author.value and self.series.value and
        str(self.author.value).lower().strip() == str(self.series.value).lower().strip()):

This performs the same normalization twice. While not incorrect, it's slightly inefficient. Better:

bad_author = self.author.value
bad_normalized = str(bad_author).lower().strip() if bad_author else None
series_normalized = str(self.series.value).lower().strip() if self.series.value else None
if bad_normalized and series_normalized and bad_normalized == series_normalized:

However, this is a minor style issue and doesn't affect correctness.

OSError Handling (Blocking):
Line 220 in path_safety.py:

except OSError:
    return None

Silent failures violate observability. Should log at debug level minimum:

except OSError as e:
    logger.debug(f"Error listing library directory {lib}: {e}")
    return None

🎯 Verdict

REQUEST_CHANGES

Required Fixes:

path_safety.py:219-220 - Replace os.listdir() with pathlib pattern + add logging to OSError handler:

try:
    lib = Path(lib_path)
    if not lib.is_dir():
        return None
    
    # List only top-level directories
    existing_dirs = [d.name for d in lib.iterdir() if d.is_dir()]
except OSError as e:
    logger.debug(f"Error listing library directory {lib_path}: {e}")
    return None

Recommended (Non-blocking):

book_profile.py:83-84 - Pre-compute normalized strings to avoid redundant .lower().strip() calls
bookdb.py:156 - Add comment explaining intentional duplication: # Defense-in-depth: also checked in BookProfile.finalize(), but catching here prevents bad data propagation

- Replace os.listdir() with pathlib iterdir() for consistency - Log OSError when listing library directory fails - Pre-compute normalized strings in finalize() series-as-author check - Clarify defense-in-depth comment in bookdb.py - Remove unused os import

bucket-agent · 2026-02-10T13:02:00Z

🔍 Vibe Check Review

Context

PR #144 adds author folder deduplication to prevent variants like "James S.A. Corey" vs "James S. A. Corey" from creating duplicate folders (Issue #142), and adds defensive filtering to reject when author equals series name (Issue #143).

Codebase Patterns I Verified

Error Handling:

Uses specific exceptions (OSError, ValueError) with logging
Example from path_safety.py:566: except ValueError: with descriptive logger.error
except Exception as e used in utils/ for non-critical operations

Logging:

Uses module-level logger = logging.getLogger(__name__) consistently
Pattern: logger.warning(), logger.info(), logger.debug(), logger.error()
Found in path_safety.py:286, 346, 449, 506, 563, 567

Type Hints:

NOT consistently used in this codebase (0 matches for -> Type)
Functions use docstrings instead of type annotations
New code in this PR does not need type hints to match existing patterns

Import Structure:

from difflib import SequenceMatcher - stdlib imports used directly
from library_manager.utils.naming import standardize_initials - internal cross-module imports used

✅ Good

Defense-in-depth approach - Checks both in BookDB provider (line 155-160) AND BookProfile.finalize() (lines 80-96) to catch corrupt data from any source
Excellent fallback logic - _find_alternative_author() finds next-best candidate instead of just clearing author field
Fuzzy matching with 85% threshold - Good balance between too strict and too loose
Three-tier matching strategy - Exact → Standardized initials → Fuzzy (lines 230-250)
Proper imports - from difflib import SequenceMatcher and standardize_initials imported cleanally
Config default change documented - CHANGELOG explains why standardize_author_initials now defaults to True
Applied to both naming formats - Dedup works for standard author/title AND author_lf/title formats (lines 269-272, 280-283)

🚨 Issues Found

Severity	Location	Issue	Fix
MEDIUM	path_safety.py:214-226	`OSError` caught but directory listing failure returns `None` silently - caller can't distinguish "no match" from "couldn't check"	Log at `logger.warning()` level instead of `debug` so directory access failures are visible
LOW	path_safety.py:198-259	New `find_existing_author_folder()` function lacks docstring parameter/return documentation	Add explicit parameter docs for `lib_path` and `target_author` types
LOW	book_profile.py:102-141	`_find_alternative_author()` has complete docstring but could benefit from example in docstring	Consider adding example: `# Example: bad_value="Laundry Files", finds "Charles Stross" from raw_values`

Quote verification for MEDIUM issue:

# path_safety.py lines 214-226 (from diff):
    try:
        lib = Path(lib_path)
        if not lib.is_dir():
            return None

        # List only top-level directories
        existing_dirs = [d.name for d in lib.iterdir() if d.is_dir()]
    except OSError as e:
        logger.debug(f"Error listing library directory {lib_path}: {e}")  # <-- Line 222
        return None

The logger.debug() on line 222 means a real filesystem error (permissions, disk failure) is hidden at debug level. This could cause confusion when investigating why deduplication isn't working.

📋 Scope Verification

Issue	Problem	Addressed?	Notes
#142	Duplicate author folders from name variants	✅	`find_existing_author_folder()` with 3-tier matching (exact, initials, fuzzy 85%) prevents duplicates
#143	Series name used as author folder	✅	Double defense: BookDB provider rejects at source (bookdb.py:155-160), BookProfile.finalize() catches from any provider (book_profile.py:80-96) with smart fallback
#90	(Referenced in comments)	N/A	Skaldleita bug context only - not addressed in Library Manager PR

Scope Status: SCOPE_OK

Both issues fully addressed:

[BUG] Duplicate author folders from name variants — no deduplication at folder level #142: Dedup logic applied in build_new_path() before creating folders (lines 267-272, 279-283)
[BUG] Book title or narrator used as author folder name #143: Corrupt data filtered at BookDB ingress + defense-in-depth check in BookProfile + automatic fallback to alternative author

📝 Documentation Check

This is a fix: PR (not feat:/security:)

CHANGELOG.md: ✅ Updated with detailed fix descriptions for both [BUG] Duplicate author folders from name variants — no deduplication at folder level #142 and [BUG] Book title or narrator used as author folder name #143
README.md: N/A (bug fixes, no user-visible feature changes)

🎯 Verdict

APPROVE

This PR is well-implemented with strong defensive programming:

Solves both issues completely
Defense-in-depth approach (multiple layers catch the same problem)
Smart fallback logic instead of just failing
Applied consistently across naming format variants
Good logging for debugging

Minor improvements to consider (not blocking):

Raise OSError logging from debug to warning in line 222 - directory access failures are worth knowing about
Add parameter type docs to find_existing_author_folder() docstring for clarity

The MEDIUM severity issue is a logging level choice that doesn't affect functionality - the code handles the error correctly. The implementation is solid and ready to merge.

deucebucket mentioned this pull request Feb 10, 2026

[TECHDEBT] Document Python 3.9+ requirement and DB connection patterns #136

Closed

deucebucket merged commit 3dbbb0c into develop Feb 10, 2026
3 checks passed

deucebucket deleted the fix/issue-142-author-dedup branch February 10, 2026 13:03

This was referenced Feb 10, 2026

[BUG] Duplicate author folders from name variants — no deduplication at folder level #142

Closed

[BUG] Book title or narrator used as author folder name #143

Closed

[BUG] More Search Errors #140

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Author folder dedup and series-as-author rejection (#142, #143) #144

fix: Author folder dedup and series-as-author rejection (#142, #143) #144

Uh oh!

deucebucket commented Feb 10, 2026

Uh oh!

bucket-agent bot commented Feb 10, 2026

Uh oh!

bucket-agent bot commented Feb 10, 2026

Uh oh!

bucket-agent bot commented Feb 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

fix: Author folder dedup and series-as-author rejection (#142, #143) #144

fix: Author folder dedup and series-as-author rejection (#142, #143) #144

Uh oh!

Conversation

deucebucket commented Feb 10, 2026

Summary

Test plan

Uh oh!

bucket-agent bot commented Feb 10, 2026

🔍 Vibe Check Review

Context

Codebase Patterns I Verified

✅ Good

🚨 Issues Found

📋 Scope Verification

📝 Documentation Check

🎯 Verdict

Uh oh!

bucket-agent bot commented Feb 10, 2026

🔍 Vibe Check Review

Context

Codebase Patterns I Verified

✅ Good

🚨 Issues Found

📋 Scope Verification

📝 Documentation Check

Code Quality Notes

🎯 Verdict

Uh oh!

bucket-agent bot commented Feb 10, 2026

🔍 Vibe Check Review

Context

Codebase Patterns I Verified

✅ Good

🚨 Issues Found

📋 Scope Verification

📝 Documentation Check

🎯 Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant