fix: Author folder dedup and series-as-author rejection (#142, #143) #144
- Deduplicate author folders using fuzzy matching (difflib.SequenceMatcher >= 0.85) and standardized initials comparison to prevent duplicate folders like "James S.A. Corey" vs "James S. A. Corey"
- Reject BookDB results where author equals series name (Skaldleita #90 corrupt data, e.g. author "Laundry Files" instead of "Charles Stross")
- Defense-in-depth in BookProfile.finalize() to catch series-as-author from any source, with fallback to next-best author candidate
- Enable standardize_author_initials by default to normalize initials at the config level
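The three matching tiers named in this description (exact normalized, standardized initials, fuzzy ratio) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `normalize`, `standardize_initials`, and `find_match` are hypothetical helpers, and only the 0.85 threshold and the use of `difflib.SequenceMatcher` come from the description.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

FUZZY_THRESHOLD = 0.85  # threshold stated in the PR description


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (hypothetical normalization)."""
    return " ".join(name.lower().split())


def standardize_initials(name: str) -> str:
    """Rewrite initials such as 'S.A.' or 'S A' as 'S. A.' (illustrative only)."""
    tokens = []
    for token in name.replace(".", ". ").split():
        bare = token.rstrip(".")
        # Treat any single letter as an initial and give it a trailing dot.
        tokens.append(f"{bare}." if len(bare) == 1 and bare.isalpha() else token)
    return " ".join(tokens)


def find_match(author: str, existing: Iterable[str]) -> Optional[str]:
    """Return an existing folder name that matches `author`, or None."""
    folders = list(existing)

    # Tier 1: exact match after normalization.
    target = normalize(author)
    for folder in folders:
        if normalize(folder) == target:
            return folder

    # Tier 2: exact match after standardizing initials.
    target_init = normalize(standardize_initials(author))
    for folder in folders:
        if normalize(standardize_initials(folder)) == target_init:
            return folder

    # Tier 3: best fuzzy match at or above the threshold.
    best, best_ratio = None, 0.0
    for folder in folders:
        candidate = normalize(standardize_initials(folder))
        ratio = SequenceMatcher(None, target_init, candidate).ratio()
        if ratio >= FUZZY_THRESHOLD and ratio > best_ratio:
            best, best_ratio = folder, ratio
    return best
```

Trying the cheap exact comparisons first means the fuzzy pass only runs for names that differ by more than case, spacing, or initials formatting.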
🔍 Vibe Check Review
Context
PR #144 addresses duplicate author folders from name variants (#142) and series-name-as-author data corruption (#143).
Codebase Patterns I Verified
✅ Good
🚨 Issues Found
Notes on severity:
📋 Scope Verification
Scope Status: SCOPE_OK
Both original problems are fully addressed with defense-in-depth approach.
📝 Documentation Check
PR Title Format:
Recommended CHANGELOG entry:
## [0.9.0-beta.119] - 2026-02-10
### Fixed
- **Issue #142: Duplicate author folders from name variants** - Library Manager now
deduplicates author folders using fuzzy matching during path building. Prevents
separate folders like "James S.A. Corey" vs "James S. A. Corey" or "Alistair MacLean"
vs "Alistair Maclean". Uses 3-tier matching: exact normalized, standardized initials,
and SequenceMatcher fuzzy match (≥85% similarity).
- **Issue #143: Series name used as author folder** - Added defense-in-depth validation
to reject corrupt data where series name equals author name (e.g., "Laundry Files" as
author instead of "Charles Stross"). Validation occurs at both BookDB provider level
and BookProfile finalization, with automatic fallback to next-best author candidate.
- **Config default change**: `standardize_author_initials` now defaults to `True` to
improve author folder deduplication effectiveness.
🎯 Verdict
REQUEST_CHANGES
Fix these before merge:
The core logic is solid and addresses both issues correctly with good defensive patterns. The missing documentation is the blocker.
- Add CHANGELOG entry for beta.121 documenting both fixes
- Bump APP_VERSION to 0.9.0-beta.121
- Add return type hints to find_existing_author_folder() and _find_alternative_author()
- Add logger + warning when series-as-author detected in finalize()
🔍 Vibe Check Review
Context
PR #144 addresses duplicate author folders (Issue #142) via fuzzy matching deduplication and prevents series names from being used as author folders (Issue #143) via defensive filtering.
Codebase Patterns I Verified
Error Handling: This codebase uses broad
Type Hints: Minimal - only used for return types in some functions (e.g.,
Logging: Standard Python logging with module-level loggers. Info for normal operations, warning for data issues, debug for verbose details.
Path Operations: All path operations use
Config Access: Uses dict
String Normalization: Extensive use of
✅ Good
🚨 Issues Found
📋 Scope Verification
Scope Status: SCOPE_OK - Both issues fully addressed with defense-in-depth strategy.
📝 Documentation Check
Code Quality Notes
String Normalization Inconsistency (Non-blocking):

    if (self.author.value and self.series.value and
            str(self.author.value).lower().strip() == str(self.series.value).lower().strip()):

This performs the same normalization twice. While not incorrect, it's slightly inefficient. Better:

    bad_author = self.author.value
    bad_normalized = str(bad_author).lower().strip() if bad_author else None
    series_normalized = str(self.series.value).lower().strip() if self.series.value else None
    if bad_normalized and series_normalized and bad_normalized == series_normalized:

However, this is a minor style issue and doesn't affect correctness.
OSError Handling (Blocking):

    except OSError:
        return None

Silent failures violate observability. Should log at debug level minimum:

    except OSError as e:
        logger.debug(f"Error listing library directory {lib}: {e}")
        return None

🎯 Verdict
REQUEST_CHANGES
Required Fixes:
Recommended (Non-blocking):
- Replace os.listdir() with pathlib iterdir() for consistency
- Log OSError when listing library directory fails
- Pre-compute normalized strings in finalize() series-as-author check
- Clarify defense-in-depth comment in bookdb.py
- Remove unused os import
🔍 Vibe Check Review
Context
PR #144 adds author folder deduplication to prevent variants like "James S.A. Corey" vs "James S. A. Corey" from creating duplicate folders (Issue #142), and adds defensive filtering to reject when author equals series name (Issue #143).
Codebase Patterns I Verified
Error Handling:
Logging:
Type Hints:
Import Structure:
✅ Good
🚨 Issues Found
Quote verification for MEDIUM issue:

    # path_safety.py lines 214-226 (from diff):
    try:
        lib = Path(lib_path)
        if not lib.is_dir():
            return None
        # List only top-level directories
        existing_dirs = [d.name for d in lib.iterdir() if d.is_dir()]
    except OSError as e:
        logger.debug(f"Error listing library directory {lib_path}: {e}")  # <-- Line 222
        return None

The
📋 Scope Verification
Scope Status: SCOPE_OK Both issues fully addressed:
📝 Documentation Check
This is a fix: PR (not feat:/security:)
🎯 Verdict
APPROVE
This PR is well-implemented with strong defensive programming:
Minor improvements to consider (not blocking):
The MEDIUM severity issue is a logging level choice that doesn't affect functionality - the code handles the error correctly. The implementation is solid and ready to merge.
Summary
- `find_existing_author_folder()` uses normalized comparison, standardized initials matching, and `difflib.SequenceMatcher` (>= 0.85 threshold) to reuse existing folders instead of creating duplicates like "James S.A. Corey" alongside "James S. A. Corey"
- Reject BookDB results where `author_name` equals `series_name` (corrupt Skaldleita data per skaldleita#90). Defense-in-depth check in `BookProfile.finalize()` catches this from any source and falls back to next-best author candidate
- Enable `standardize_author_initials` by default: Normalizes "James S A Corey" → "James S. A. Corey" out of the box, reducing folder fragmentation

Closes #142
Closes #143
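The series-as-author check with fallback to the next-best candidate, as summarized above, might look something like this. It is a sketch under assumptions: `pick_author` and its flat list of ranked candidates are hypothetical and do not reflect how `BookProfile.finalize()` or `_find_alternative_author()` are actually structured.

```python
from typing import Optional, Sequence


def _norm(value: Optional[str]) -> str:
    """Normalize once for comparison; empty string for missing values."""
    return str(value).lower().strip() if value else ""


def pick_author(candidates: Sequence[str], series: Optional[str]) -> Optional[str]:
    """Return the first ranked candidate that is not just the series name.

    `candidates` is assumed to be ordered best-first (hypothetical input shape).
    """
    series_norm = _norm(series)
    for candidate in candidates:
        cand_norm = _norm(candidate)
        if not cand_norm:
            continue  # skip empty or missing candidates
        if series_norm and cand_norm == series_norm:
            continue  # series-as-author: treat as corrupt and try the next one
        return candidate
    return None  # no usable author; the caller decides how to handle this


# Corrupt top candidate falls through to the next-best author.
print(pick_author(["Laundry Files", "Charles Stross"], "Laundry Files"))
```

Returning `None` when every candidate is rejected leaves the choice of placeholder or error handling to the caller instead of silently keeping the corrupt value.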
Test plan
- `test-naming-issues.py`: 281/281, `ruff`: clean
- `author_lf/title` naming format also deduplicates