Closed
Labels: P2-high (major functionality broken, no workaround), bug (something isn't working)
Description
Problem
When multiple sources report slightly different author names, Library Manager creates separate author folders for each variant instead of normalizing to a single canonical name.
Examples from community tester (Merijeek, #140)
| Folder 1 | Folder 2 | Same person? |
|---|---|---|
| Alistair Maclean | Alistair McLean | Yes |
| Craig Alanson | Craig Allenson | Yes (AI hallucination) |
| James S. A. Corey | James S.A. Corey | Yes (initials formatting) |
| Rick Yancey | Richard Yancey | Yes (nickname vs legal name) |
| N.K. Jemisin | N. K. Jemisin | Yes (initials formatting) |
Root cause
The path building system (build_new_path() in path_safety.py) treats author names as exact strings. There's no fuzzy matching or deduplication when deciding which existing author folder to use.
Current tooling that EXISTS but doesn't solve this:
- standardize_initials(): handles J.R.R. → J. R. R., but is disabled by default (standardize_author_initials config)
- is_garbage_author_match(): uses word-overlap (Jaccard similarity) for validation, but only at a 0.2 threshold, and is not used during folder creation
- calculate_title_similarity(): Jaccard on word sets, so it can't catch "Maclean" vs "McLean" (different words)
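
To illustrate why word-set Jaccard misses spelling variants, here is a minimal sketch comparing it with a character-level ratio from the standard library. The helper names `jaccard_words` and `char_ratio` are hypothetical stand-ins written for this example; they are not the project's `calculate_title_similarity()` or `is_garbage_author_match()`.

```python
from difflib import SequenceMatcher


def jaccard_words(a: str, b: str) -> float:
    """Word-set Jaccard similarity (illustrative stand-in, not project code)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# "maclean" and "mclean" are different words, so only "alistair" overlaps.
print(round(jaccard_words("Alistair Maclean", "Alistair McLean"), 3))  # 0.333
# At the character level the two names differ by a single letter.
print(round(char_ratio("Alistair Maclean", "Alistair McLean"), 3))     # 0.968
```

A word-level score treats a one-letter spelling difference the same as a completely different surname, which is why a character-level or phonetic measure is a better fit for this problem.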
What's needed
When building a rename path, before creating a new author folder:
- Check existing author folders in the library for fuzzy matches
- Use Levenshtein distance or phonetic matching (Metaphone/Soundex) on author names
- If a close match exists (>85% similarity), use the existing folder name
- Handle common patterns:
  - Nickname variants: Rick/Richard, Bob/Robert, Bill/William
  - Initial formatting: J.R.R./J. R. R./JRR
  - Surname spelling: MacLean/McLean/Maclean
  - Punctuation differences: N.K. vs N. K.
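
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `match_key`, `resolve_author_folder`, and the `NICKNAMES` table are hypothetical names invented here, and `difflib.SequenceMatcher` stands in for Levenshtein or phonetic matching because it ships with the standard library. How this would hook into build_new_path() is not shown.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable

# Illustrative nickname table (assumption); a real one would be much larger.
NICKNAMES = {"rick": "richard", "bob": "robert", "bill": "william"}


def match_key(name: str) -> str:
    """Build a comparison key: lowercase, no punctuation, initials merged."""
    cleaned = re.sub(r"[^\w\s]|_", " ", name.casefold())
    tokens, run = [], ""  # run collects consecutive single-letter initials
    for tok in cleaned.split():
        if len(tok) == 1:
            run += tok
            continue
        if run:
            tokens.append(run)
            run = ""
        tokens.append(NICKNAMES.get(tok, tok))
    if run:
        tokens.append(run)
    return " ".join(tokens)


def resolve_author_folder(author: str, existing: Iterable[str],
                          threshold: float = 0.85) -> str:
    """Return the closest existing folder name above threshold, else author."""
    key = match_key(author)
    best_name, best_score = author, 0.0
    for folder in existing:
        score = SequenceMatcher(None, key, match_key(folder)).ratio()
        if score > best_score:
            best_name, best_score = folder, score
    return best_name if best_score > threshold else author


folders = ["Alistair Maclean", "Craig Alanson", "Rick Yancey", "N.K. Jemisin"]
print(resolve_author_folder("Alistair McLean", folders))  # Alistair Maclean
print(resolve_author_folder("Richard Yancey", folders))   # Rick Yancey
print(resolve_author_folder("N. K. Jemisin", folders))    # N.K. Jemisin
```

One design caveat: a purely character-based threshold can also merge genuinely different authors whose names happen to be close, so a real implementation would likely want a review step or an allow-list before reusing a folder.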
Additional context
The standardize_author_initials config flag should probably be enabled by default — there's no good reason to create James S.A. Corey and James S. A. Corey as separate folders.
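
For reference, the kind of normalization this flag controls can be sketched in a few lines. `space_initials` is a hypothetical helper written for this example and is not the project's standardize_initials(); it assumes ASCII initials only.

```python
import re


def space_initials(name: str) -> str:
    """Put exactly one space after each single-letter initial and its period."""
    spaced = re.sub(r"\b([A-Za-z])\.\s*", r"\1. ", name)
    # Collapse any doubled whitespace and trim the trailing space.
    return re.sub(r"\s+", " ", spaced).strip()


print(space_initials("James S.A. Corey"))  # James S. A. Corey
print(space_initials("N.K. Jemisin"))      # N. K. Jemisin
```

With this applied before the path is built, both spellings of each name in the table above would resolve to the same folder.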
Related
- [BUG] More Search Errors #140 - Merijeek's report of duplicate author folders
- [FEATURE] Use path info to complete partial Skaldleita results #127 - Use path info to complete partial SL results
- Smart Pre-Processing Pipeline: Validation, Triage, and Corrections #110 - Smart Pre-Processing Pipeline
- deucebucket/skaldleita#78 - Narrator name fuzzy dedup