[BUG] Duplicate author folders from name variants — no deduplication at folder level #142

@deucebucket

Problem

When multiple sources report slightly different author names, Library Manager creates separate author folders for each variant instead of normalizing to a single canonical name.

Examples from community tester (Merijeek, #140)

| Folder 1 | Folder 2 | Same person? |
| --- | --- | --- |
| Alistair Maclean | Alistair McLean | Yes |
| Craig Alanson | Craig Allenson | Yes (AI hallucination) |
| James S. A. Corey | James S.A. Corey | Yes (initials formatting) |
| Rick Yancey | Richard Yancey | Yes (nickname vs legal name) |
| N.K. Jemisin | N. K. Jemisin | Yes (initials formatting) |

Root cause

The path building system (build_new_path() in path_safety.py) treats author names as exact strings. There's no fuzzy matching or deduplication when deciding which existing author folder to use.

Current tooling that EXISTS but doesn't solve this:

  • standardize_initials(): converts J.R.R. to J. R. R., but is disabled by default (standardize_author_initials config)
  • is_garbage_author_match(): uses word-overlap (Jaccard similarity) for validation, but only at a 0.2 threshold, and it is not used during folder creation
  • calculate_title_similarity(): Jaccard on word sets, so it can't catch "Maclean" vs "McLean" (different words)
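
To illustrate why word-set Jaccard is the wrong tool here, a minimal sketch follows. `word_jaccard` is a hypothetical stand-in written for this example, not the project's `calculate_title_similarity()`; it assumes plain lowercase whitespace tokenization.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (illustrative stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


pairs = [
    ("Alistair Maclean", "Alistair McLean"),  # same person, spelling variant
    ("Craig Alanson", "Craig Allenson"),      # same person, misspelling
    ("Craig Alanson", "Craig Johnson"),       # different surname entirely
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {word_jaccard(a, b):.2f}")
```

Under this tokenization, every pair shares one word out of three distinct words, so a near-miss spelling scores the same as a completely different surname. Whole-word overlap carries no information about how close two spellings are.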

What's needed

When building a rename path, before creating a new author folder:

  1. Check existing author folders in the library for fuzzy matches
  2. Use Levenshtein distance or phonetic matching (Metaphone/Soundex) on author names
  3. If a close match exists (>85% similarity), use the existing folder name
  4. Handle common patterns:
    • Nickname variants: Rick/Richard, Bob/Robert, Bill/William
    • Initial formatting: J.R.R. / J. R. R. / JRR
    • Surname spelling: MacLean/McLean/Maclean
    • Punctuation differences: N.K. vs N. K.
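
A rough sketch of the lookup described above, under several assumptions: `author_key`, `resolve_author_folder`, and the `NICKNAMES` table are hypothetical names invented for this example, and the standard library's `difflib.SequenceMatcher` ratio is swapped in for Levenshtein distance to avoid a third-party dependency. It is not how `build_new_path()` works today.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would need many more entries.
NICKNAMES = {"rick": "richard", "bob": "robert", "bill": "william"}


def author_key(name: str) -> str:
    """Reduce an author name to a comparison key (illustrative only)."""
    tokens = []
    # Force a break after every period so "S.A." splits into "S." and "A."
    for raw in name.replace(".", ". ").split():
        word = raw.strip(".,")
        if not word:
            continue
        if word.isupper() and len(word) <= 3:
            tokens.extend(word.lower())  # "JRR" -> "j", "r", "r"
        else:
            lowered = word.lower()
            tokens.append(NICKNAMES.get(lowered, lowered))
    return " ".join(tokens)


def resolve_author_folder(author, existing_folders, threshold=0.85):
    """Return the closest existing folder name, or the author name unchanged."""
    key = author_key(author)
    best, best_score = None, 0.0
    for folder in existing_folders:
        score = SequenceMatcher(None, key, author_key(folder)).ratio()
        if score > best_score:
            best, best_score = folder, score
    # Reuse an existing folder only above the similarity threshold.
    return best if best is not None and best_score > threshold else author


folders = ["Alistair Maclean", "Craig Alanson", "Rick Yancey", "N.K. Jemisin"]
for incoming in ["Alistair McLean", "Craig Allenson", "Richard Yancey",
                 "N. K. Jemisin", "Craig Johnson"]:
    print(f"{incoming!r} -> {resolve_author_folder(incoming, folders)!r}")
```

With these sample folders, the four variants from the table resolve to the existing folder, while "Craig Johnson" stays below the threshold and keeps its own name. The 0.85 default mirrors the threshold proposed in the list; a real implementation would also want to consider phonetic keys and the risk of merging distinct authors with similar names.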

Additional context

The standardize_author_initials config flag should probably be enabled by default — there's no good reason to create James S.A. Corey and James S. A. Corey as separate folders.
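
For reference, initials normalization of this kind can be very small. The sketch below is a hypothetical `spaced_initials` helper, not the project's `standardize_initials()`, and it assumes the canonical form puts a space after each dotted initial.

```python
import re


def spaced_initials(name: str) -> str:
    """Insert a space between adjacent dotted initials (illustrative only)."""
    # "S.A." -> "S. A."; a period already followed by a space is left alone.
    return re.sub(r"\b([A-Z])\.(?=[A-Z])", r"\1. ", name)


for variant in ["James S.A. Corey", "James S. A. Corey", "N.K. Jemisin"]:
    print(f"{variant!r} -> {spaced_initials(variant)!r}")
```

Both spellings of the Corey name map to the same string, so they would land in one folder. Undotted forms such as JRR are not handled by this pattern.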

Labels: P2-high (major functionality broken, no workaround), bug (something isn't working)