@jdesboeufs jdesboeufs commented Nov 1, 2025

Integrate improvements from cquest-11rc1 branch

This PR integrates relevant changes from the long-forgotten cquest-11rc1 branch (commits from 2019-2021), with careful refinements to avoid regressions.

📝 Summary

The cquest-11rc1 branch contained valuable improvements to synonyms and phonemicization rules that were never merged. After thorough analysis, we've extracted and refined the pertinent changes while preserving the superior cache implementation from main.

✨ Changes

1. New Synonyms (14 additions)

Add commonly used street type abbreviations:

  • cd => chemin departemental
  • chem => chemin (additional variant)
  • clef => cle, clefs => cles (orthographic variants)
  • dept => departement
  • gir => giratoire
  • habit => habitation (additional variant)
  • periph => peripherique
  • prl => parc residentiel de loisirs (official street type)
  • prm => promenade (additional variant)
  • rd => route departementale
  • rn => route nationale
  • rdpt => rond point (additional variant)
  • cvo => chemin vicinal
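As a rough illustration of how mappings in this `key => value` form can be applied to a query, here is a minimal sketch. The helper names `parse_synonyms` and `expand` are hypothetical and are not part of addok or addok_fr; the real synonym handling happens inside the addok token pipeline.

```python
# Hypothetical helpers for illustration; only the "key => value" line
# format and the sample mappings are taken from this PR.
SYNONYMS_TEXT = """
cd => chemin departemental
rd => route departementale
rn => route nationale
rdpt => rond point
periph => peripherique
"""


def parse_synonyms(text):
    """Parse 'abbr => expansion' lines into a dict, skipping other lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=>" not in line:
            continue
        key, value = (part.strip() for part in line.split("=>", 1))
        mapping[key] = value
    return mapping


def expand(query, mapping):
    """Replace each whitespace-separated token by its expansion, if any."""
    return " ".join(mapping.get(token, token) for token in query.lower().split())


SYNONYMS = parse_synonyms(SYNONYMS_TEXT)
print(expand("12 RD 906", SYNONYMS))  # 12 route departementale 906
```

With these mappings, a query typed as "rd 906" and an address indexed as "route departementale 906" end up with the same tokens.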

2. Enhanced Phonemicization Rules

Improved French phonemicization with targeted rules:

New transformations:

  • vowel+mp+consonant → vowel+n+consonant (e.g., champvallon → chanvalon)
  • ei+gn → ei+ni, only after "ei" (e.g., seigneur → senieur)
    • ⚠️ More targeted than original to preserve common words like "montagne"
  • je+vowel → j+vowel (e.g., georges → jorj instead of jeorj)
  • anc$ → an (e.g., blanc → blan)
  • y at word beginning → i
  • eim → aim (e.g., pforzheim → pforzaim)
  • Better ae/ei → e conversion in word context
  • Enhanced oe/oeu → eu handling (e.g., boeufs → beu)
  • Fixed duplicate letter removal to handle multiple repetitions
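To make the rule style concrete, here is a minimal sketch of an ordered list of regex substitutions covering four of the transformations above. The patterns, their order, and the name `phonemicize_sketch` are assumptions made for illustration; this is not the rule table in addok_fr/utils.py, which contains many more rules.

```python
import re

# Illustrative subset only: patterns and ordering are assumptions,
# not the actual rules shipped in addok_fr/utils.py.
RULES = [
    # vowel + "mp" + consonant -> vowel + "n" + consonant
    (re.compile(r"([aeiouy])mp([bcdfghjklmnpqrstvwxz])"), r"\1n\2"),
    # final "anc" -> "an"
    (re.compile(r"anc$"), "an"),
    # word-initial "y" -> "i"
    (re.compile(r"^y"), "i"),
    # collapse a run of the same character, whatever its length
    (re.compile(r"(.)\1+"), r"\1"),
]


def phonemicize_sketch(word):
    """Apply each substitution in order to a lowercase word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


print(phonemicize_sketch("champvallon"))  # chanvalon
print(phonemicize_sketch("blanc"))        # blan
print(phonemicize_sketch("champol"))      # champol ("mp" before a vowel is kept)
```

Using `(.)\1+` rather than `(.)\1` is one way to handle runs of three or more identical letters in a single pass.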

11 new test cases covering improved patterns including:

  • blotzheim, pforzheim (Alsatian place names)
  • georges, gorges, seigneur
  • champvallon, champol, montbon, montgros
  • blanc, montee, hyppolyte
  • boeufs

🔍 What We Didn't Keep

Cache Implementation:
The cquest-11rc1 branch used a simple dictionary cache. We kept the lru_cache implementation from main, which provides:

  • ✅ Better memory management with automatic size limiting
  • ✅ Configurable via PHONEMICIZE_CACHE_SIZE
  • ✅ More performant LRU eviction strategy
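For readers unfamiliar with the difference, the sketch below shows how a size-bounded `functools.lru_cache` evicts the least recently used entry, which a plain dictionary cache never does. `make_cached` and `slow_transform` are hypothetical stand-ins, and the `maxsize` argument plays the role that PHONEMICIZE_CACHE_SIZE plays in the project; how that setting is actually read is not shown here.

```python
from functools import lru_cache


def slow_transform(word):
    """Stand-in for the real phonemicization work (assumption)."""
    return word.lower()


def make_cached(func, maxsize):
    """Wrap func in an LRU cache bounded to maxsize entries."""
    return lru_cache(maxsize=maxsize)(func)


cached = make_cached(slow_transform, maxsize=2)
for word in ["blanc", "georges", "blanc", "seigneur", "georges"]:
    cached(word)

info = cached.cache_info()
# "georges" is evicted when "seigneur" arrives, so its second lookup misses.
print(info.hits, info.misses, info.currsize)  # 1 4 2
```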

Original gn→ni rule:
The original rule gn([aeio])→ni\1 was too broad and would transform "montagne"→"montani" (regression). Our version only applies after "ei" ((?<=ei)gn([aeiouy])→ni\1), preserving common French patterns while fixing specific cases.
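The difference between the two patterns can be checked in isolation with `re.sub`. This sketch applies only the gn rule, so the outputs are intermediate forms; the final forms quoted above (for example "senieur") presumably also depend on other rules in the pipeline.

```python
import re

# The two patterns quoted above, applied on their own.
BROAD = (re.compile(r"gn([aeio])"), r"ni\1")
TARGETED = (re.compile(r"(?<=ei)gn([aeiouy])"), r"ni\1")


def apply_rule(rule, word):
    """Apply a single (pattern, replacement) pair to a word."""
    pattern, replacement = rule
    return pattern.sub(replacement, word)


print(apply_rule(BROAD, "montagne"))     # montanie (the unwanted rewrite)
print(apply_rule(TARGETED, "montagne"))  # montagne (untouched: no "ei" before "gn")
print(apply_rule(TARGETED, "seigneur"))  # seinieur
```

Because the lookbehind `(?<=ei)` inspects the text as it stands when the rule runs, the targeted rule has to be applied before any rule that rewrites "ei".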

✅ Testing

  • All 122 tests pass
  • No regressions on existing test cases
  • Added comprehensive test coverage for new phonemicization patterns

📚 Related

  • Closes analysis of cquest-11rc1 branch
  • The cquest-11rc1 branch can be archived/deleted after this merge
  • Fixes patterns mentioned in issues #464, #480, #668 (from original commits)

🎯 Impact

These changes improve address search quality for:

  • Common street abbreviations (rd, rn, periph, gir, etc.)
  • Alsatian and Germanic place names (pforzheim, blotzheim, etc.)
  • Names with "ei+gn" pattern (seigneur, etc.)
  • Compound place names (champvallon, montbon, etc.)
  • Orthographic variants (clef/clé)

Add 14 new synonyms to improve street type recognition:
- cd => chemin departemental
- chem => chemin
- clef => cle, clefs => cles
- dept => departement
- gir => giratoire
- habit => habitation
- periph => peripherique
- prl => parc residentiel de loisirs
- prm => promenade
- rd => route departementale
- rn => route nationale
- rdpt => rond point
- cvo => chemin vicinal

These synonyms come from the cquest-11rc1 branch (commits 2019-2021)
and are still relevant to improve address search.

Add enhanced phonemicization rules with careful refinements:

- Add mp->n conversion in vowel+mp+consonant context (champvallon->chanvalon)
- Add targeted ei+gn->ni rule for specific cases (seigneur->senieur)
- Improve je->j handling (georges->jorj instead of jeorj)
- Add anc ending simplification (blanc->blan)
- Handle y at word beginning
- Improve eim->aim handling (pforzheim->pforzaim)
- Better ae/ei->e conversion
- Enhanced oeu/oe->eu handling (boeufs->beu)
- Fix duplicate letter removal to handle multiple repetitions

Rules are more targeted than original cquest-11rc1 to avoid regressions:
- gn->ni only after 'ei' (preserves montagne->montagn, not montani)
- Preserves common French patterns while improving edge cases

Add 11 new test cases covering the improved phonemicization patterns.
All 122 tests pass.
@jdesboeufs jdesboeufs self-assigned this Nov 1, 2025
@jdesboeufs jdesboeufs requested a review from Copilot November 1, 2025 13:47
Copilot AI left a comment

Pull Request Overview

This PR enhances French phonemicization rules and expands synonym mappings to improve text normalization for address processing. The changes refine the phonetic transformation algorithm to better handle edge cases and special character combinations in French place names.

Key Changes

  • Updated phonemicization rules for better handling of French phonetic patterns (e.g., "oe"/"oeu" diphthongs, "mp" combinations, "gn" after "ei", initial "y")
  • Added 9 new test cases to validate the updated phonetic transformations
  • Expanded synonym mappings with 10 new abbreviations for common French address terms

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| addok_fr/utils.py | Enhanced phonemicization rules with improved regex patterns for French phonetics, including better diphthong handling and duplicate letter removal |
| tests/test_utils.py | Updated existing test expectations and added new test cases to validate the improved phonemicization rules |
| addok_fr/resources/synonyms.txt | Added new synonym mappings for abbreviated address terms and fixed formatting inconsistencies |

@jdesboeufs jdesboeufs merged commit 8d2642c into main Nov 1, 2025
8 checks passed
@jdesboeufs jdesboeufs deleted the integrate-cquest-11rc1 branch November 1, 2025 13:53