PolySoundex is a multi-language Soundex implementation that allows developers to configure Soundex mappings for multiple languages and use them for phonetic search and indexing. The library supports various alphabets, including English and Nepali (Devanagari script), and can be extended to support additional languages.
This documentation provides a guide for configuring and using PolySoundex in your .NET applications, specifically with ASP.NET Core dependency injection.
-
Add the PolySoundex Library to your .NET project (assume it's available as a NuGet package).
-
Configure Dependency Injection in your application using the
AddPolySoundex
method, where you define language-specific Soundex rules and regex patterns for identifying different languages.
Here is an example of how you can configure PolySoundex for both English and Nepali languages using the AddPolySoundex
extension method.
In your Program.cs
(for ASP.NET Core applications):
builder.Services.AddPolySoundex(config =>
{
// English Soundex Configuration
config.Add(new PolySoundexConfig()
{
Requirements = new Dictionary<int, char[]>()
{
{ 1, new[] { 'B', 'F', 'P', 'V' } },
{ 2, new[] { 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' } },
{ 3, new[] { 'D', 'T' } },
{ 4, new[] { 'L' } },
{ 5, new[] { 'M', 'N' } },
{ 6, new[] { 'R' } }
},
LanguageIdentifierRegex = new Regex("^[A-Za-z\\s]+$") // Regex for English alphabet
});
// Nepali Soundex (Devanagari script) Configuration
config.Add(new PolySoundexConfig
{
Requirements = new Dictionary<int, char[]>()
{
{ 1, new[] { 'क', 'ख', 'ग', 'घ' } },
{ 2, new[] { 'च', 'छ', 'ज', 'झ', 'ट', 'ठ', 'ड', 'ढ' } },
{ 3, new[] { 'त', 'थ', 'द', 'ध' } },
{ 4, new[] { 'न' } },
{ 5, new[] { 'प', 'फ', 'ब', 'भ', 'म' } },
{ 6, new[] { 'य', 'र', 'ल', 'व' } }
},
LanguageIdentifierRegex = new Regex("^[\u0900-\u097F\\s]+$") // Regex for Devanagari (Nepali) script
});
});
- Requirements: A dictionary mapping integers (Soundex codes) to arrays of characters that share the same phonetic value. For example, in English, 'B', 'F', 'P', and 'V' are treated similarly by Soundex.
- LanguageIdentifierRegex: A regular expression used to identify the input language. In this case:
^[A-Za-z]+$
matches English characters.^[\u0900-\u097F]+$
matches Nepali (Devanagari script) characters.
You can now use the PolySoundex
service in your app to generate Soundex values. Below is an example of how to create an API endpoint that takes user input and returns the Soundex value:
app.MapGet("/", (string input, IPolySoundex soundex) =>
{
try
{
var items = new List<(string name, string address)>()
{
new("Biraj", "123 Main St."),
new("Ram", "456 Elm St."),
new("Hari", "789 Oak St.")
};
var inputSoundex = soundex.GenerateSoundex(input); // assume input is "Braj" the soundex code is "B620"
var matchingItems = items
.Select(i => new { i.name, i.address, Soundex = soundex.GenerateSoundex(i.name.ToUpper()) })
.Where(i => i.Soundex == inputSoundex)
.ToList(); // returns { name = "Biraj", address = "123 Main St.", Soundex = "B620" }
return Results.Ok(matchingItems);
}
catch (Exception e)
{
return Results.BadRequest(e.Message);
}
});
- Accepts a
GET
request with astring
input. - Uses the
PolySoundex
service (injected viaIPolySoundex
) to generate the Soundex code for the given input. - Returns the Soundex result or an error if the language is not supported.
-
English Example:
- Input:
Biraj
- Soundex Result:
B620
- Input:
-
Nepali Example:
- Input:
बिराज
(Biraj in Devanagari) - Soundex Result:
ब620
- Input:
To add support for other languages:
- Define the Soundex rules for the language in the
Requirements
dictionary. - Create an appropriate regular expression for identifying the alphabet or script of the language.
- Add the new
PolySoundexConfig
in theAddPolySoundex
configuration.
Example for another language:
config.Add(new PolySoundexConfig()
{
Requirements = new Dictionary<int, char[]>()
{
{ 1, new[] { 'А', 'О', 'У', 'Э', 'Ю', 'Я' } }, // Example Cyrillic characters
// Add other mappings for the Cyrillic alphabet
},
LanguageIdentifierRegex = new Regex("^[\u0400-\u04FF]+$") // Cyrillic script
});
- Unsupported Language: If the input does not match any configured languages, the
PolySoundexException
will be thrown with the messageLanguage not supported.
- Invalid Input: If the input string does not contain valid characters after being cleaned, an
ArgumentException
will be thrown with a message likeInput string must contain at least one valid character after cleaning.
- Empty Input: The Soundex function will return the input as-is.
- Digits Only Input: Soundex will return the input unchanged (useful for phone numbers).
- Input with Mixed Languages: The
LanguageIdentifierRegex
ensures that input is matched against the correct language. If the input doesn’t fully conform to one language, an error will be thrown.
PolySoundex offers a flexible and extendable approach to handling multi-language phonetic search and Soundex generation. By configuring language-specific rules and utilizing regex for language identification, you can adapt it to various languages like English, Nepali, and more.
This makes it a powerful tool for building search functionality in applications where phonetic matching of names or other text is crucial, particularly in multi-lingual environments like legal, governmental, or educational systems.