Normalizer for russian #296

aignatovich · 2024-06-26T12:02:11Z

Pull Request

Normalizer for russian

Related issue

No related issue.

Why could these changes be helpful?

In written Russian it is permissible to use "е" in place of its diacritical variant "ё" (e.g. "ёжик" -> "ежик").
The current search behavior, observed with the latest Meilisearch version available at the time of writing, is questionable:
- Case 1: Search Query: "Ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ёжик", Expected: Both
- Case 2: Search Query: "Еж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both
- Case 3: Search Query: "ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both, or at least "Ёжик". This result seems to be incorrect.

If my assumptions are correct, this change may affect some of the cases above, though that still has to be validated.

What does this PR do?

Performs a grammatically permissible normalization of "ё" into "е" for the Russian language, given that compatibility decomposition already replaces the 1-codepoint version with the 2-codepoint version.
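As an illustration of the normalization described above, here is a minimal standalone sketch. It is not the code from this PR: `fold_yo` is a hypothetical helper name, and it handles both the precomposed code points and the decomposed form ("е" followed by U+0308 COMBINING DIAERESIS) without using any charabia types.

```rust
/// Hypothetical helper (not part of the PR): folds Cyrillic "ё"/"Ё" into "е"/"Е".
/// Handles the precomposed code points (U+0451, U+0401) and the decomposed
/// form, i.e. "е"/"Е" followed by U+0308 COMBINING DIAERESIS.
fn fold_yo(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    // True when the last character pushed was a plain Cyrillic "е" or "Е".
    let mut prev_is_ie = false;
    for c in input.chars() {
        match c {
            // Precomposed lowercase "ё" -> "е".
            '\u{0451}' => {
                out.push('\u{0435}');
                prev_is_ie = false;
            }
            // Precomposed uppercase "Ё" -> "Е".
            '\u{0401}' => {
                out.push('\u{0415}');
                prev_is_ie = false;
            }
            // Decomposed form: drop the diaeresis that follows "е"/"Е".
            '\u{0308}' if prev_is_ie => {
                prev_is_ie = false;
            }
            _ => {
                prev_is_ie = c == '\u{0435}' || c == '\u{0415}';
                out.push(c);
            }
        }
    }
    out
}
```

A diaeresis that follows any other letter (for example Latin "e") is left untouched, so the folding stays specific to the Russian "ё".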

PR checklist

Please check if your PR fulfills the following requirements:

[ ❓ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[ 🟢 ] Have you read the contributing guidelines?
[ 🟢 ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

curquiza · 2024-07-04T16:26:42Z

@aignatovich thanks for your PR, let us know when it is ready for review 😊

I see there is currently a Rustfmt issue (see the CI)

ManyTheFish · 2024-07-10T06:48:07Z

Hello @aignatovich,
Thank you for your PR. Your PR seems good to me.
However, I think an existing normalizer already covers the work of the Russian normalizer: the NonspacingMarkNormalizer, whose goal is to remove all nonspacing marks, including diacritics. The only change needed to activate it is to add Cyrillic to the eligible scripts in the should_normalize function.

Let me know if this modification works.

See you!
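The suggestion above can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration: the `Script` enum is a stand-in for charabia's own type, the eligible-script list is not charabia's actual list, and the mark check only covers the Combining Diacritical Marks block (U+0300 to U+036F) rather than the full nonspacing-mark category. It also assumes the text has already been decomposed.

```rust
/// Illustrative stand-in for a script tag; not charabia's `Script` type.
#[derive(Clone, Copy, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Other,
}

/// Simplified check: only the Combining Diacritical Marks block.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Illustrative eligibility predicate; adding `Cyrillic` here mirrors the
/// proposed one-line change.
fn should_normalize(script: Script) -> bool {
    matches!(script, Script::Latin | Script::Cyrillic)
}

/// Removes combining marks from already-decomposed text of an eligible script.
fn strip_marks(text: &str, script: Script) -> String {
    if !should_normalize(script) {
        return text.to_string();
    }
    text.chars().filter(|c| !is_combining_mark(*c)).collect()
}
```

With `Cyrillic` in the predicate, a decomposed "ё" (U+0435 followed by U+0308) loses its diaeresis and becomes a plain Cyrillic "е"; text in a non-eligible script passes through unchanged.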

aignatovich · 2024-07-22T00:08:38Z

Hi @ManyTheFish,

The proposed solution (adding Cyrillic to the predicate in NonspacingMarkNormalizer) did not produce the expected result. The solution in this PR does not change search behavior as expected either.

Conclusions from debugging the Case 2 scenario above: during indexing, "Ё" is normalized to "e" (yes, that is Latin), but this does not happen for the search query; instead, "Е" (Cyrillic) is normalized to "е" (Cyrillic).

I will post an update as soon as I know more.

ManyTheFish

Hello @aignatovich,

thank you for your contribution,

let's merge this!

bors merge

meili-bors · 2024-08-28T06:13:41Z

Merge conflict.

ManyTheFish · 2024-08-28T06:22:57Z

bors merge

ManyTheFish

bors merge

meili-bors · 2024-08-28T06:23:16Z

Already running a review

296: Normalizer for russian r=ManyTheFish a=aignatovich

Co-authored-by: Arty I <work.artyignatovich@gmail.com>
Co-authored-by: Many the fish <many@meilisearch.com>

meili-bors · 2024-08-28T06:33:32Z

This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried.

Additional information:

{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches","status":"422"}

aignatovich · 2024-08-29T19:07:25Z

Hi @ManyTheFish ,

There is still work for me to do in this PR to solve the issue described.

Would you be able to indicate which behavior is preferred in each of those scenarios?

Conclusions from debugging the Case 2 scenario above: during indexing, "Ё" is normalized to "e" (yes, that is Latin), but this does not happen for the search query; instead, "Е" (Cyrillic) is normalized to "е" (Cyrillic).

This inconsistency between the normalization of the query and that of the document is, to the best of my understanding, the cause of the issue.

- Should Cyrillic -> Latin normalization take place for the input query?
OR
- Should normalization of "Ё" (Cyrillic) -> "e" (Latin) not take place during document indexing?
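The mismatch described for Case 2 can be reproduced with a toy model. This is only a sketch of the reported observation, not the actual Meilisearch pipeline: `index_side` and `query_side` are hypothetical names, and the model does not attempt to explain Case 1 or Case 3.

```rust
/// Hypothetical index-time pipeline mimicking the reported behavior:
/// text is lowercased and "ё" ends up as Latin "e" (U+0065).
fn index_side(text: &str) -> String {
    text.chars()
        .flat_map(char::to_lowercase)
        .map(|c| if c == '\u{0451}' { 'e' } else { c })
        .collect()
}

/// Hypothetical query-time pipeline: lowercasing only, so Cyrillic "Е"
/// becomes Cyrillic "е" (U+0435) and never turns into Latin "e".
fn query_side(text: &str) -> String {
    text.chars().flat_map(char::to_lowercase).collect()
}

/// Prefix match between a normalized query and a normalized document.
fn matches(query: &str, document: &str) -> bool {
    index_side(document).starts_with(query_side(query).as_str())
}
```

In this model the query "Еж" matches the document "Ежик" but not "Ёжик", because the document side starts with a Latin "e" while the query side starts with a Cyrillic "е". Either of the two options listed above removes that particular asymmetry.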

ManyTheFish · 2024-09-10T09:23:36Z

Hello @aignatovich,

I don't understand why the suggestion from my earlier comment doesn't work:

Thank you for your PR. Your PR seems good to me.
However, I think an existing normalizer already covers the work of the Russian normalizer: the NonspacingMarkNormalizer, whose goal is to remove all nonspacing marks, including diacritics. The only change needed to activate it is to add Cyrillic to the eligible scripts in the should_normalize function.

Re-reading everything, I understand that a normalizer converting Cyrillic characters that look like Latin ones into those Latin characters should work if it runs after the Lowercase normalizer.

Something like:

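// Imports assume this file sits alongside charabia's other normalizers.
use std::collections::HashMap;

use once_cell::sync::Lazy;

use super::{CharNormalizer, CharOrStr};
use crate::{Script, Token};

// Cyrillic characters that look like Latin ones, mapped to their Latin counterpart.
static SPOOFING_VARIANTS: Lazy<HashMap<char, char>> = Lazy::new(|| {
    [
        ('е', 'e'), // Cyrillic U+0435 -> Latin U+0065
    ].into_iter().collect()
});

pub struct CyrillicVariantsNormalizer;

impl CharNormalizer for CyrillicVariantsNormalizer {
    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
        // Replace known look-alikes; pass every other character through unchanged.
        match SPOOFING_VARIANTS.get(&c) {
            // `get` returns a `&char`, so dereference before converting.
            Some(replacement) => Some((*replacement).into()),
            None => Some(c.into()),
        }
    }

    // Only run on Cyrillic tokens that contain at least one mapped character.
    fn should_normalize(&self, token: &Token) -> bool {
        token.script == Script::Cyrillic && token.lemma.chars().any(|c| SPOOFING_VARIANTS.contains_key(&c))
    }
}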

Conclusions made from debugging Case 2 scenario above: during indexing, normalization of "Ё" -> "e" (yes, that is a latin) takes place, but this does not happen for the input search term, instead "Е" (cyrillic) is normalized into "е" (Cyrillic).

Converting to Latin sounds good to me.
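To see why the ordering matters, here is a standalone sketch of the same idea with no charabia types. The function names are hypothetical, and dropping every U+0308 is a simplification that stands in for the decomposition and nonspacing-mark steps.

```rust
/// Hypothetical pipeline applied to both queries and documents:
/// 1. lowercase, 2. drop the combining diaeresis, 3. map Cyrillic
/// look-alikes to Latin.
fn normalize(text: &str) -> String {
    text.chars()
        // Lowercasing runs first, so the map below only needs lowercase keys.
        .flat_map(char::to_lowercase)
        // Simplification: remove U+0308 so a decomposed "ё" becomes "е".
        .filter(|c| *c != '\u{0308}')
        .map(|c| match c {
            '\u{0451}' => 'e', // precomposed Cyrillic "ё" -> Latin "e"
            '\u{0435}' => 'e', // Cyrillic "е" -> Latin "e"
            other => other,
        })
        .collect()
}

/// Prefix match where both sides go through the same pipeline.
fn prefix_match(query: &str, document: &str) -> bool {
    normalize(document).starts_with(normalize(query).as_str())
}
```

Because the query and the document share one pipeline, "Ёж", "Еж" and "ёж" all become the same string and each matches both "Ежик" and "Ёжик" in this sketch. If the look-alike mapping ran before lowercasing, an uppercase Cyrillic "Е" would not be found in a map that only contains the lowercase key.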

Normalizer for russian

0bccf7b

Applying Rustfmt

1872257

ManyTheFish previously approved these changes Aug 28, 2024

Merge branch 'main' into basic-cyrillic-normalization

2edcf4a

ManyTheFish dismissed their stale review via 2edcf4a August 28, 2024 06:22

ManyTheFish approved these changes Aug 28, 2024

aignatovich marked this pull request as ready for review August 30, 2024 16:39
