@marquesafonso commented Oct 25, 2025

This pull request adds multilingual support to NanoBEIREvaluator and SparseNanoBEIREvaluator for the following languages: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Italian (it), Norwegian (no), Portuguese (pt), and Swedish (sv). English remains the default. The PR also adds a validator for the new language argument and a test for invalid languages.

The solution uses the lightonai/nanobeir-multilingual dataset, which is derived from the original NanoBEIR collection.

The code style was kept similar to that of the existing NanoBEIREvaluator.py file:

  • Added a LanguageType Literal for validation + _validate_language method.
  • Renamed dataset_name_to_id to dataset_name_to_subset_id (as the multilingual dataset is only one dataset with multiple subsets).
  • Added new language argument + descriptive annotation.
  • Changed the _load_dataset method to properly load the corpus, queries and qrels for each dataset/subset.
  • Added a test_nanobeir_evaluator_invalid_language test in tests/evaluation.
  • Added the new language argument + descriptive annotation to SparseNanoBEIREvaluator to reflect the changes made in NanoBEIREvaluator.
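For readers skimming the list above, here is a minimal sketch of the validation idea (a Literal type plus a check that rejects unsupported codes). It is an illustration only, not the code in this PR: the standalone function name `validate_language` and the error message wording are assumptions, and the actual `_validate_language` method may differ.

```python
from typing import Literal, get_args

# Language codes listed in the PR description.
LanguageType = Literal["ar", "de", "en", "es", "fr", "it", "no", "pt", "sv"]


def validate_language(language: str = "en") -> str:
    """Return the language code if it is supported, else raise ValueError.

    Hypothetical standalone helper; the PR implements this as a method.
    """
    supported = get_args(LanguageType)  # tuple of the Literal's values
    if language not in supported:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {list(supported)}"
        )
    return language


if __name__ == "__main__":
    print(validate_language())      # default stays English
    print(validate_language("pt"))  # a supported non-English code
    try:
        validate_language("xx")     # unsupported code is rejected
    except ValueError as err:
        print(f"rejected: {err}")
```

Deriving the allowed values from the Literal with `get_args` keeps the type annotation and the runtime check in sync, so adding a language only requires touching one place.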

All tests in the evaluation folder pass, with the exception of one skipped test.


I hope you find this PR helpful and can merge it. Multilingual support in NanoBEIREvaluator and SparseNanoBEIREvaluator would make it easy to quickly test and benchmark retrievers in other languages!

I am happy to address any comments or suggested improvements.

Best regards,
Afonso

@marquesafonso changed the title from "Add multilingual NanoBEIREvaluator support" to "Add multilingual NanoBEIREvaluator and SparseNanoBEIREvaluator support" on Oct 27, 2025