Code for training and evaluation of our KaLM-Embedding models.
For a more comprehensive understanding of the technical details, please refer to our paper KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.
- Training
  - Ranking Consistency Filtering
  - Semi-homogeneous Task Batching (see the sketch after this list)
  - Matryoshka Representation Learning
- Evaluation
  - Multi-GPU Asynchronous Computation
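As a rough illustration of semi-homogeneous task batching, the sketch below fills most of each batch from a single task and tops it up with examples sampled across all tasks, so in-batch negatives stay largely task-consistent without becoming fully homogeneous. The `task_id` field, `homogeneous_ratio` parameter, and function name are illustrative assumptions, not the repository's actual interface.

```python
import random
from collections import defaultdict

def semi_homogeneous_batches(examples, batch_size, homogeneous_ratio=0.8, seed=0):
    """Yield batches drawn mostly from a single task.

    `examples` is a list of dicts with a "task_id" key (an assumed schema).
    With `homogeneous_ratio=0.8`, roughly 80% of each batch comes from one
    task; the remainder is sampled from the global pool.
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task_id"]].append(ex)
    for pool in by_task.values():
        rng.shuffle(pool)

    n_home = max(1, int(batch_size * homogeneous_ratio))
    for pool in by_task.values():
        for start in range(0, len(pool), n_home):
            home = pool[start:start + n_home]
            # Top up the batch with examples sampled across all tasks.
            filler = rng.sample(examples, batch_size - len(home))
            yield home + filler

data = [{"task_id": t, "text": f"{t}-{i}"}
        for t in ("sts", "retrieval", "classification") for i in range(8)]
print(next(semi_homogeneous_batches(data, batch_size=5)))
```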
```bash
conda env create -f environment.yaml
conda activate kalm
```
```bash
bash ./scripts/hn_mine.sh
```
You can customize the `filter_topk` parameter to set the threshold for ranking consistency filtering.
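For intuition, here is a minimal sketch of the filtering criterion: a mined training example is kept only when the embedding model ranks the labeled positive within the top `filter_topk` candidates (the positive plus its mined negatives). The function name and the assumption of L2-normalized numpy embeddings are ours; see `./scripts/hn_mine.sh` and the paper for the actual pipeline.

```python
import numpy as np

def passes_ranking_consistency(query_emb, pos_emb, neg_embs, filter_topk=5):
    """Return True if the positive ranks in the top-`filter_topk` candidates.

    Embeddings are assumed L2-normalized, so dot product == cosine similarity.
    """
    scores = np.concatenate(([query_emb @ pos_emb], neg_embs @ query_emb))
    pos_rank = 1 + int((scores > scores[0]).sum())  # 1-based rank of the positive
    return pos_rank <= filter_topk

# Toy check with random normalized vectors.
rng = np.random.default_rng(0)
q, pos = (v / np.linalg.norm(v) for v in rng.normal(size=(2, 896)))
negs = rng.normal(size=(32, 896))
negs /= np.linalg.norm(negs, axis=1, keepdims=True)
print(passes_ranking_consistency(q, pos, negs, filter_topk=5))
```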
```bash
bash ./scripts/train.sh
```
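The training objective includes Matryoshka representation learning; as a hedged sketch, the loss below averages an in-batch InfoNCE objective over nested prefix dimensions of the query/passage embeddings, so truncated embeddings remain usable on their own. The dimension list and temperature are placeholder values, not the repository's configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(q, p, dims=(64, 128, 256, 512, 896), temperature=0.05):
    """Average an in-batch InfoNCE loss over nested prefix dimensions.

    q, p: [batch, full_dim] query and passage embeddings; row i of `p` is
    the positive for row i of `q`, and all other rows serve as in-batch
    negatives.
    """
    labels = torch.arange(q.size(0), device=q.device)
    loss = q.new_zeros(())
    for d in dims:
        q_d = F.normalize(q[:, :d], dim=-1)
        p_d = F.normalize(p[:, :d], dim=-1)
        logits = q_d @ p_d.T / temperature  # [batch, batch] similarity matrix
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(dims)

# Toy check with random embeddings.
q, p = torch.randn(8, 896), torch.randn(8, 896)
print(matryoshka_infonce(q, p))
```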
We provide code for evaluating on MTEB with multiple GPUs: each task in the task set is dispatched to a single GPU through a queue, which improves evaluation efficiency.
```bash
bash ./scripts/eval_mteb.sh
```
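Conceptually, the multi-GPU evaluation behaves like the queue-based sketch below: all tasks go into a shared queue, one worker process is pinned to each GPU, and each worker pulls the next task as soon as it is free, so slow tasks do not block the rest. `run_mteb_task` is a hypothetical stand-in for the actual per-task MTEB evaluation call.

```python
import multiprocessing as mp
import os

def run_mteb_task(task_name):
    # Hypothetical placeholder for loading the model and running one MTEB task.
    print(f"[GPU {os.environ['CUDA_VISIBLE_DEVICES']}] evaluating {task_name}")

def worker(gpu_id, task_queue):
    # Pin this process to a single GPU before any CUDA initialization.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    while True:
        task = task_queue.get()
        if task is None:  # Sentinel: no tasks left.
            break
        run_mteb_task(task)

def evaluate(tasks, num_gpus):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    for t in tasks:
        queue.put(t)
    for _ in range(num_gpus):
        queue.put(None)  # One sentinel per worker.
    procs = [ctx.Process(target=worker, args=(g, queue)) for g in range(num_gpus)]
    for pr in procs:
        pr.start()
    for pr in procs:
        pr.join()

if __name__ == "__main__":
    evaluate(["STS22", "Banking77Classification", "MSMARCO"], num_gpus=2)
```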
Below, we present a portion of the MTEB results. For a more comprehensive analysis, please refer to our technical report.
| Model Name | Model Size | MTEB(zh) | MTEB(en) | MTEB(fr) | MTEB(pl) | Avg. |
|---|---|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.54 | 60.89 | 55.64 | 60.08 | 58.79 |
| bge-m3 (dense) | 560M | 61.07 | 59.57 | 58.79 | 60.35 | 59.95 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 59.79 | 58.22 | 60.53 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 60.59 | 54.79 | 59.89 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.04 | 58.16 | 62.62 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 63.08 | 57.05 | 62.30 |
Our training code was forked from FlagOpen/FlagEmbedding. We have made modifications to suit our specific needs, but the core functionality and structure derive from their excellent work. Please check out their repository for more details!
Please cite this repository if you use our models or code.
```bibtex
@article{hu2025kalm,
  title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
  author={Hu, Xinshuo and Shan, Zifei and Zhao, Xinping and Sun, Zetian and Liu, Zhenyu and Li, Dongfang and Ye, Shaolin and Wei, Xinyuan and Chen, Qian and Hu, Baotian and others},
  journal={arXiv preprint arXiv:2501.01028},
  year={2025}
}
```
This repository is licensed under the MIT License.