This repository provides the official implementation of the paper:
**Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval** (EMNLP 2025 Industry Track)

Authors: Yohan Lee, Yongwoo Song, Sangyeop Kim
The Conversational Data Retrieval (CDR) Benchmark provides a systematic framework for evaluating retrieval models on conversational datasets. Unlike traditional document retrieval, conversation data features multi-turn complexity, implicit signals, and temporal progression that conventional methods fail to capture.
This benchmark consists of:
- 1,583 queries and 9,146 conversations across five key analytical areas
- Comprehensive evaluation framework for conversational retrieval systems
- Benchmark results for 16 leading embedding models
Our research reveals significant performance gaps in current embedding models, with even top performers achieving only moderate success (NDCG@10: 0.5036) and particularly struggling with conversation dynamics.
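Results throughout are reported as NDCG@10, which measures how highly the relevant conversations are ranked among a model's top ten results (1.0 is a perfect ranking). For intuition, a minimal binary-relevance NDCG@k in Python might look like the sketch below; it is illustrative only, not the benchmark's evaluation code.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: 1.0 means all relevant items appear first."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the top hit contributes 1/log2(2) = 1.0
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Toy example: relevant conversations ranked 1st and 3rd out of three retrieved.
print(ndcg_at_k(["c3", "c7", "c1"], {"c3", "c1"}))  # ~0.92
```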
| Analytical Area | Description | Product Insights |
|---|---|---|
| Emotion & Feedback | Identifying users' emotional states and feedback in conversations | Revealing satisfaction patterns and pain points for product improvement |
| Intent & Purpose | Recognizing user intentions and goals | Evaluating alignment between intended and actual AI system usage |
| Conversation Dynamics | Analyzing conversation flow, turn structure and resolution patterns | Identifying conversation bottlenecks and improving dialogue completion rates |
| Trust, Safety & Ethics | Exploring trust-building and ethical issues in conversations | Identifying system reliability concerns and potential safety risks |
| Linguistic Style & Expression | Analyzing language patterns and comprehension challenges | Helping calibrate system language to user comprehension levels |
| General Statistics | |
|---|---|
| Number of conversations | 9,146 |
| Number of queries | 1,583 |
| Avg. messages per conversation | 5.4 |
| Avg. tokens per query | 10.26 |
| Avg. tokens per conversation | 464 |
| Avg. relevant conversations per query | 20.44 |

| Query Task Distribution | |
|---|---|
| Intent & Purpose | 36.1% |
| Emotion & Feedback | 20.1% |
| Linguistic Style & Expression | 15.9% |
| Trust, Safety & Ethics | 14.6% |
| Conversation Dynamics | 13.4% |
```bash
# Create and activate conda environment
conda create -n cdr python=3.12 -y
conda activate cdr

# Install requirements
pip install -r requirements.txt
```

The test data is organized as follows:

```
cdr_benchmark_data/test_dataset/data/test
├── corpus.json
├── queries.json
└── relevant_docs.json
```
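As a quick sanity check before running anything, the three files can be loaded with plain Python. The snippet below assumes the common id-to-entry mapping layout; print an entry to confirm the actual schema:

```python
import json
from pathlib import Path

data_dir = Path("cdr_benchmark_data/test_dataset/data/test")

# Load corpus (conversations), queries, and query-to-relevant-conversation labels.
corpus = json.loads((data_dir / "corpus.json").read_text())
queries = json.loads((data_dir / "queries.json").read_text())
relevant_docs = json.loads((data_dir / "relevant_docs.json").read_text())

print(f"{len(corpus)} conversations, {len(queries)} queries")

# Peek at one query and its labeled relevant conversations
# (assumes dict-style files keyed by query id; adjust if the schema differs).
qid = next(iter(queries))
print(qid, queries[qid])
print(relevant_docs[qid])
```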
Generate embeddings for the test set, then run inference:

```bash
python -m src.data.make_test_embedding --model_name "MODEL_NAME_OR_DIRS" --batch_size 4
python -m src.inference --model_names "MODEL_NAME_OR_DIRS" --batch_size 4
```

Or simply run:
```bash
bash scripts/run_inference.sh
```

All models perform strongly on the Emotion & Feedback and Intent & Purpose categories but struggle significantly with Conversation Dynamics tasks, where even the best models score below 0.17 NDCG@10.
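For intuition about what the inference step computes, here is a minimal, illustrative sketch of embedding-based conversation retrieval. The model name and example texts are placeholders; the benchmark's actual pipeline lives in `src/inference`.

```python
# Illustrative sketch only: not the benchmark's actual retrieval pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; use any model under test

query = "conversations where the user expresses frustration about a refund"
conversations = [
    "user: This is the third time I'm asking for my money back!\nassistant: ...",
    "user: How do I reset my password?\nassistant: ...",
]

# Embed the query and whole conversations into one vector space,
# then rank conversations by cosine similarity to the query.
query_emb = model.encode(query, convert_to_tensor=True)
conv_embs = model.encode(conversations, convert_to_tensor=True)
scores = util.cos_sim(query_emb, conv_embs)[0]

for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. score={scores[idx].item():.3f} :: {conversations[idx][:40]}...")
```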
We utilized multiple open-source dialogue datasets under their respective licenses. Please refer to the LICENSE file and the paper appendix for detailed terms.
If you use this benchmark in your research, please cite our paper:
```bibtex
@inproceedings{cdr-benchmark-2025,
  title={Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval},
  author={Lee, Yohan and Song, Yongwoo and Kim, Sangyeop},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  year={2025}
}
```

