This repository provides the official implementation of the paper:
**Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval** (EMNLP 2025 Industry Track)

Authors: Yohan Lee, Yongwoo Song, Sangyeop Kim
The Conversational Data Retrieval (CDR) Benchmark provides a systematic framework for evaluating retrieval models on conversational datasets. Unlike traditional document retrieval, conversation data features multi-turn complexity, implicit signals, and temporal progression that conventional methods fail to capture.
This benchmark consists of:
- 1,583 queries and 9,146 conversations across five key analytical areas
- Comprehensive evaluation framework for conversational retrieval systems
- Benchmark results for 16 leading embedding models
Our research reveals significant performance gaps in current embedding models, with even top performers achieving only moderate success (NDCG@10: 0.5036) and particularly struggling with conversation dynamics.
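Results throughout are reported as NDCG@10, which measures how highly the relevant conversations are ranked among a model's top ten results (1.0 is a perfect ranking). For intuition, a minimal binary-relevance NDCG@k in Python might look like the sketch below; it is illustrative only, not the benchmark's evaluation code.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: 1.0 means all relevant items appear first."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the top hit contributes 1/log2(2) = 1.0
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Toy example: relevant conversations ranked 1st and 3rd out of three retrieved.
print(ndcg_at_k(["c3", "c7", "c1"], {"c3", "c1"}))  # ~0.92
```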
| Analytical Area | Description | Product Insights |
|---|---|---|
| Emotion & Feedback | Identifying users' emotional states and feedback in conversations | Revealing satisfaction patterns and pain points for product improvement |
| Intent & Purpose | Recognizing user intentions and goals | Evaluating alignment between intended and actual AI system usage |
| Conversation Dynamics | Analyzing conversation flow, turn structure and resolution patterns | Identifying conversation bottlenecks and improving dialogue completion rates |
| Trust, Safety & Ethics | Exploring trust-building and ethical issues in conversations | Identifying system reliability concerns and potential safety risks |
| Linguistic Style & Expression | Analyzing language patterns and comprehension challenges | Helping calibrate system language to user comprehension levels |
| General Statistics | |
|---|---|
| Number of conversations | 9,146 |
| Number of queries | 1,583 |
| Avg. messages per conversation | 5.4 |
| Avg. tokens per query | 10.26 |
| Avg. tokens per conversation | 464 |
| Avg. relevant conversations per query | 20.44 |

| Query Task Distribution | |
|---|---|
| Intent & Purpose | 36.1% |
| Emotion & Feedback | 20.1% |
| Linguistic Style & Expression | 15.9% |
| Trust, Safety & Ethics | 14.6% |
| Conversation Dynamics | 13.4% |
```bash
# Create and activate conda environment
conda create -n cdr python=3.12 -y
conda activate cdr

# Install requirements
pip install -r requirements.txt
```

The test data is organized as follows:

```
cdr_benchmark_data/test_dataset/data/test
├── corpus.json
├── queries.json
└── relevant_docs.json
```
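As a quick sanity check before running anything, the three files can be loaded with plain Python. The snippet below assumes the common id-to-entry mapping layout; print an entry to confirm the actual schema:

```python
import json
from pathlib import Path

data_dir = Path("cdr_benchmark_data/test_dataset/data/test")

# Load corpus (conversations), queries, and query-to-relevant-conversation labels.
corpus = json.loads((data_dir / "corpus.json").read_text())
queries = json.loads((data_dir / "queries.json").read_text())
relevant_docs = json.loads((data_dir / "relevant_docs.json").read_text())

print(f"{len(corpus)} conversations, {len(queries)} queries")

# Peek at one query and its labeled relevant conversations
# (assumes dict-style files keyed by query id; adjust if the schema differs).
qid = next(iter(queries))
print(qid, queries[qid])
print(relevant_docs[qid])
```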
Generate embeddings for the test set, then run inference:

```bash
python -m src.data.make_test_embedding --model_name "MODEL_NAME_OR_DIRS" --batch_size 4
python -m src.inference --model_names "MODEL_NAME_OR_DIRS" --batch_size 4
```

Or simply run:
```bash
bash scripts/run_inference.sh
```

All models perform strongly on the Emotion & Feedback and Intent & Purpose categories but struggle significantly with Conversation Dynamics tasks, where even the best models score below 0.17 NDCG@10.
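For intuition about what the inference step computes, here is a minimal, illustrative sketch of embedding-based conversation retrieval. The model name and example texts are placeholders; the benchmark's actual pipeline lives in `src/inference`.

```python
# Illustrative sketch only: not the benchmark's actual retrieval pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; use any model under test

query = "conversations where the user expresses frustration about a refund"
conversations = [
    "user: This is the third time I'm asking for my money back!\nassistant: ...",
    "user: How do I reset my password?\nassistant: ...",
]

# Embed the query and whole conversations into one vector space,
# then rank conversations by cosine similarity to the query.
query_emb = model.encode(query, convert_to_tensor=True)
conv_embs = model.encode(conversations, convert_to_tensor=True)
scores = util.cos_sim(query_emb, conv_embs)[0]

for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. score={scores[idx].item():.3f} :: {conversations[idx][:40]}...")
```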
We utilized multiple open-source dialogue datasets under their respective licenses. Please refer to the LICENSE file and the paper appendix for detailed terms.
If you use this benchmark in your research, please cite our paper:
```bibtex
@inproceedings{cdr-benchmark-2025,
  title={Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval},
  author={Lee, Yohan and Song, Yongwoo and Kim, Sangyeop},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  year={2025}
}
```

