
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval (EMNLP 2025 Industry Track)

l-yohai/CDR-Benchmark

CDR Benchmark: Conversational Data Retrieval

*(Figure: CDR vs. traditional retrieval)*

Overview

This repository provides the official implementation of the paper:

Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval (EMNLP 2025 Industry Track)
Authors: Yohan Lee, Yongwoo Song, Sangyeop Kim

The Conversational Data Retrieval (CDR) Benchmark provides a systematic framework for evaluating retrieval models on conversational datasets. Unlike traditional document retrieval, conversation data features multi-turn complexity, implicit signals, and temporal progression that conventional methods fail to capture.

This benchmark consists of:

  • 1,583 queries and 9,146 conversations across five key analytical areas
  • Comprehensive evaluation framework for conversational retrieval systems
  • Benchmark results for 16 leading embedding models

Our research reveals significant performance gaps in current embedding models, with even top performers achieving only moderate success (NDCG@10: 0.5036) and particularly struggling with conversation dynamics.
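The NDCG@10 figures quoted throughout follow the standard definition of normalized discounted cumulative gain. A minimal plain-Python sketch of that metric (for illustration only, not the repository's own scorer):

```python
import math

def dcg_at_k(relevances, k=10):
    # relevances: graded relevance of retrieved documents, in rank order.
    # Each gain is discounted by log2(rank + 2), so early hits count more.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of an ideal (relevance-descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([1, 1, 0, 1, 0]))  # irrelevant hits at ranks 3 and 5 pull the score below 1.0
```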

Key Features

Five Analytical Areas

| Analytical Area | Description | Product Insights |
|---|---|---|
| Emotion & Feedback | Identifying users' emotional states and feedback in conversations | Revealing satisfaction patterns and pain points for product improvement |
| Intent & Purpose | Recognizing user intentions and goals | Evaluating alignment between intended and actual AI system usage |
| Conversation Dynamics | Analyzing conversation flow, turn structure, and resolution patterns | Identifying conversation bottlenecks and improving dialogue completion rates |
| Trust, Safety & Ethics | Exploring trust-building and ethical issues in conversations | Identifying system reliability concerns and potential safety risks |
| Linguistic Style & Expression | Analyzing language patterns and comprehension challenges | Helping calibrate system language to user comprehension levels |

Benchmark Statistics

General Statistics

| Statistic | Value |
|---|---|
| Number of conversations | 9,146 |
| Number of queries | 1,583 |
| Avg. messages per conversation | 5.4 |
| Avg. tokens per query | 10.26 |
| Avg. tokens per conversation | 464 |
| Avg. relevant conversations per query | 20.44 |

Query Task Distribution

| Task | Share (%) |
|---|---|
| Intent & Purpose | 36.1 |
| Emotion & Feedback | 20.1 |
| Linguistic Style & Expression | 15.9 |
| Trust, Safety & Ethics | 14.6 |
| Conversation Dynamics | 13.4 |

Domain Distribution

*(Figure: domain distribution of conversations)*

Setup

Environment Setup

```bash
# Create and activate conda environment
conda create -n cdr python=3.12 -y
conda activate cdr

# Install requirements
pip install -r requirements.txt
```

Data

```
cdr_benchmark_data/test_dataset/data/test
├── corpus.json
├── queries.json
└── relevant_docs.json
```
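The three files above can be read with a few lines of Python. The key layouts noted in the comments are assumptions for illustration, not documented guarantees of the schema:

```python
import json
from pathlib import Path

DATA_DIR = Path("cdr_benchmark_data/test_dataset/data/test")

def load_split(data_dir=DATA_DIR):
    """Hypothetical loader; field layouts below are assumed, not guaranteed."""
    corpus = json.loads((data_dir / "corpus.json").read_text())         # assumed: id -> conversation
    queries = json.loads((data_dir / "queries.json").read_text())       # assumed: id -> query text
    qrels = json.loads((data_dir / "relevant_docs.json").read_text())   # assumed: query id -> relevant ids
    return corpus, queries, qrels
```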

Usage

1. Create Embeddings

```bash
python -m src.data.make_test_embedding --model_name "MODEL_NAME_OR_DIRS" --batch_size 4
```

2. Run Evaluation

```bash
python -m src.inference --model_names "MODEL_NAME_OR_DIRS" --batch_size 4
```

Or simply run:

```bash
bash scripts/run_inference.sh
```
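Conceptually, the two steps above embed every query and conversation, then rank conversations by similarity to each query. A minimal NumPy sketch, with toy vectors standing in for real model embeddings:

```python
import numpy as np

# Toy stand-ins for the embeddings produced in step 1.
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])                  # 2 queries
conv_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])       # 3 conversations

# L2-normalize rows so cosine similarity reduces to a dot product.
query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
conv_emb = conv_emb / np.linalg.norm(conv_emb, axis=1, keepdims=True)

scores = query_emb @ conv_emb.T          # shape (2, 3): query-vs-conversation similarities
ranking = np.argsort(-scores, axis=1)    # best-matching conversation first
print(ranking[0])                        # conversations ranked for query 0
```

Metrics such as NDCG@10 are then computed by comparing each ranking against the relevant-conversation sets in relevant_docs.json.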

Benchmark Results

Performance by Task Category

*(Figure: NDCG@10 by task category)*

All models perform strongly on Emotion & Feedback and Intent & Purpose categories but struggle significantly with Conversation Dynamics tasks, where even the best models score below 0.17 NDCG@10.

License

We utilized multiple open-source dialogue datasets under their respective licenses. Please refer to the LICENSE file and the paper appendix for detailed terms.

Citation

If you use this benchmark in your research, please cite our paper:

```bibtex
@inproceedings{cdr-benchmark-2025,
  title={Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval},
  author={Lee, Yohan and Song, Yongwoo and Kim, Sangyeop},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  year={2025}
}
```
