Diverse, not Short:
A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models
This repository corresponds to our study on boosting response diversity of LLMs, presented at EMNLP 2025.
- While curating preference optimization data to boost diversity, it is crucial to form preference pairs of the same or comparable length.
- Our preference data improves response diversity across four creative writing tasks with only 3k high-quality preference pairs.
- We propose a diversity metric, Diversity Decile, that provides a length-adjusted view of diversity.
```
.
├── src/diverse_not_short/
│   ├── bash_scripts/          # Contains step-by-step implementation
│   ├── data_gen.../           # Generation and filtration of pref. pairs
│   ├── ft_scripts/            # DPO training with TRL trainer
│   ├── evaluation_scripts/
│   │   └── system_prompts.py  # Evaluation prompts for 4 creative tasks
│   └── util_scripts/          # Helpers (evaluation, filtering, etc.)
├── data/
│   ├── raw_data/
│   │   └── SAMPLE_INSTANCE.json  # Example instance & required keys
│   └── decile_map.csv            # Observed decile distribution (≈800k responses)
└── README.md
```
- Data generation: This step is optional. If you already have a preference dataset, you can skip it; however, make sure length, diversity, and quality measurements are available for that data. If you don't have the measurements, use the `evaluate_responses()` function to measure length, diversity, and quality. If you don't have preference data, you can use `bash_scripts/STEP_1_gen_data_from_seq_prompting.sh` to generate data and measure all metrics required for filtering. The code is set up to produce a dataset for the story-writing task; refer to the prompts here.
- Data filtering: In this step you'll read and filter your preference data. To use our filtering script as is, make sure your data format is acceptable: refer to the sample data instance (`data/raw_data/SAMPLE_INSTANCE.json`) for the required keys. If your data has all the required keys, you can directly use `bash_scripts/STEP_2_filter_data_to_get_pref_pairs.sh`. If your data has different column names, you can still leverage our filtering function (`filter_rule_dns()`) to filter it.
- Tune the LLM: Our training script is based on the HF/TRL trainer script (https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py). You can use `bash_scripts/STEP_3_preference_learning.sh` or `ft_scripts/ft_dpo.py` to tune the model on the filtered data.
- Generate responses for evaluation: In this step, we task the tuned checkpoint with generating responses. The evaluation prompts cover four different creative writing tasks; refer to `evaluation_scripts/system_prompts.py`. You can use `bash_scripts/STEP_4_sample_resp_for_eval.sh` to generate responses for evaluating the diversity of any checkpoint.
- Evaluate the responses: In this last step, we measure the diversity of the generated responses. We measure response-level diversity (DSI, TTR, MATTR, HD-D, MTLD, MAAS), corpus-level diversity (4-gram lexical diversity, 4-gram syntactic diversity, compression ratio), and quality (reward-model scores). For the story-writing task, we also compute which decile each metric value falls into, based on the distribution observed over the 800k responses collected in our experiments (prior to filtering). This observed distribution is a collection of decile values for all metrics considered in our study and can be accessed at `data/decile_map.csv`.
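The evaluation step above can be illustrated with a minimal sketch: compute one response-level diversity metric (TTR, the type-token ratio) and map it onto a decile of an observed distribution. The decile boundaries below are made-up numbers for illustration only; the real per-metric boundaries come from `data/decile_map.csv`, and `type_token_ratio`/`to_decile` are hypothetical helper names:

```python
import bisect

def type_token_ratio(text: str) -> float:
    """TTR: number of distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def to_decile(value: float, boundaries: list) -> int:
    """Map a metric value to a decile (1-10), given the nine inner decile
    boundaries of the observed distribution, via binary search."""
    return bisect.bisect_right(boundaries, value) + 1

# Hypothetical decile boundaries for TTR (illustration only).
ttr_boundaries = [0.30, 0.38, 0.44, 0.49, 0.54, 0.59, 0.64, 0.70, 0.78]

response = "the quick brown fox jumps over the lazy dog"
ttr = type_token_ratio(response)         # 8 distinct tokens / 9 tokens ≈ 0.889
decile = to_decile(ttr, ttr_boundaries)  # above the top boundary → decile 10
```

Reporting the decile rather than the raw metric value is what gives the length-adjusted, distribution-relative view of diversity described above.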
For replicating our experiments, please run the following commands.

- Creating a virtual environment

```bash
conda create -n env_diversity python=3.10
conda activate env_diversity
pip install -e .
```
If you use our code or findings, please consider citing our work:

```bibtex
@inproceedings{deshpande2025diverse,
  title={Diverse, not Short: A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models},
  author={Deshpande, Vijeta and Ghose, Debasmita and Patterson, John D and Beaty, Roger E and Rumshisky, Anna},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={33905--33926},
  year={2025}
}
```
