
This repository includes the data generation, filtration, model tuning and evaluation code for https://arxiv.org/pdf/2505.16245


text-machine-lab/diverse-not-short


Diverse, not Short:
A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models

EMNLP 2025 Paper · HF Dataset

This repository corresponds to our study on boosting response diversity of LLMs, presented at EMNLP 2025.

Key Findings

[Figure: Increase in diversity.]

  1. When curating preference optimization data to boost diversity, it is crucial to form preference pairs of the same or comparable length.
  2. Our preference data improves response diversity across four creative writing tasks with only 3k high-quality preference pairs.
  3. We propose a diversity metric, Diversity Decile, that provides a length-adjusted view of diversity.
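Finding 1 holds because most lexical diversity metrics are length-sensitive: for example, the type-token ratio (TTR) shrinks as a text grows even when its vocabulary stays fixed, so a longer response can look less diverse purely because of its length. A toy demonstration (illustrative code, not from this repo):

```python
def ttr(text: str) -> float:
    """Type-token ratio: unique words / total words (length-sensitive)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

short = "the fox jumps over a lazy dog"                   # 7 tokens, all unique
long = " ".join(["the fox jumps over a lazy dog"] * 10)   # same vocab, 70 tokens

print(ttr(short))  # 1.0
print(ttr(long))   # 0.1 -- same vocabulary, lower score
```

Comparing responses of comparable length removes this confound, which is exactly what the pairing constraint in finding 1 enforces.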

Important Files

.
├── src/diverse_not_short/
│   ├── bash_scripts/               # Contains step-by-step implementation
│   ├── data_gen.../                # Generation and filtration of pref. pairs
│   ├── ft_scripts/                 # DPO training with TRL trainer
│   ├── evaluation_scripts/
│   │   └── system_prompts.py       # Evaluation prompts for 4 creative tasks
│   └── util_scripts/               # Helpers (evaluation, filtering, etc.)
├── data/
│   ├── raw_data/
│   │   └── SAMPLE_INSTANCE.json    # Example instance & required keys
│   └── decile_map.csv              # Observed decile distribution (≈800k responses)
└── README.md
Usage

  1. Data generation: This step is optional; if you already have a preference dataset, you can skip it. However, make sure the dataset includes length, diversity, and quality measurements. If these measurements are missing, use the evaluate_responses() function to compute them. If you have no preference data at all, you can use bash_scripts/STEP_1_gen_data_from_seq_prompting.sh to generate data and compute all metrics required for filtering. The code is set up to produce a dataset for the story writing task; refer to the prompts here.

  2. Data filtering: In this step you read and filter your preference data. To use our filtering script as-is, make sure your data is in the expected format; refer to the sample data instance (data/raw_data/SAMPLE_INSTANCE.json) for the required keys. If the data has all the required keys, you can directly use bash_scripts/STEP_2_filter_data_to_get_pref_pairs.sh. If your data uses different column names, you can still leverage our filtering function (filter_rule_dns()) to filter it.

  3. Tune the LLM: Our training script is based on the HF TRL DPO trainer script (https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py). You can use bash_scripts/STEP_3_preference_learning.sh or ft_scripts/ft_dpo.py to tune the model on the filtered data.

  4. Generate responses for evaluation: In this step, we load the tuned checkpoint and prompt it to generate responses. The evaluation prompts cover four different creative writing tasks; refer to evaluation_scripts/system_prompts.py. You can use bash_scripts/STEP_4_sample_resp_for_eval.sh to generate responses for evaluating the diversity of any checkpoint.

  5. Evaluate the responses: In this last step, we measure the diversity of the generated responses. We measure response-level diversity (DSI, TTR, MATTR, HD-D, MTLD, MAAS), corpus-level diversity (4-gram lexical diversity, 4-gram syntactic diversity, compression ratio), and quality (reward model scores). For the story writing task, we also calculate which decile each metric value falls into, based on the observed distribution computed over the 800k responses collected in our experiments (prior to filtering). The observed distribution is a collection of decile values for all metrics considered in our study and can be accessed at data/decile_map.csv.
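The measurements step 1 asks for can be illustrated with a minimal stand-in. measure() below is a hypothetical helper, not the repo's evaluate_responses(): it records only token length plus one corpus-style diversity score (a distinct 4-gram ratio) and omits quality scoring.

```python
def measure(text, n=4):
    """Attach a length and a distinct-n-gram diversity score to one response.

    Illustrative stand-in for evaluate_responses(); the real function also
    computes quality and several additional diversity metrics.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    distinct = len(set(ngrams)) / len(ngrams) if ngrams else 0.0
    return {"length": len(tokens), f"distinct_{n}": distinct}

m = measure("once upon a time there was a fox " * 3)
print(m["length"])  # 24 tokens
```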
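Step 2's core idea can be sketched as follows: form preference pairs only between responses of comparable length, preferring the more diverse one. This is an illustrative rule, not the repo's filter_rule_dns(); the field names and thresholds are assumptions.

```python
from itertools import combinations

def make_pairs(responses, max_len_ratio=1.1, min_div_gap=0.05):
    """Form (chosen, rejected) pairs between length-comparable responses,
    choosing the more diverse response of each pair."""
    pairs = []
    for a, b in combinations(responses, 2):
        lo, hi = sorted((a["length"], b["length"]))
        if lo == 0 or hi / lo > max_len_ratio:
            continue  # skip length-mismatched pairs (the key length control)
        chosen, rejected = sorted((a, b), key=lambda r: -r["diversity"])
        if chosen["diversity"] - rejected["diversity"] >= min_div_gap:
            pairs.append((chosen, rejected))
    return pairs

responses = [
    {"text": "story A", "length": 100, "diversity": 0.90},
    {"text": "story B", "length": 105, "diversity": 0.50},
    {"text": "story C", "length": 300, "diversity": 0.95},  # too long to pair
]
```

Here only (story A, story B) survives: story C is more diverse than either, but its length difference disqualifies it.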
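For step 3, TRL's DPOTrainer consumes a dataset with "prompt", "chosen", and "rejected" text columns. A sketch of converting filtered pairs into that format (the schema of the input pairs is an assumption):

```python
def to_dpo_rows(prompt, pairs):
    """Flatten (chosen, rejected) response pairs into TRL-style DPO rows."""
    return [
        {"prompt": prompt, "chosen": c["text"], "rejected": r["text"]}
        for c, r in pairs
    ]

rows = to_dpo_rows(
    "Write a short story about a fox.",
    [({"text": "Once, a fox..."}, {"text": "A fox lived..."})],
)
```

Such rows can then be wrapped with datasets.Dataset.from_list(rows) and handed to DPOTrainer, which is roughly what ft_scripts/ft_dpo.py automates end-to-end.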
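Step 4 needs several stochastic samples per evaluation prompt before diversity can be measured. The loop shape is sketched below; sample_responses() and the generate callable are placeholders, and in practice generate() wraps the tuned checkpoint with sampling (temperature > 0) enabled, while the system prompts come from evaluation_scripts/system_prompts.py.

```python
def sample_responses(generate, system_prompt, user_prompt, n=8):
    """Draw n stochastic samples for one evaluation prompt."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return [generate(messages) for _ in range(n)]

# Stub generator for illustration only.
outs = sample_responses(
    lambda msgs: f"draft about {msgs[-1]['content']}",
    "You are a creative writer.", "a lighthouse", n=3,
)
```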
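The decile lookup in step 5 amounts to placing a metric value among the observed decile boundaries. A sketch using bisect; the boundary values below are made up for illustration, and the actual layout of data/decile_map.csv may differ.

```python
from bisect import bisect_right

def to_decile(value, boundaries):
    """Map a metric value onto deciles 1-10, given the nine internal
    decile boundaries of an observed distribution."""
    return min(bisect_right(boundaries, value) + 1, 10)

# Hypothetical boundaries for one metric; the real ones come from
# data/decile_map.csv (observed over ~800k responses).
bounds = [0.42, 0.48, 0.52, 0.55, 0.58, 0.61, 0.64, 0.68, 0.74]
```

A value below the first boundary lands in decile 1, and anything above the last boundary lands in decile 10.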

Setting up the environment

To replicate our experiments, please run the following commands.

  1. Create a virtual environment and install the package
conda create -n env_diversity python=3.10
conda activate env_diversity
pip install -e .

Citation

If you use our code or findings, please consider citing our work:

@inproceedings{deshpande2025diverse,
  title={Diverse, not Short: A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models},
  author={Deshpande, Vijeta and Ghose, Debasmita and Patterson, John D and Beaty, Roger E and Rumshisky, Anna},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={33905--33926},
  year={2025}
}
