
This repository includes the data generation, filtration, model tuning and evaluation code for https://arxiv.org/pdf/2505.16245


text-machine-lab/diverse-not-short


Diverse, not Short:
A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models

EMNLP 2025 Paper · HF Dataset

This repository corresponds to our study on boosting response diversity of LLMs, presented at EMNLP 2025.

Key Findings

[Figure: Increase in diversity.]

  1. When curating preference optimization data to boost diversity, it is crucial to form preference pairs of the same or comparable length.
  2. Our preference data improves response diversity across four creative writing tasks with only 3k high-quality preference pairs.
  3. We propose a diversity metric, Diversity Decile, that provides a length-adjusted view of diversity.
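Finding 1 holds because most lexical diversity metrics are length-sensitive: for example, the type-token ratio (TTR) shrinks as a text grows even when its vocabulary stays fixed, so a longer response can look less diverse purely because of its length. A toy demonstration (illustrative code, not from this repo):

```python
def ttr(text: str) -> float:
    """Type-token ratio: unique words / total words (length-sensitive)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

short = "the fox jumps over a lazy dog"                   # 7 tokens, all unique
long = " ".join(["the fox jumps over a lazy dog"] * 10)   # same vocab, 70 tokens

print(ttr(short))  # 1.0
print(ttr(long))   # 0.1 -- same vocabulary, lower score
```

Comparing responses of comparable length removes this confound, which is exactly what the pairing constraint in finding 1 enforces.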

Important Files

.
├── src/diverse_not_short/
│   ├── bash_scripts/               # Contains step-by-step implementation
│   ├── data_gen.../                # Generation and filtration of pref. pairs
│   ├── ft_scripts/                 # DPO training with TRL trainer
│   ├── evaluation_scripts/
│   │   └── system_prompts.py       # Evaluation prompts for 4 creative tasks
│   └── util_scripts/               # Helpers (evaluation, filtering, etc.)
├── data/
│   ├── raw_data/
│   │   └── SAMPLE_INSTANCE.json    # Example instance & required keys
│   └── decile_map.csv              # Observed decile distribution (≈800k responses)
└── README.md
Usage

  1. Data generation: This step is optional; if you already have a preference dataset, you can skip it. However, make sure the dataset includes length, diversity, and quality measurements. If these measurements are missing, use the evaluate_responses() function to compute them. If you have no preference data at all, you can use bash_scripts/STEP_1_gen_data_from_seq_prompting.sh to generate data and compute all metrics required for filtering. The code is set up to produce a dataset for the story writing task; refer to the prompts here.

  2. Data filtering: In this step you read and filter your preference data. To use our filtering script as-is, make sure your data is in the expected format; refer to the sample data instance (data/raw_data/SAMPLE_INSTANCE.json) for the required keys. If the data has all the required keys, you can directly use bash_scripts/STEP_2_filter_data_to_get_pref_pairs.sh. If your data uses different column names, you can still leverage our filtering function (filter_rule_dns()) to filter it.

  3. Tune the LLM: Our training script is based on the HF TRL DPO trainer script (https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py). You can use bash_scripts/STEP_3_preference_learning.sh or ft_scripts/ft_dpo.py to tune the model on the filtered data.

  4. Generate responses for evaluation: In this step, we load the tuned checkpoint and prompt it to generate responses. The evaluation prompts cover four different creative writing tasks; refer to evaluation_scripts/system_prompts.py. You can use bash_scripts/STEP_4_sample_resp_for_eval.sh to generate responses for evaluating the diversity of any checkpoint.

  5. Evaluate the responses: In this last step, we measure the diversity of the generated responses. We measure response-level diversity (DSI, TTR, MATTR, HD-D, MTLD, MAAS), corpus-level diversity (4-gram lexical diversity, 4-gram syntactic diversity, compression ratio), and quality (reward model scores). For the story writing task, we also calculate which decile each metric value falls into, based on the observed distribution computed over the 800k responses collected in our experiments (prior to filtering). The observed distribution is a collection of decile values for all metrics considered in our study and can be accessed at data/decile_map.csv.
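The measurements step 1 asks for can be illustrated with a minimal stand-in. measure() below is a hypothetical helper, not the repo's evaluate_responses(): it records only token length plus one corpus-style diversity score (a distinct 4-gram ratio) and omits quality scoring.

```python
def measure(text, n=4):
    """Attach a length and a distinct-n-gram diversity score to one response.

    Illustrative stand-in for evaluate_responses(); the real function also
    computes quality and several additional diversity metrics.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    distinct = len(set(ngrams)) / len(ngrams) if ngrams else 0.0
    return {"length": len(tokens), f"distinct_{n}": distinct}

m = measure("once upon a time there was a fox " * 3)
print(m["length"])  # 24 tokens
```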
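Step 2's core idea can be sketched as follows: form preference pairs only between responses of comparable length, preferring the more diverse one. This is an illustrative rule, not the repo's filter_rule_dns(); the field names and thresholds are assumptions.

```python
from itertools import combinations

def make_pairs(responses, max_len_ratio=1.1, min_div_gap=0.05):
    """Form (chosen, rejected) pairs between length-comparable responses,
    choosing the more diverse response of each pair."""
    pairs = []
    for a, b in combinations(responses, 2):
        lo, hi = sorted((a["length"], b["length"]))
        if lo == 0 or hi / lo > max_len_ratio:
            continue  # skip length-mismatched pairs (the key length control)
        chosen, rejected = sorted((a, b), key=lambda r: -r["diversity"])
        if chosen["diversity"] - rejected["diversity"] >= min_div_gap:
            pairs.append((chosen, rejected))
    return pairs

responses = [
    {"text": "story A", "length": 100, "diversity": 0.90},
    {"text": "story B", "length": 105, "diversity": 0.50},
    {"text": "story C", "length": 300, "diversity": 0.95},  # too long to pair
]
```

Here only (story A, story B) survives: story C is more diverse than either, but its length difference disqualifies it.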
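For step 3, TRL's DPOTrainer consumes a dataset with "prompt", "chosen", and "rejected" text columns. A sketch of converting filtered pairs into that format (the schema of the input pairs is an assumption):

```python
def to_dpo_rows(prompt, pairs):
    """Flatten (chosen, rejected) response pairs into TRL-style DPO rows."""
    return [
        {"prompt": prompt, "chosen": c["text"], "rejected": r["text"]}
        for c, r in pairs
    ]

rows = to_dpo_rows(
    "Write a short story about a fox.",
    [({"text": "Once, a fox..."}, {"text": "A fox lived..."})],
)
```

Such rows can then be wrapped with datasets.Dataset.from_list(rows) and handed to DPOTrainer, which is roughly what ft_scripts/ft_dpo.py automates end-to-end.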
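Step 4 needs several stochastic samples per evaluation prompt before diversity can be measured. The loop shape is sketched below; sample_responses() and the generate callable are placeholders, and in practice generate() wraps the tuned checkpoint with sampling (temperature > 0) enabled, while the system prompts come from evaluation_scripts/system_prompts.py.

```python
def sample_responses(generate, system_prompt, user_prompt, n=8):
    """Draw n stochastic samples for one evaluation prompt."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return [generate(messages) for _ in range(n)]

# Stub generator for illustration only.
outs = sample_responses(
    lambda msgs: f"draft about {msgs[-1]['content']}",
    "You are a creative writer.", "a lighthouse", n=3,
)
```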
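The decile lookup in step 5 amounts to placing a metric value among the observed decile boundaries. A sketch using bisect; the boundary values below are made up for illustration, and the actual layout of data/decile_map.csv may differ.

```python
from bisect import bisect_right

def to_decile(value, boundaries):
    """Map a metric value onto deciles 1-10, given the nine internal
    decile boundaries of an observed distribution."""
    return min(bisect_right(boundaries, value) + 1, 10)

# Hypothetical boundaries for one metric; the real ones come from
# data/decile_map.csv (observed over ~800k responses).
bounds = [0.42, 0.48, 0.52, 0.55, 0.58, 0.61, 0.64, 0.68, 0.74]
```

A value below the first boundary lands in decile 1, and anything above the last boundary lands in decile 10.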

Setting up the environment

To replicate our experiments, please run the following commands.

  1. Create a virtual environment and install the package
conda create -n env_diversity python=3.10
conda activate env_diversity
pip install -e .

Citation

If you use our code or findings, please consider citing our work:

@inproceedings{deshpande2025diverse,
  title={Diverse, not Short: A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models},
  author={Deshpande, Vijeta and Ghose, Debasmita and Patterson, John D and Beaty, Roger E and Rumshisky, Anna},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={33905--33926},
  year={2025}
}
