This repository contains the code and data for the paper "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation", accepted at EMNLP 2023.
Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a <dialogue act, topic> pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accomplishment process. However, there remains an emergent need for high-quality datasets, and building one from scratch requires tremendous human effort. To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, TopDial, which comprises about 18K multi-turn dialogues.
We have uploaded the curated TopDial dataset to OneDrive. Please download it from this OneDrive link.
We use Neo4j as the graph database tool to process the domain knowledge graph in the seed dataset. Please install it by following the official guide. The required Python packages are listed in requirements.txt. Please install them by running:
pip install -r requirements.txt
We use the re-purposed version of the DuRecDial 2.0 dataset as the seed dataset. For convenience of preprocessing, please download it from this OneDrive link.
python data_preprocess.py --seed_dataset_dir ${seed_dataset_dir} --cache_dir ${cache_dir}
Running this script will generate the following files in the specified cache dir:
cache_dialogue_{train|dev|test_seen|test_unseen}.jsonl
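To inspect the cached files, a minimal sketch like the following may help. It assumes only that each file is in JSON Lines format (one JSON object per line), as the `.jsonl` extension suggests. The helpers `load_jsonl` and `cache_file_names` are hypothetical names used for illustration and are not part of this repository; no record schema is assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def cache_file_names():
    """Return the four cache file names, one per data split."""
    splits = ["train", "dev", "test_seen", "test_unseen"]
    return [f"cache_dialogue_{split}.jsonl" for split in splits]
```

For example, joining each name from `cache_file_names()` with your cache dir and passing the result to `load_jsonl` gives the records of that split, which you can then count or print to see which fields the preprocessing step produced.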
# set your OpenAI API key
export OPENAI_API_KEY=""
python -u dialog_simulation.py --cached_seed_path ${cached_seed_path} \
--output_dir ${output_dir} \
--max_interaction_step ${max_interaction_step}
The script can print the instructions and the synthesized conversations to the console as it runs. To suppress this output, set --show_description and --show_message to false.
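Because the simulation step expects `OPENAI_API_KEY` in the environment, it can be useful to confirm that the key is set before starting a long run. The sketch below is one way to do that; `require_api_key` is a hypothetical helper and not part of this repository.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running dialog_simulation.py"
        )
    return key
```

Calling `require_api_key()` at the start of a wrapper script fails fast with a readable message if the variable is missing or empty, so the problem surfaces before any dialogues are generated.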
Our code is partially based on the implementation of ChatArena. We thank the authors for their excellent work.
If you use our data or code in your work, please kindly cite our work as:
@inproceedings{wang-etal-2023-target,
title = "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation",
author = "Wang, Jian and
Cheng, Yi and
Lin, Dongding and
Leong, Chak Tou and
Li, Wenjie",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}