Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)

iwangjian/TopDial

TopDial

This repository contains the code and data for the paper "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation", accepted at EMNLP 2023.

Overview

Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a <dialogue act, topic> pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accomplishment process. However, there remains a pressing need for high-quality datasets, and building one from scratch requires tremendous human effort. To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, TopDial, which comprises about 18K multi-turn dialogues.

Dataset

The curated TopDial dataset is hosted on OneDrive. Please download it from this OneDrive link.

Dataset Curation

Requirements

We use Neo4j as the graph database tool to process the domain knowledge graph in the seed dataset. Please install it by following the official guide. The required Python packages are listed in requirements.txt. Please install them by running:

pip install -r requirements.txt

Seed Dataset

We use the re-purposed version of the DuRecDial 2.0 dataset as the seed dataset. For convenience of preprocessing, please download it from this OneDrive link.

Step 1: Preprocessing the seed dataset

python data_preprocess.py --seed_dataset_dir ${seed_dataset_dir} --cache_dir ${cache_dir}

Running this script will generate the following files in the specified cache dir: cache_dialogue_{train|dev|test_seen|test_unseen}.jsonl
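As a quick sanity check after preprocessing, the generated cache files can be inspected with a few lines of Python. The sketch below only assumes that each file is in JSON Lines format (one JSON object per line), as the .jsonl extension suggests. The helper names load_jsonl and summarize_cache are hypothetical and not part of this repository, and no record fields are assumed.

```python
import json
from pathlib import Path

# Split names taken from the cache file pattern above.
SPLITS = ("train", "dev", "test_seen", "test_unseen")


def load_jsonl(path):
    """Read a JSON Lines file into a list of parsed records, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_cache(cache_dir):
    """Return {split: number of records} for every cache file found in cache_dir."""
    counts = {}
    for split in SPLITS:
        path = Path(cache_dir) / f"cache_dialogue_{split}.jsonl"
        if path.exists():  # splits whose file is missing are simply left out
            counts[split] = len(load_jsonl(path))
    return counts
```

Calling summarize_cache on the cache dir used in Step 1 gives a per-split record count, which makes it easy to spot a split that was not generated.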

Step 2: Dataset curation

# set your OpenAI API key
export OPENAI_API_KEY=""

python -u dialog_simulation.py --cached_seed_path ${cached_seed_path} \
    --output_dir ${output_dir} \
    --max_interaction_step ${max_interaction_step}

By default, the script prints the instructions and the synthesized conversations to the console. To suppress this output, set --show_description and --show_message to false.
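When running the simulation over several cached splits, it can help to assemble the Step 2 command programmatically. The following is a minimal sketch, not part of this repository: build_simulation_cmd is a hypothetical helper, the example paths and the step value 8 are placeholders, and passing the literal value "false" to the two display flags is an assumption about how the script parses them. Check the argument parser in dialog_simulation.py before relying on it.

```python
import shlex
import sys


def build_simulation_cmd(cached_seed_path, output_dir, max_interaction_step, quiet=False):
    """Build the argument list for dialog_simulation.py (hypothetical helper)."""
    cmd = [
        sys.executable, "-u", "dialog_simulation.py",
        "--cached_seed_path", str(cached_seed_path),
        "--output_dir", str(output_dir),
        "--max_interaction_step", str(max_interaction_step),
    ]
    if quiet:
        # Assumption: both flags accept the literal string "false".
        cmd += ["--show_description", "false", "--show_message", "false"]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and step count, for illustration only.
    cmd = build_simulation_cmd(
        "cache/cache_dialogue_train.jsonl", "output", 8, quiet=True
    )
    print(shlex.join(cmd))  # inspect the command before running it
```

The returned list can be passed to subprocess.run(cmd, check=True), provided OPENAI_API_KEY is already exported in the environment.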

Acknowledgement

Our code is partially based on the implementation of ChatArena. We thank the authors for their excellent work.

Citation

If you use our data or code in your work, please cite our paper as:

@inproceedings{wang-etal-2023-target,
    title = "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation",
    author = "Wang, Jian  and
      Cheng, Yi  and
      Lin, Dongding  and
      Leong, Chak Tou and
      Li, Wenjie",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}
