- DailyDialog dataset
- Ubuntu corpus
- PersonaChat
- PPL: test perplexity
- BLEU (1-4)
- Embedding-based metrics: Average, Extrema, Greedy
- Distinct-1/2
- PyTorch 1.2+
- Python 3.6.1+
- tqdm
- numpy
- nltk 3.4+
- scipy
- sklearn (optional)
- GoogleNews word2vec or 300-dimensional GloVe embeddings (optional)
- tensorboard (for PyTorch 1.2+)
Three multi-turn open-domain dialogue datasets (DailyDialog, PersonaChat, UbuntuV2)
DailyDialog and PersonaChat can be obtained from this link
UbuntuV2 can be obtained from this link
The preprocessing script process.py for these datasets can be found in the data/ folder.
Each dataset contains 6 files:
- src-train.txt
- tgt-train.txt
- src-dev.txt
- tgt-dev.txt
- src-test.txt
- tgt-test.txt
In all the files, each line contains exactly one dialogue context (src) or one dialogue response (tgt).
More details can be found in the example files. Each sentence must begin with one of the special tokens <user0> or <user1>, which denote the speaker.
The __eou__ token separates the sentences in the conversation context.
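The format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name parse_context is hypothetical, the example line is made up for illustration, and the code assumes that tokens are separated by whitespace and that every utterance starts with its speaker token.

```python
EOU = "__eou__"                      # utterance separator named in the text
SPEAKERS = ("<user0>", "<user1>")    # speaker tokens named in the text


def parse_context(line):
    """Split one src line into a list of (speaker, utterance) pairs.

    Hypothetical helper for illustration; not part of the repository.
    """
    turns = []
    for chunk in line.strip().split(EOU):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. after a trailing __eou__
        # The first whitespace-delimited token is expected to be the speaker.
        speaker, _, text = chunk.partition(" ")
        if speaker not in SPEAKERS:
            raise ValueError(
                f"utterance does not start with a speaker token: {chunk!r}"
            )
        turns.append((speaker, text.strip()))
    return turns


# Made-up example line, only to show the expected shape of the input.
example = "<user0> how are you ? __eou__ <user1> fine , thanks ."
for speaker, utterance in parse_context(example):
    print(speaker, "->", utterance)
```

Raising an error on a missing speaker token is a design choice of this sketch; it makes malformed lines visible early instead of silently producing mislabeled turns.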
- Model names:
PHAED
- Dataset names:
DailyDialog, PersonaChat, Ubuntu
Before running the following commands, make sure the essential folders are created:
mkdir -p processed/$DATASET
mkdir -p data/$DATASET
mkdir -p tblogs/$DATASET
mkdir -p ckpt/$DATASET
The variable DATASET contains the name of the dataset that you want to process.
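For setups where the shell commands are inconvenient (for example, when preparing several datasets from a Python script), the same folders can be created with pathlib. This is a sketch under the assumption that the four folder names listed above are all that is needed; make_dataset_dirs and its root parameter are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Top-level folders taken from the mkdir commands above.
FOLDERS = ("processed", "data", "tblogs", "ckpt")


def make_dataset_dirs(dataset, root="."):
    """Create <root>/<folder>/<dataset> for each folder, like `mkdir -p`.

    Hypothetical helper for illustration; returns the created paths.
    """
    created = []
    for folder in FOLDERS:
        path = Path(root) / folder / dataset
        # parents=True and exist_ok=True together mirror `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for path in make_dataset_dirs("DailyDialog"):
        print(path)
```

Calling the function twice is harmless, since existing folders are left untouched.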
./run.sh vocab <dataset>
# get the vocab of DailyDialog dataset
./run.sh vocab DailyDialog
./run.sh train <dataset> <model> <cuda>
# train the PHAED model with DailyDialog dataset on 0th GPU
./run.sh train DailyDialog PHAED 0
./run.sh translate <dataset> <model> <cuda>
# generate the responses: translate mode, DailyDialog dataset, PHAED model, on 0th GPU
./run.sh translate DailyDialog PHAED 0
./run.sh eval <dataset> <model> <cuda>
# compute the BLEU, Distinct, and embedding-based metrics for the generated responses on 0th GPU
./run.sh eval DailyDialog PHAED 0
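To make the Distinct metric reported by the eval step more concrete, here is a minimal sketch of the common corpus-level definition: the number of unique n-grams divided by the total number of n-grams across all generated responses. The evaluation code in this repository may compute it differently (for example, by averaging per sentence), so treat distinct_n as a hypothetical illustration rather than the script's actual implementation. The two responses are made up.

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over tokenized sentences.

    Hypothetical helper; `sentences` is a list of token lists.
    Returns 0.0 when there are no n-grams at all.
    """
    total = 0
    unique = set()
    for tokens in sentences:
        # Slide a window of size n over the sentence.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Made-up generated responses, tokenized on whitespace.
responses = [
    "i do not know".split(),
    "i do not think so".split(),
]
print("Distinct-1:", distinct_n(responses, 1))
print("Distinct-2:", distinct_n(responses, 2))
```

Higher values indicate more varied output; a model that repeats the same generic response scores close to zero.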
Builds on MultiTurnDialogZoo, embedding_metric, and transformer_xl.