Czech restaurant information dataset for NLG

Authors: Ondřej Dušek, Filip Jurčíček, Josef Dvořák, Petra Grycová, Matěj Hejda, Jana Olivová, Michal Starý, Eva Štichová
License: Creative Commons 4.0 BY-SA
LINDAT release: http://hdl.handle.net/11234/1-2123
Development website: https://github.com/UFAL-DSG/cs_restaurant_dataset

This is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).

It includes input dialogue acts and the corresponding output natural language paraphrases in Czech. Since the dataset is intended for RNN-based NLG systems using delexicalization, inflection tables for all slot values appearing verbatim in the text are provided.

The dataset has been created from the English restaurant set using the following steps:

Deduplicating identical sentences (with different slot DA values ignored)
Localizing restaurant and neighborhood names to Prague (the actual data are random, do not correspond to any real restaurant database, but most of the proper names used included to be inflected in Czech)
Translating the data into Czech
Automatic checks for the presence of slot values
Expanding the translated data to original size by relexicalizing with different slot values + manual checks

Details & Citing

To find out more details and cite the dataset if you use it, use the following paper:

Ondřej Dušek and Filip Jurčíček (2019): Neural Generation for Czech: Data and Baselines. In INLG, Tokyo, Japan.

Dataset format

The NLG data

The dataset is released in CSV and JSON formats ({train,devel,test}.csv, {train,devel,test}.json); the contents are identical. All files use the UTF-8 encoding.

The dataset contains 5192 instances in total. Each instance has the following properties:

da -- the input dialogue act
delex_da -- the input dialogue act, delexicalized
text -- the output text
delex_text -- the output text, delexicalized

The order of the instances is random; the split is roughly 3:1:1, ensuring that the different sections don't share the same DAs (so the generators need to generalize to unseen DAs), but they share as many generic different DA types as possible (e.g., confirm, inform_only_match etc.). DA types that only have a single corresponding DA (e.g., bye()) are included in the training set.

The training, development, and test set contain 3569, 781, and 842 instances, respectively.

Additional morphology data

The JSON file surface_forms.json includes information about morphological inflection forms for all slot values contained within the dataset. The JSON has the following hierarchical structure:

dictionary slot (e.g., name, pricerange) -> all data associated with the slot
- dictionary value (e.g., Café Savoy, Chinese) -> all possible surface forms associated with it
  - list of string values in the format lemma - form - tag (tab-separated, e.g. Karlín\tKarlína\tNNIS2-----A----)

Morphological tags use Hajič's Czech positional tagset. Multi-word values are treated as atomic, i.e., they only receive one morphological tag each, typically belonging to the main word of the phrase (usually a noun).

A _ wildcard is used to replace numeric values.

The domain

The domain is restaurant information in Prague, with random/fictional values. The users may request a specific restaurant, the system may ask for clarification or confirmation.

Dialogue acts format

The dialogue acts in this dataset (context_parse and response_da properties) follow the Alex dialogue act format. Basically, it is a sequence of dialogue act items. Each dialogue act item contains an act type and may also contain a slot and a value.

Examples:

dialogue act	example utterance	English translation
`goodbye()`	Na shledanou.	Goodbye.
`?request(food)`	Na jaké jídlo máte chuť?	What food type would you like?
`inform(area=Smíchov)&inform(name=\"Ananta\")`	Ananta je v oblasti Smíchova.	Ananta is in the area of Smíchov.
`?confirm(good_for_meal=dinner)`	Chcete restauraci na večeři?	Would you like a restaurant for dinner?

The act types used in this dataset are:

inform -- informing about a restaurant or number of restaurants matching criteria
inform_only_match -- informing about the only restaurant matching criteria
inform_no_match -- apology that no match has been found
confirm -- request to confirm a specific search parameter
select -- request to select between two parameter values
request -- request for additional details to complete the search
reqmore -- asking whether the system can be of more help
goodbye -- goodbye

Slots used in this dataset are:

name -- restaurant name
count -- number of restaurants matching criteria
type -- venue type (the only value used here is restaurant)
price_range -- restaurant price range (cheap, moderate, expensive)
price -- the actual meal price (or price range) in Czech Crowns (Kč)
phone -- restaurant phone number
address -- restaurant address (i.e., street and number)
postcode -- restaurant postcode
area -- the neighborhood in which the restaurant is located
near -- a nearby other venue
food -- food type, i.e., cuisine (Chinese, French, etc.)
good_for_meal -- suitability for a particular meal (breakfast, lunch, brunch, dinner)
kids_allowed -- suitability for children

Slot Error Rate evaluation

Additionally we provide an evaluation script for further research purposes. The script measure_slot_error_rate.py computes the Slot Error Rate defined as:

SER = (# missing slots + # additional slots) / # total number of slots

This is done by comparing the slot values in each input dialogue acts with provided dialogue system output. Note that this script was created after publishing the corresponding paper and the scores do not correspond to those in the paper. Most significatly, the script is able to check the kids_allowed slot, which was not handled in the original paper.

Usage

To evaluate your NLG system output output.txt on the test set run the following command:

python measure_slot_error_rate.py --sys_file output.txt surface_forms.json test.csv

See the list of found errors by increasing the verbosity of the script by adding the -vv argument.

For detailed usage information run:

python measure_slot_error_rate.py -h

Acknowledgments

This work was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221 and core research funding, SVV project 260 333, and GAUK grant 2058214 of Charles University in Prague. It used language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
devel		devel
.gitignore		.gitignore
README.md		README.md
devel.csv		devel.csv
devel.json		devel.json
measure_slot_error_rate.py		measure_slot_error_rate.py
surface_forms.json		surface_forms.json
test.csv		test.csv
test.json		test.json
test_measure_slot_error_rate.py		test_measure_slot_error_rate.py
train.csv		train.csv
train.json		train.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Czech restaurant information dataset for NLG

Details & Citing

Dataset format

The NLG data

Additional morphology data

The domain

Dialogue acts format

Slot Error Rate evaluation

Usage

Acknowledgments

About

Releases

Packages

Contributors 2

Languages

UFAL-DSG/cs_restaurant_dataset

Folders and files

Latest commit

History

Repository files navigation

Czech restaurant information dataset for NLG

Details & Citing

Dataset format

The NLG data

Additional morphology data

The domain

Dialogue acts format

Slot Error Rate evaluation

Usage

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages