Welcome! ππ»
This is the official repository of the ProsocialDialog dataset, Canary, and Prost from our EMNLP 2022 paper: 
ProsocialDialog: A Prosocial Backbone for Conversational Agents.
Please cite our work if you found the resources in this repository useful:
@inproceedings{kim2022prosocialdialog,
    title={ProsocialDialog: A Prosocial Backbone for Conversational Agents},
    author={Hyunwoo Kim and Youngjae Yu and Liwei Jiang and Ximing Lu and Daniel Khashabi and Gunhee Kim and Yejin Choi and Maarten Sap},
    booktitle={EMNLP},
    year=2022
}We provide two different formats of our dataset:
- HuggingFace: dataframe format
- Direct download: list of dialogues
The only difference is the format; the contents are the same.
You can load our dataset from the HuggingFace hub as the following:
from datasets import load_dataset
dataset = load_dataset("allenai/prosocial-dialog", split='train')
prosocial_df = dataset.to_pandas()π‘ Note, the dataset format of the HuggingFace hub version and the direct download version is different. For the HuggingFace version, please take a look at the dataset preview.
You can also download our dataset directly by clicking this link.
Each split (train/valid/test) contains a list of dialogues. 
A dialogue has the following structure:
[
    {
        "text": ... # utterance 1
        "labels": [
            ... # utterance 2
        ],
        "rots": [
            ...
        ],
        "safety_label": ...,
        "safety_annotations": [
            ...
        ],
        "safety_annotation_reasons": [
            ...
        ],
        "source": ...,
        "etc": ...,
        "episode_done": false
    },
    ...,
    {
        "text": "I don't care what people think about me. I just want to make someone else's life miserable. Maybe I should instead threaten him with this.", # utterance 5
        "labels": [
            "You need to stop this. He might be having medical difficulties with this issue, so please try to help him out instead. It's important to look out for others when they're having trouble." # utterance 6
        ],
        "rots": [
            "It's wrong to threaten someone.",
            "You should look out for others."
        ],
        "safety_label": "__needs_intervention__",
        "safety_annotations": [
            "needs caution",
            "needs intervention",
            "needs intervention"
        ],
        "safety_annotation_reasons": [
            "They don't care how their actions would hurt the other person.",
            "Did not listen to the advice and plans to threaten the classmate.",
            "Now the speaker is thinking of threatening his classmate."
        ],
        "source": "socialchemistry",
        "etc": "reddit/amitheasshole/b26onw",
        "episode_done": true
    }
]
Please see below for a description of each attribute in the dataset:
| attribute | type | description | 
|---|---|---|
| text | str | the potentially unsafe utterance | 
| labels | list of str | the guiding utterance grounded on rules-of-thumb ( rots) | 
| rots | list of str|null | the relevant rules-of-thumb for textnot labeled as __casual__ | 
| safety_label | str | the final verdict of the context according to safety_annotations: {__casual__, __possibly_needs_caution__, __probably_needs_caution__, __needs_caution__, __needs_intervention__} | 
| safety_annotations | list of str | raw annotations from three workers: {casual, needs caution, needs intervention} | 
| safety_annotation_reasons | list of str | the reasons behind the safety annotations in free-form text from each worker | 
| source | str | the source of the seed text that was used to craft the first utterance of the dialogue: {socialchemistry, sbic, ethics_amt, ethics_reddit} | 
| etc | str|null | other information | 
| episode_done | bool | an indicator of whether it is the end of the dialogue | 
You can now directly download our Canary here!
The model will also be automatically downloaded when you create Canary by calling the Canary() class.
Have a look at the demo notebook to see how you can load Canary and use it!
Our code is built on the ParlAI framework. We recommend you create a conda environment as follows
conda env create -f environment.ymland activate it with
conda activate prosocial-dialogWe currently have a new conversation model π§π»βπ COSMO, which is significantly better than Prost, so we gently guide you to use COSMO instead of Prost. COSMO is also trained on ProsocialDialog and it comes with more controllability as you can give specific situation and role related prompts as input.
Please contact Hyunwoo Kim at hyunwook ATTT allenai.org
This repository is MIT licensed. See the LICENSE file for details.
