ReaSCAN is a synthetic navigation task that requires models to reason about their surroundings while interpreting syntactically complex commands.
- 11/28/2021: We release a newer version of the non-generalization test sets for different command patterns as ReaSCAN-v1.1.zip.
- 07/29/2021: Our paper is accepted to NeurIPS 2021 (OpenReview).
- 06/17/2021: We update model performance results after fixing known issues, and include more compositional splits.
- 06/07/2021: We submit our preprint to NeurIPS 2021.
- Citation
- Example
- Dataset
- Data format
- ReaSCAN as an Abstract Reasoning Challenge
- Dataset Artifacts
- Models
- Other files
- License
Zhengxuan Wu, Elisa Kreiss, Desmond C. Ong, and Christopher Potts. 2021. ReaSCAN: Compositional Reasoning in Language Grounding. NeurIPS 2021 Datasets and Benchmarks Track.
@article{wu-etal-2021-reascan,
title={Rea{SCAN}: Compositional Reasoning in Language Grounding},
author={Wu, Zhengxuan and Kreiss, Elisa and Ong, Desmond C. and Potts, Christopher},
journal={NeurIPS 2021 Datasets and Benchmarks Track},
url={https://openreview.net/forum?id=Rtquf4Jk0jN},
year={2021}}
Four command-world pairs for different command patterns. Our simple command pattern is equivalent to gSCAN. RD means distractors are randomly sampled. Referent targets are shaded in red, distractors are shaded in blue, and both are highlighted with green dashed lines.
We generated ReaSCAN using our pipeline with fixed random seeds, so you can reproduce the version of ReaSCAN we use in the paper by running the pipeline. Additionally, we upload the version we use to an online folder where you can download it directly and use it as-is. Note that the dataset files are quite large; downloading them may take a while.
Our generated data is in ReaSCAN-v1.1.zip (note that we updated our files on 06/16/2021 to hotfix some existing issues, and included newer non-generalization test sets on 11/28/2021), which is saved in a shared drive. The dataset consists of subsets generated for different command patterns (P1: Simple (similar to gSCAN), P2: 1-relative-clause, P3: 2-relative-clauses, P4: 3-relative-clauses) and different compositional splits (see our paper for details about each split).
Random splits that can be used for training your models,
- ReaSCAN-compositional: ReaSCAN all commands, containing train, dev and test sets.
- ReaSCAN-compositional-p1: ReaSCAN Simple set, containing train, dev and test sets.
- ReaSCAN-compositional-p2: ReaSCAN 1-relative-clause set, containing train, dev and test sets.
- ReaSCAN-compositional-p3: ReaSCAN 2-relative-clauses set, containing train, dev and test sets.
- ReaSCAN-compositional-p1-test: ReaSCAN Simple set, containing test set only. Model performance is reported in the paper.
- ReaSCAN-compositional-p2-test: ReaSCAN 1-relative-clause set, containing test set only. Model performance is reported in the paper.
- ReaSCAN-compositional-p3-test: ReaSCAN 2-relative-clauses set, containing test set only. Model performance is reported in the paper.
- ReaSCAN-compositional-p1-test-updated: UPDATED ReaSCAN Simple set, containing test set only. Model performance is NOT reported in the paper.
- ReaSCAN-compositional-p2-test-updated: UPDATED ReaSCAN 1-relative-clause set, containing test set only. Model performance is NOT reported in the paper.
- ReaSCAN-compositional-p3-test-updated: UPDATED ReaSCAN 2-relative-clauses set, containing test set only. Model performance is NOT reported in the paper.
- ReaSCAN-compositional-p3-rd: ReaSCAN 2-relative-clauses set with random distractors, containing train, dev and test sets.
Compositional splits that are designed to be zero-shot testing splits,
- ReaSCAN-compositional-a1: ReaSCAN A1 (novel color modifier) compositional split, containing test set only.
- ReaSCAN-compositional-a2: ReaSCAN A2 (novel color attribute) compositional split, containing test set only.
- ReaSCAN-compositional-a3: ReaSCAN A3 (novel size modifier) compositional split, containing test set only.
- ReaSCAN-compositional-b1: ReaSCAN B1 (novel co-occurrence of objects) compositional split, containing test set only.
- ReaSCAN-compositional-b2: ReaSCAN B2 (novel co-occurrence of relations) compositional split, containing test set only.
- ReaSCAN-compositional-c1: ReaSCAN C1 (novel conjunctive clause length) compositional split, containing test set only.
- ReaSCAN-compositional-c2: ReaSCAN C2 (novel relative clauses) compositional split, containing test set only.
You can also generate your own compositional splits by modifying a couple of lines in code/dataset/generate_ReaSCAN_splits.ipynb.
Table 3 in our paper includes test performance on the non-generalization test sets (the top 4 rows in the table). As raised in this PR, those sets were later found to overestimate model performance, as they may include examples identical to ones in the training set. You can find detailed analyses here. We thus update the dataset, and you can now download it as ReaSCAN-v1.1.zip. We also report model performance on these updated non-generalization test sets as follows:
Compositional Splits | Command-World Pairs | M-LSTM | GCN-LSTM |
---|---|---|---|
UPDATED Simple (Test) | 907 | 93.83 (0.76) | 99.38 (0.13) |
UPDATED 1-relative-clause (Test) | 2122 | 75.59 (2.29) | 97.71 (0.56) |
UPDATED 2-relative-clauses (Test) | 2724 | 67.16 (2.50) | 95.87 (0.40) |
UPDATED All (Test) | 5753 | 74.47 (1.71) | 97.10 (0.38) |
CAVEATS: When comparing your model with the baselines, please pay attention to which sets you are using. If you use the old sets, compare against the numbers from the paper; if you use the updated sets, compare against the updated numbers above.
You can recreate ReaSCAN using the provided scripts as well. Since generating the full-fledged dataset can take a long time, you can use our multi-process generator, which can generate any subset included in our paper within 20 minutes using 50 processes. Here is some example code we used to generate the 2-relative-clauses dataset. For the exact scripts we used to generate the dataset in the paper, refer to code/experiments.sh.
Single process generation,
cd code/dataset
python generate_ReaSCAN.py \
--mode train \
--n_command_struct 100 \
--date 2021-05-30 \
--grid_size 6 \
--n_object_max 13 \
--per_command_world_retry_max 500 \
--per_command_world_target_count 3 \
--output_dir ./ReaSCAN-compositional-demo/ \
--include_relation_distractor \
--include_attribute_distractor \
--include_isomorphism_distractor \
--include_random_distractor \
--full_relation_probability 1.0 \
--command_pattern p3 \
--save_interal 200
Multi-process generation,
cd code/dataset
python generate_ReaSCAN_batch.py
Note that you need to go into the file and modify some variables to generate the dataset you want (see the sketch below for one way to script this). After generating the datasets, if you want to create your own splits, follow the dataset split helpers provided in code/dataset/generate_ReaSCAN_splits.ipynb.
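If you prefer not to edit generate_ReaSCAN_batch.py, the sketch below shows one way to launch several single-process runs in parallel with subprocess. The flags mirror the single-process example above; the list of modes and the per-mode output directories are assumptions, not the repo's defaults.

```python
# Minimal sketch: launch one generation process per mode in parallel.
# Flags mirror the single-process example above; the mode names are assumptions.
import subprocess

processes = []
for mode in ["train", "dev", "test"]:
    cmd = [
        "python", "generate_ReaSCAN.py",
        "--mode", mode,
        "--grid_size", "6",
        "--n_object_max", "13",
        "--command_pattern", "p3",
        "--output_dir", f"./ReaSCAN-compositional-demo-{mode}/",
    ]
    processes.append(subprocess.Popen(cmd))

for p in processes:
    p.wait()
```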
Once you generate the dataset .txt file (in json format), you can simply load any dataset as,
import json

path_to_data = "data-compositional-splits.txt"
print(f"Reading dataset from file: {path_to_data}...")
data_json = json.load(open(path_to_data, "r"))
print(data_json["examples"].keys())
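As a quick sanity check, you can count how many examples each split contains. This is a minimal sketch; it assumes data_json["examples"] maps split names (e.g., "train", "dev", "test") to lists of example dictionaries, the layout used by gSCAN-style dataset files.

```python
# Minimal sketch: count examples per split.
# Assumes data_json["examples"] maps split names to lists of example dicts.
for split_name, split_examples in data_json["examples"].items():
    print(f"{split_name}: {len(split_examples)} examples")
```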
We keep our format the same as gSCAN. For each example, we provide the command and the world representation. Additionally, we provide ReaSCAN-specific metadata,
The first data example in the ReaSCAN-compositional-p3-test split. Click to open/close.
{
"command": "push,the,big,green,object,that,is,inside,of,a,red,box,and,in,the,same,row,as,a,blue,cylinder",
"grammer_pattern": "$OBJ_0 ^ $OBJ_1 & $OBJ_2",
"meaning": "push,the,big,green,object,that,is,inside,of,a,red,box,and,in,the,same,row,as,a,blue,cylinder",
"derivation": "$OBJ_0 ^ $OBJ_1 & $OBJ_2",
"situation": {
"grid_size": 6,
"agent_position": {
"row": "5",
"column": "3"
},
"agent_direction": 0,
"target_object": {
"vector": "000101000010",
"position": {
"row": "3",
"column": "1"
},
"object": {
"shape": "cylinder",
"color": "green",
"size": "4"
}
},
"distance_to_target": "4",
"direction_to_target": "nw",
"placed_objects": {
"0": {
"vector": "000101000010",
"position": {
"row": "3",
"column": "1"
},
"object": {
"shape": "cylinder",
"color": "green",
"size": "4"
}
},
"1": {
"vector": "001000011000",
"position": {
"row": "2",
"column": "0"
},
"object": {
"shape": "box",
"color": "red",
"size": "3"
}
},
"2": {
"vector": "001001000100",
"position": {
"row": "3",
"column": "0"
},
"object": {
"shape": "cylinder",
"color": "blue",
"size": "3"
}
},
"3": {
"vector": "000110000010",
"position": {
"row": "0",
"column": "4"
},
"object": {
"shape": "circle",
"color": "green",
"size": "4"
}
},
"4": {
"vector": "001001000100",
"position": {
"row": "0",
"column": "0"
},
"object": {
"shape": "cylinder",
"color": "blue",
"size": "3"
}
},
"5": {
"vector": "000101000010",
"position": {
"row": "2",
"column": "3"
},
"object": {
"shape": "cylinder",
"color": "green",
"size": "4"
}
},
"6": {
"vector": "001000011000",
"position": {
"row": "1",
"column": "1"
},
"object": {
"shape": "box",
"color": "red",
"size": "3"
}
},
"7": {
"vector": "100010000010",
"position": {
"row": "4",
"column": "4"
},
"object": {
"shape": "circle",
"color": "green",
"size": "1"
}
},
"8": {
"vector": "001001001000",
"position": {
"row": "5",
"column": "5"
},
"object": {
"shape": "cylinder",
"color": "red",
"size": "3"
}
},
"9": {
"vector": "100010000001",
"position": {
"row": "3",
"column": "4"
},
"object": {
"shape": "circle",
"color": "yellow",
"size": "1"
}
},
"10": {
"vector": "010000100100",
"position": {
"row": "3",
"column": "5"
},
"object": {
"shape": "square",
"color": "blue",
"size": "2"
}
},
"11": {
"vector": "000110000100",
"position": {
"row": "1",
"column": "0"
},
"object": {
"shape": "circle",
"color": "blue",
"size": "4"
}
},
"12": {
"vector": "000101001000",
"position": {
"row": "2",
"column": "5"
},
"object": {
"shape": "cylinder",
"color": "red",
"size": "4"
}
}
},
"carrying_object": null
},
"target_commands": "turn left,turn left,walk,walk,turn right,walk,walk,push,push,push,push,push,push",
"verb_in_command": "push",
"adverb_in_command": "",
"referred_target": "big green object",
"object_pattern_map": {
"$OBJ_0": "$SIZE $COLOR $ABS_SHAPE",
"$OBJ_1": "$COLOR $SHAPE",
"$OBJ_2": "$COLOR $SHAPE"
},
"relation_map": [
[
[
"$OBJ_0",
"$OBJ_1"
],
"$IS_INSIDE"
],
[
[
"$OBJ_0",
"$OBJ_2"
],
"$SAME_ROW"
]
],
"object_expression": {
"$OBJ_0": "big green object",
"$OBJ_1": "red box",
"$OBJ_2": "blue cylinder"
},
"n_object": 13,
"n_distractor": 10,
"full_relation_distractor": true,
"has_relation_distractor": true,
"has_attribute_distractor": false,
"has_isomorphism_distractor": false,
"has_random_distractor": true,
"n_random_distractor": 5,
"relation_distractor_metadata": [
{
"distractor_metadata": {
"edge": [
"$OBJ_0",
"$OBJ_1"
],
"relation_old_type": "$IS_INSIDE",
"full_set": true
}
},
{
"distractor_metadata": {
"edge": [
"$OBJ_0",
"$OBJ_2"
],
"relation_old_type": "$SAME_ROW",
"full_set": true
}
}
],
"attribute_distractor_metadata": [
{
"distractor_metadata": [
{
"modified_obj": null,
"modified_attribute": null
}
]
}
],
"isomorphism_distractor_metadata": [],
"random_distractor_metadata": [
{
"$OBJ_8": " red cylinder",
"$OBJ_9": " yellow circle",
"$OBJ_10": " blue square",
"$OBJ_11": " blue circle",
"$OBJ_12": " red cylinder"
}
]
}
This is one example from this dataset. It contains the "command", or input instruction, 'push,the,big,green,object,that,is,inside,of,a,red,box,and,in,the,same,row,as,a,blue,cylinder' (tokens separated by ,), which for the specified world state (i.e., "situation") maps to the "target_commands": "turn left,turn left,walk,walk,turn right,walk,walk,push,push,push,push,push,push". The example contains the situation representation, or world state, at the key "situation", and also contains additional information used in generating the world, for example how the distractors were constructed (e.g., the fields in relation_distractor_metadata).
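For illustration, here is a minimal sketch of pulling these fields out of a loaded split; the "test" split name is an assumption and depends on which file you loaded.

```python
# Minimal sketch: read the core fields of one example.
# The "test" split name is an assumption; use whichever split your file contains.
example = data_json["examples"]["test"][0]

command = example["command"].split(",")             # input instruction tokens
actions = example["target_commands"].split(",")     # output action sequence
situation = example["situation"]                    # world state (grid, agent, objects)
metadata = example["relation_distractor_metadata"]  # how relation distractors were built

print(" ".join(command))
print(actions)
```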
To be more compatible with other models, we also provide a translation script that translates each example into a compressed dictionary containing all the information needed to train a neural model (i.e., input: a command sequence plus a tensor representation of the shape world; output: an action sequence). To convert, you can refer to the following script,
cd code/models/gSCAN_with_language_conditioned_embedding
jupyter notebook
# open this file: read_reascan.ipynb
Following the steps in this notebook, each example will be translated into a data structure like,
Compact version of ReaSCAN that is ready to use by any neural model. Click to open/close.
{"input": ["walk", "to", "the", "big", "blue", "circle", "that", "is", "in", "the", "same", "column", "as", "a", "big", "blue", "cylinder", "and", "in", "the", "same", "row", "as", "a", "red", "square", "hesitantly"], "target": ["walk", "stay", "walk", "stay", "walk", "stay", "turn left", "walk", "stay"], "situation": [[[0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]], [[0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], [[0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]], [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], [[0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]}
Note that the situation is a tensor representation of the shape world. Each sub-list represents one cell in the world and encodes which object is at which position, based on the following information,
"""
Each grid cell in a situation is fully specified by a vector:
[_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _]
1 2 3 4 r g b circle square cylinder box agent E S W N
_______ _____ ______________________ _____ _______
size color shape agent agent dir.
:param situation_representation: data from dataset.txt at key "situation".
:param grid_size: int determining row/column number.
:return: grid to be parsed by computational models.
"""
If there are overlapping objects in a single cell, we add their vectors together. This only happens for an object that is inside a box and placed at the box's upper-left corner. There are many other ways to represent this situation, but we take the simplest approach.
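To make the cell encoding concrete, here is a minimal sketch (not the repo's read_reascan.ipynb code) that builds such a grid tensor directly from the raw "situation" dictionary shown earlier. The 12-dimensional object vectors are taken verbatim from the placed_objects entries; the agent-marker and direction channel indices, and the resulting 17 channels, are assumptions that happen to match the compact example above. Refer to read_reascan.ipynb for the exact layout the baselines use.

```python
import numpy as np

OBJECT_DIMS = 12                      # length of the binary "vector" strings in placed_objects
NUM_CHANNELS = OBJECT_DIMS + 1 + 4    # + agent marker + one-hot facing direction (assumption)

def situation_to_grid(situation):
    """Sketch: turn a raw "situation" dict into a grid_size x grid_size x channel tensor."""
    grid_size = int(situation["grid_size"])
    grid = np.zeros((grid_size, grid_size, NUM_CHANNELS), dtype=int)
    for obj in situation["placed_objects"].values():
        row = int(obj["position"]["row"])
        col = int(obj["position"]["column"])
        vec = [int(bit) for bit in obj["vector"]]
        # Overlapping objects (an object sitting in a box's upper-left corner) are summed.
        grid[row, col, :OBJECT_DIMS] += vec
    a_row = int(situation["agent_position"]["row"])
    a_col = int(situation["agent_position"]["column"])
    grid[a_row, a_col, OBJECT_DIMS] = 1                                           # agent marker
    grid[a_row, a_col, OBJECT_DIMS + 1 + int(situation["agent_direction"])] = 1   # facing direction
    return grid

# Example usage with the example loaded earlier:
# grid = situation_to_grid(example["situation"])
# print(grid.shape)  # (6, 6, 17)
```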
Two simplified abstract reasoning challenges with ReaSCAN. The task mimics human reasoning tests: given a set of input-output pairs (input on the left, output on the right), the task taker needs to guess the output for the last input. For each task, we provide one potential abstract reasoning strategy that solves it.
You can generate such tasks using the script provided in code/dataset/future-looking-demo.ipynb.
ReaSCAN is not perfect. In fact, we document a list of artifacts in our paper; please see Appendix B for details and read it before you use ReaSCAN. Here is a short summary of that section in bullet points:
- Non-comprehensive Linguistic Structures: Commands from ReaSCAN follow a specific linguistic template and are non-comprehensive in covering all linguistic structures.
- Non-comprehensive Distractors: ReaSCAN is not able to cover all possible distractors to make sure every part of the command is necessary to resolve the referring expression.
- Shapes and Relations Biases: The frequency distributions of shapes and relations may be biased due to the generation program.
- Self-exclusiveness: We assume every object mention in the command matches a unique object in the world.
- Other Induced Artifacts: We also discuss frequency distributions of verbs, adverbs, agent facing directions, agent-target relative directions, etc.
We use two existing models and adapt their code to benchmark ReaSCAN. Both models were published and evaluated on gSCAN. Other than hyperparameter tuning, we do not change the model architectures.
This model is published with gSCAN in this paper, with code in this repo. You can refer to their repo for details about the model. Here, we have already adapted the interface changes needed to run with ReaSCAN, so you can simply run training with the following lines,
cd code/models/seq2seq
CUDA_VISIBLE_DEVICES=0 python run_reascan.py \
--mode=train \
--max_decoding_steps=120 \
--max_testing_examples=2000 \
--data_directory=ReaSCAN-compositional-p1 \
--input_vocab_path=input_vocabulary.txt \
--target_vocab_path=target_vocabulary.txt \
--attention_type=bahdanau \
--no_auxiliary_task \
--conditional_attention \
--output_directory=./training_logs/p1-random-seed-44 \
--training_batch_size=2000 \
--max_training_iterations=200000 \
--seed=44
Note that this requires you to generate the vocabulary files beforehand to save time. You can do so by following the scripts provided in the notebook ReaSCAN-vocab-generator.ipynb in the same folder.
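If you just want to inspect what the vocabularies will contain, the sketch below collects the input and target tokens from a loaded split. It does NOT produce the input_vocabulary.txt / target_vocabulary.txt files in the format run_reascan.py expects (use ReaSCAN-vocab-generator.ipynb for that), and the dataset path shown is an assumption.

```python
import json

# Minimal sketch: inspect the input/target token inventories of a split.
# The path below is an assumption; point it at your own split's data file.
data_json = json.load(open("ReaSCAN-compositional-p1/data-compositional-splits.txt", "r"))

input_tokens, target_tokens = set(), set()
for example in data_json["examples"]["train"]:
    input_tokens.update(example["command"].split(","))
    target_tokens.update(example["target_commands"].split(","))

print(sorted(input_tokens))
print(sorted(target_tokens))
```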
To evaluate this model, you need to run the evaluation script and generate all predictions. Note that we follow the original repo, and you can refer to their code for your own implementation. This is the script we run,
cd code/models/seq2seq
CUDA_VISIBLE_DEVICES=0 python run_reascan.py \
--mode=test \
--data_directory=../../../data-files-updated/ReaSCAN-compositional-p1/ \
--input_vocab_path=input_vocabulary.txt \
--target_vocab_path=target_vocabulary.txt \
--attention_type=bahdanau \
--no_auxiliary_task \
--conditional_attention \
--output_directory=../../../testing_logs/p1-random-seed-44/ \
--resume_from_file=../../../training_logs/p1-random-seed-44/model_best.pth.tar \
--splits=dev \
--output_file_name=p1-random-seed-44.json \
--max_decoding_steps=120
Note that this is for --splits=dev; you can change it to --splits=test if you want to evaluate on the test splits.
After running this script, predictions will be written to a file in the output directory. Then, you can analyze the results by running the notebook performance-analysis.ipynb in the model folder!
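If you want a quick number without opening the notebook, the sketch below computes exact-match accuracy from the predictions file. It assumes the file is a json list whose entries have "prediction" and "target" fields; those field names and the file path are assumptions, so check performance-analysis.ipynb for the exact keys the evaluation script writes.

```python
import json

# Minimal sketch: exact-match accuracy over a predictions file.
# The "prediction"/"target" field names and the path are assumptions.
predictions = json.load(open("testing_logs/p1-random-seed-44/p1-random-seed-44.json", "r"))
exact_matches = sum(1 for p in predictions if p["prediction"] == p["target"])
print(f"Exact match: {100.0 * exact_matches / len(predictions):.2f}%")
```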
This model is also published and evaluated on gSCAN in this paper, with code in this repo. You can refer to their repo for details about the model. Here, we have already adapted the interface changes needed to run with ReaSCAN, so you can simply run training with the following lines,
cd code/models/gSCAN_with_language_conditioned_embedding
CUDA_VISIBLE_DEVICES=0 python main_model.py \
--run p1-random-seed-44 \
--data_dir ./parsed_dataset-p1/ \
--seed 44 \
--txt
Note that the script above assumes that you have already parsed the dataset following the parsing helpers provided in the notebook read_reascan.ipynb.
After running this script, all models will be saved in the output directory. Then, you can evaluate the performance of this model with the following script,
cd code/models/gSCAN_with_language_conditioned_embedding
CUDA_VISIBLE_DEVICES=0 python eval_best_model.py \
--load ./output/p1-random-seed-44/model_best.pth.tar \
--data_dir ./parsed_dataset-p1/ \
--seed 44 \
--test_split dev
Note that this is for --test_split=dev; you can change it to --test_split=test if you want to evaluate on the test splits.
In this repo, we also provide many useful scripts for analyzing ReaSCAN in various ways. Here is a non-comprehensive list of them with their purposes,
- code/models/seq2seq/performance-analysis.ipynb: evaluate model performance.
- code/models/seq2seq/ReaSCAN-vocab-generator.ipynb: generate required vocab files.
- code/models/gSCAN_with_language_conditioned_embedding/read_reascan.ipynb: helper to parse the dataset into a model-readable format.
- code/experiments.sh: all bash scripts we run for our experiment results.
- code/dataset/demo.ipynb: demo file for all components involved in the ReaSCAN data generation process.
- code/dataset/unit_tests.ipynb: unit tests for ReaSCAN. If you want to customize ReaSCAN, please run these unit tests before changing anything.
- code/dataset/generate_ReaSCAN_splits.ipynb: generate splits for ReaSCAN.
- code/dataset/ReaSCAN-analysis.ipynb: some analyses we conduct in the paper.
ReaSCAN has a Creative Commons Attribution 4.0 International License.