This repository contains data readers and examples for the three tracks of the Shifts Dataset and the Shifts Challenge.
The Shifts Dataset contains curated and labelled examples of real, 'in-the-wild' distributional shift across three large-scale tasks: tabular weather prediction, machine translation, and vehicle motion prediction. Dataset shift is ubiquitous in all of these tasks and modalities. The dataset, assessment metrics and benchmark results are detailed in our associated paper: Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks
If you use the Shifts Dataset in your work, please cite our paper using the following BibTeX entry:
@article{shifts2021,
author = {Malinin, Andrey and Band, Neil and Ganshin, Alexander and Chesnokov, German and Gal, Yarin and Gales, Mark J. F. and Noskov, Alexey and Ploskonosov, Andrey and Prokhorenkova, Liudmila and Provilkov, Ivan and Raina, Vatsal and Raina, Vyas and Roginskiy, Denis and Shmatova, Mariya and Tigar, Panos and Yangel, Boris},
title = {Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks},
journal = {arXiv preprint arXiv:2107.07455},
year = {2021},
}
If you have any questions about the Shifts Dataset, the paper or the benchmarks, please contact am969@yandex-team.ru.
The Shifts dataset is released under a mixed license.
The Shifts Weather Prediction Dataset is released under the CC BY NC SA 4.0 license. This dataset was constructed by combining features from publicly available weather prediction services and models. Specifically, we combined data from NOAA/NWS servers, data generated by the WRF model from NCAR/UCAR, and data from the Meteorological Service of Canada. Ground station readings were taken from [NOAA](https://www.weather.gov/disclaimer). The data was cleaned and the features were standardized.
The Shifts Machine Translation Dataset is released under a mixed license.
GlobalVoices evaluation data is released under CC BY NC SA 4.0.
The English source data was taken from GlobalVoices (originally licensed under CC BY 3.0), and the target Russian translations were provided by Yandex in-house professional translators.
The source-side text for the Reddit development and evaluation datasets exists under the terms of the Reddit API. The target-side Russian sentences were obtained by Yandex via in-house professional translators and are released under CC BY NC SA 4.0. Note that the development set source sentences are the same as those used in the MTNT dataset.
The Shifts SDC Motion Prediction Dataset is released under the CC BY NC SA 4.0 license.
As the Shifts Challenge is currently underway, we are only releasing the full training and development sets of the canonical partition for all tasks of the Shifts Dataset, as detailed in our paper. Evaluation data without ground-truth labels or metadata will be released on October 17th 2021. The evaluation data labels and ground-truth predictions, as well as the full Shifts Dataset, will become available on November 1st 2021, after the Shifts Challenge concludes.
By downloading the Shifts Dataset, you automatically agree to the licenses described above.
The canonical partition of the training and development data can be downloaded here.
The development data can be downloaded here. The training data for this task is the WMT'20 En-Ru dataset. It can be downloaded via the scripts provided here.
The canonical partition of the training and development data can be downloaded here.