This is the official GitHub repository for the following paper:
On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks.
Stephen Mussmann,* Robin Jia,* and Percy Liang.
Findings of EMNLP, 2020.
For more details on how to reproduce all the experiments in the paper, please see the associated CodaLab Worksheet.
- Install packages:
pip install -r requirements.txt
For faiss on Ubuntu, you may also have to install libopenblas-dev and libomp-dev:
sudo apt-get install libopenblas-dev libomp-dev
If you're using Docker, you can also refer to the Dockerfile in this repository, or just use this Docker image.
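If you go the Docker route, a typical workflow looks roughly like the following; the image tag and mount path here are placeholders for illustration, not official names:
docker build -t pairwise-active-learning .
docker run --rm -it --gpus all -v "$PWD":/workspace pairwise-active-learning bash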
- Download data:
bash pull-deps.sh
For full details on how to run all of our experiments, see the commands used in the aforementioned CodaLab Worksheet.
An example command to run uncertainty sampling on QQP is:
python3 src/active_learning.py --world qqp --output_dir my_model_directory --sampling_technique uncertainty --loss_type sgdbn
By default, this will save a checkpoint from every round of active learning, which takes up a lot of disk space. To save only the last checkpoint, pass the --save_last_model flag.
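For intuition, the selection rule behind uncertainty sampling can be sketched in a few lines (an illustrative simplification, not the code in src/active_learning.py): at each round, label the unlabeled pairs whose predicted probability of being a positive is closest to 0.5.

```python
import numpy as np

def select_uncertain(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Pick the indices of the `batch_size` most uncertain unlabeled pairs.

    probs: the model's predicted P(positive) for each unlabeled pair, shape (N,).
    """
    uncertainty = -np.abs(probs - 0.5)            # largest when P(positive) is near 0.5
    return np.argsort(uncertainty)[-batch_size:]  # indices of the most uncertain pairs

# Example: choose the 2 most uncertain of 5 candidate pairs.
probs = np.array([0.01, 0.48, 0.93, 0.55, 0.10])
print(select_uncertain(probs, 2))  # indices of the pairs with P(positive) 0.55 and 0.48
```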
After active learning finishes, you can evaluate the model on the test set by running:
python3 src/run_on_test.py --world qqp --load_dir my_model_directory/labels232100 --output_dir my_results_directory --sampling_technique uncertainty --loss_type sgdbn --test
Without the --test flag, this will run on the dev set.
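For example, to evaluate the same checkpoint on dev:
python3 src/run_on_test.py --world qqp --load_dir my_model_directory/labels232100 --output_dir my_results_directory --sampling_technique uncertainty --loss_type sgdbn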
The cl directory contains scripts that launch the CodaLab jobs used to reproduce our experiments.
You may find this useful if you also use CodaLab, or merely as documentation of what commands we ran.
An earlier version of this paper was submitted to ICML and also included experiments on two computer vision datasets, iNaturalist and CelebA.
The code is retained in src in case it becomes useful for future work; see image_embedder.py, celeba_reader.py, and inat_reader.py.
The data directory created by pull-deps.sh contains all the files needed to run on the exact same dataset splits and evaluation sets used in our paper.
These files should be used when reproducing or comparing to results from this paper.
We have also included scripts that can be used to generate new splits and evaluation sets. These may be useful for processing new datasets, or for measuring variance across different dataset splits.
- For QQP, prepare the initial partition of questions by running:
python3 src/setup_qqp.py <path_to_GLUE_QQP_dir> <out_dir>
- For either QQP or WikiQA, generate the set of evaluation pairs by running:
python3 src/make_dprcp.py --world [qqp|wikiqa] --random_pairs 10000000 # Pass --test for test set
This will use 10 million random pairs, in addition to positives and "nearby" negatives defined by faiss (see the sketch after this list).
- Generate stated datasets (all positives plus stated negatives) using the current train/dev/test split by running:
python3 src/create_data_files.py <out_dir> [qqp|wikiqa]
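To make the "nearby" negatives from the evaluation-pair step concrete, here is a minimal faiss sketch (not the repository's code; the embedding dimension and vectors are made up for illustration): nearest neighbors in embedding space are retrieved for each question, and retrieved pairs that are not labeled positives can be treated as nearby negative candidates.

```python
import faiss
import numpy as np

dim = 128                                                   # embedding dimension (placeholder)
embeddings = np.random.rand(10000, dim).astype("float32")   # stand-in for question embeddings

index = faiss.IndexFlatL2(dim)   # exact L2 nearest-neighbor index
index.add(embeddings)            # index every question embedding

k = 10                           # neighbors to retrieve per question
distances, neighbors = index.search(embeddings, k)
# neighbors[i] lists the k questions closest to question i in embedding space;
# any pair (i, j) from this list that is not a labeled positive is a
# "nearby" negative candidate.
```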