Releases: cleanlab/label-errors

QuickDraw Dataset Cross-validated Predicted Probabilities

05 May 22:33
c84dfe8

We release the cross-validated predicted probabilities for the QuickDraw dataset. These probabilities were obtained using 4-fold cross-validation for all 50,426,266 examples and 345 classes. The resulting matrix of predicted probabilities (pyx, a numpy matrix) has shape 50426266 x 345. The resulting file is 33GB in np.float16 format.

Note, pyx is short for prob(y = label | data example x).
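
Because the matrix is large, it can help to memory-map it rather than read it fully into RAM. The following is a minimal sketch, assuming the extracted file is named quickdraw_pyx.npy (the name used in the compression command in this release) and that rows index examples and columns index classes. The helper names open_pyx and predicted_labels are illustrative, not part of any released code.

```python
import numpy as np

NUM_EXAMPLES = 50_426_266  # from the release notes
NUM_CLASSES = 345          # from the release notes


def open_pyx(path):
    """Memory-map the matrix so only the rows that are indexed get read from disk."""
    return np.load(path, mmap_mode="r")


def predicted_labels(pyx, start, stop):
    """Return the argmax class for rows [start, stop), cast up from float16."""
    chunk = np.asarray(pyx[start:stop], dtype=np.float32)
    return chunk.argmax(axis=1)


# Raw payload size implied by the shape and dtype (excludes the small .npy header).
expected_bytes = NUM_EXAMPLES * NUM_CLASSES * np.dtype(np.float16).itemsize
```

Here expected_bytes works out to 34,794,123,540 bytes, roughly 32.4 GiB, which is in line with the stated file size. After extraction, `pyx = open_pyx("quickdraw_pyx.npy")` followed by a check that `pyx.shape == (NUM_EXAMPLES, NUM_CLASSES)` is a cheap way to confirm the download is intact.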

Download the QuickDraw Cross-validated Predicted Probabilities as a numpy matrix.

Make sure pigz and wget are installed:

# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz

Download the pyx files

base_url="https://github.com/cgnorthcutt/label-errors/releases/download/"
base_filename="quickdraw-pyx-v1/quickdraw_pyx.tar.gz-parta"
for part in {a..k}; do
    wget --continue "$base_url$base_filename$part"
done
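
The loop above appends one letter from a to k to a file name that already ends in "parta", so it fetches the eleven parts partaa through partak. As an illustration, the same list of URLs can be built in Python. This is a sketch only, and the function name part_urls is hypothetical.

```python
import string

BASE_URL = "https://github.com/cgnorthcutt/label-errors/releases/download/"
BASE_FILENAME = "quickdraw-pyx-v1/quickdraw_pyx.tar.gz-parta"


def part_urls(first="a", last="k"):
    """Build one URL per part, mirroring the {a..k} brace expansion."""
    letters = string.ascii_lowercase
    suffixes = letters[letters.index(first): letters.index(last) + 1]
    return [BASE_URL + BASE_FILENAME + suffix for suffix in suffixes]
```

Printing `part_urls()` before downloading is a quick way to check the file names. Note that a plain Python download would not resume partial files the way `wget --continue` does.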

Decompress the tar.gz file parts into the final pyx numpy matrix:

cat quickdraw_pyx.tar.gz-part?? | unpigz | tar -xvC .
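
On systems without pigz, the same reassembly can be approximated with the Python standard library, since pigz writes ordinary gzip data. The sketch below is an assumption-laden alternative to the shell pipeline, not the method used for this release: ChainedReader and extract_parts are hypothetical helpers, and decompression is single-threaded, so it will be slower than unpigz.

```python
import glob
import tarfile


class ChainedReader:
    """Read several files back to back as if they were one stream."""

    def __init__(self, paths):
        self._paths = iter(paths)
        self._current = None

    def read(self, size):
        while True:
            if self._current is None:
                path = next(self._paths, None)
                if path is None:
                    return b""  # every part has been consumed
                self._current = open(path, "rb")
            data = self._current.read(size)
            if data:
                return data
            # Current part is exhausted; move on to the next one.
            self.close()

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_parts(pattern, dest="."):
    """Concatenate the parts in sorted order and untar them into dest."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    reader = ChainedReader(paths)
    try:
        # "r|gz" reads the gzip stream sequentially, without seeking.
        with tarfile.open(fileobj=reader, mode="r|gz") as archive:
            archive.extractall(dest)
    finally:
        reader.close()
    return paths
```

For example, `extract_parts("quickdraw_pyx.tar.gz-part??")` plays the role of the `cat ... | unpigz | tar -xvC .` pipeline. As with any archive extraction, only run it on files from a source you trust.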

Ancillary details

To compress the pyx probabilities file prior to uploading, we used the following command:

tar -I pigz -cvf - quickdraw_pyx.npy | split --bytes=1800M - "quickdraw_pyx.tar.gz-part"

Numpy AudioSet Embeddings Dataset

05 May 18:43
2405c37

This is a version of the AudioSet dataset formatted using only python lists and numpy matrices. The original dataset (formatted as tfrecords) is released here: https://research.google.com/audioset/download.html

We found pervasive errors in the test set of this dataset, and released corrected test sets here (see our paper).

Dataset Details

This dataset provides three things for each of the balanced train set, the unbalanced train set, and the eval/test set:

  • the features (as a list of numpy matrices)
    • each 10-second audio clip is represented by a 128-length, 8-bit quantized embedding for every 1 second of audio, resulting in a 128x10 matrix for the full 10 seconds of audio
  • the labels (as a list of multi-label lists)
    • there are 527 unique labels, denoted as 0, 1, ..., 526
  • the video ids of each example (as a list of lists). Use these to map to the corrected test sets and label errors released at https://github.com/cgnorthcutt/label-errors.
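
For training, the multi-label lists are often converted to fixed-length indicator vectors. The following sketch shows one way to do that using only the standard library, assuming each example's labels are a list of integers in 0..526 as described above. The function name to_multi_hot is illustrative and not part of the released code.

```python
NUM_LABELS = 527  # labels are 0, 1, ..., 526 per the release notes


def to_multi_hot(example_labels, num_labels=NUM_LABELS):
    """Turn one example's label list into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for label in example_labels:
        if not 0 <= label < num_labels:
            raise ValueError(f"label {label} is outside 0..{num_labels - 1}")
        vector[label] = 1
    return vector
```

Applying it across a whole split is a list comprehension, e.g. `[to_multi_hot(labels) for labels in eval_labels]`, where eval_labels is a hypothetical name for the loaded label list. The range check doubles as a sanity test on the loaded data.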

Download the dataset

Make sure pigz and wget are installed:

# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz

Download the AudioSet files:

wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partab

Decompress the tar.gz file parts into the final dataset:

cat audioset_preprocessed.tar.gz-part?? | unpigz | tar -xvC .

Once decompressed, the preprocessed data should look like this:

preprocessed/
├── bal_train_features.p
├── bal_train_labels.p
├── bal_train_video_ids.p
├── eval_features.p
├── eval_labels.p
├── eval_video_ids.p
├── unbal_train_features.p
├── unbal_train_labels.p
└── unbal_train_video_ids.p
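
The file names follow a `<split>_<kind>.p` pattern, so the three files for a split can be loaded together. The sketch below assumes the .p files are Python pickles, which the extension suggests but the release notes do not state. The helper load_split is hypothetical.

```python
import os
import pickle

SPLITS = ("bal_train", "unbal_train", "eval")
KINDS = ("features", "labels", "video_ids")


def load_split(root, split):
    """Load features, labels, and video ids for one split (assumes pickle files)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    loaded = {}
    for kind in KINDS:
        path = os.path.join(root, f"{split}_{kind}.p")
        with open(path, "rb") as handle:
            loaded[kind] = pickle.load(handle)
    return loaded
```

For example, `load_split("preprocessed", "eval")["labels"]` would return the eval label lists. Unpickling can execute arbitrary code, so only load files from a source you trust. If the files turn out to have been written by Python 2, passing `encoding="latin1"` to `pickle.load` may be needed.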

Recreating this preprocessed dataset from scratch

The original dataset is provided using tfrecord formatting. To reformat the data to python lists of numpy matrices (for correcting test sets, viewing errors, and for training), you need to run this script: https://github.com/cgnorthcutt/label-errors/blob/main/examples/audioset_preprocessing.py

For example, using this script, you'd run:

mkdir preprocessed
cd preprocessed
python audioset_preprocessing.py --audioset-dir '/path/to/audioset/audioset_v1_embeddings/'

License

This preprocessed dataset is made available (Copyright (c) Curtis G. Northcutt) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The original AudioSet dataset is made available (Copyright (c) Google Inc.) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Amazon Reviews Dataset

05 May 13:50
4260126

The original Amazon5core dataset, previously available at http://jmcauley.ucsd.edu/data/amazon/index_2014.html, is no longer available in its original form with the same indices for each example. For reproducibility, and so that the indices of the label errors match the dataset, we host the Amazon5core dataset here.

We made four modifications (as compared to the original Amazon5core dataset):

  • Removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
  • Removed unhelpful reviews, i.e. we only kept reviews with more helpful votes than unhelpful votes.
  • Removed reviews with zero helpful upvotes.
  • Removed empty reviews.
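
The four filters can be read as a single predicate over a review. The following is an illustrative sketch, not the code used to build this dataset: the argument names are hypothetical, and treating whitespace-only text as empty is an assumption.

```python
KEPT_STARS = {1, 3, 5}


def keep_review(stars, helpful_votes, unhelpful_votes, text):
    """Apply the four filters listed above to a single review."""
    return (
        stars in KEPT_STARS                  # drop 2-star and 4-star reviews
        and helpful_votes > unhelpful_votes  # drop unhelpful reviews
        and helpful_votes > 0                # drop reviews with zero helpful upvotes
        and bool(text.strip())               # drop empty reviews (assumed: whitespace counts as empty)
    )
```

With non-negative vote counts, the second condition already implies the third. Both are kept here only to mirror the list one-to-one.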

The dataset has been prepared/formatted into fastText format, i.e. lines in the txt dataset file look like:

__label__5 I bought this for my husband who plays the piano.
__label__1 Both tutus were mailed in a flat plastic bag in a manila envelope.
__label__3 ...

The label number matches the number of stars (out of 5) associated with each review. Because 2-star and 4-star reviews were removed, only the labels 1, 3, and 5 appear.
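
To read this format back into (stars, text) pairs, each line can be split at the first space. This is a minimal sketch, assuming exactly one `__label__` tag at the start of every line, as in the examples above. The function name parse_line is illustrative.

```python
LABEL_PREFIX = "__label__"


def parse_line(line):
    """Split one fastText-style line into (stars, review_text)."""
    tag, _, text = line.rstrip("\n").partition(" ")
    if not tag.startswith(LABEL_PREFIX):
        raise ValueError(f"line does not start with {LABEL_PREFIX!r}: {line[:40]!r}")
    return int(tag[len(LABEL_PREFIX):]), text
```

Keeping the line order intact while parsing matters here, since the released label-error indices refer to positions in this file.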

Download the dataset files

Make sure pigz and wget are installed:

# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz

Download the Amazon5core reviews pre-prepared dataset files

wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partab

To combine the tar.gz file parts into the pre-prepared amazon5core.txt dataset:

cat amazon5core.tar.gz-part?? | unpigz | tar -xvC .