Releases: cleanlab/label-errors
QuickDraw Dataset Cross-validated Predicted Probabilities
We release the cross-validated predicted probabilities for the QuickDraw dataset. These probabilities were computed using 4-fold cross-validation over all 50,426,266 examples and 345 classes. The resulting predicted probability matrix (`pyx`, a numpy matrix) has shape 50426266 x 345 and is stored in `np.float16` format; the resulting file is 33GB. Note that `pyx` is short for prob(y = label | data example x).
Download the QuickDraw Cross-validated Predicted Probabilities as a numpy matrix.
Make sure `pigz` and `wget` are installed:
# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install pigz wget
Download the pyx files
base_url="https://github.com/cgnorthcutt/label-errors/releases/download/"
base_filename="quickdraw-pyx-v1/quickdraw_pyx.tar.gz-parta"
for part in {a..k}; do
    wget --continue "${base_url}${base_filename}${part}"
done
Decompress the tar.gz file parts into the final pyx numpy matrix:
cat quickdraw_pyx.tar.gz-part?? | unpigz | tar -xvC .
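Once extracted, the matrix can be inspected without reading all 33GB into RAM by memory-mapping it. A minimal sketch (the decompressed filename `quickdraw_pyx.npy` matches the compression command shown below):

```python
# Memory-map the 50426266 x 345 float16 matrix so the full 33GB file
# is never loaded into RAM at once.
import numpy as np

pyx = np.load("quickdraw_pyx.npy", mmap_mode="r")
print(pyx.shape, pyx.dtype)  # expected: (50426266, 345) float16

# pyx[i, k] is the predicted prob(y = k | example x_i), so the
# model's predicted class for example i is the row-wise argmax.
print(int(pyx[0].argmax()))
```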
Additional details
To compress the pyx probabilities file prior to uploading, we used the following command:
tar -I pigz -cvf - quickdraw_pyx.npy | split --bytes=1800M - "quickdraw_pyx.tar.gz-part"
Numpy AudioSet Embeddings Dataset
This is a version of the AudioSet dataset formatted using only python lists and numpy matrices. The original dataset (formatted as tfrecords) is released here: https://research.google.com/audioset/download.html
We found pervasive errors in the test set of this dataset, and released corrected test sets here (see our paper).
Dataset Details
This dataset provides three things for the balanced train set, the unbalanced train set, and the eval/test set:
- the features (as a list of numpy matrices)
  - each 10-second audio clip is represented as a 128-length, 8-bit quantized embedding for every 1 second of audio, resulting in a 128x10 matrix representation for all 10 seconds of audio
- the labels (as a list of multi-label lists)
  - there are 527 unique labels, denoted 0, 1, ..., 526
- the video ids of each example (as a list of lists). Use these to map to the corrected test sets and label errors released at https://github.com/cgnorthcutt/label-errors.
Download the dataset
Make sure `pigz` and `wget` are installed:
# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install pigz wget
Download the AudioSet files
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partab
Decompress the tar.gz file parts into the final dataset:
cat audioset_preprocessed.tar.gz-part?? | unpigz | tar -xvC .
Once decompressed, the preprocessed data should look like this:
preprocessed/
├── bal_train_features.p
├── bal_train_labels.p
├── bal_train_video_ids.p
├── eval_features.p
├── eval_labels.p
├── eval_video_ids.p
├── unbal_train_features.p
├── unbal_train_labels.p
└── unbal_train_video_ids.p
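The `.p` files can then be loaded directly. A minimal loading sketch, assuming they are standard Python pickles (the `.p` extension suggests this, but it is an assumption, not documented in the release):

```python
# Sketch: load the eval split and check it against the structure
# described in "Dataset Details" above. Assumes standard pickle files.
import pickle

with open("preprocessed/eval_features.p", "rb") as f:
    features = pickle.load(f)   # list of numpy matrices, one per clip
with open("preprocessed/eval_labels.p", "rb") as f:
    labels = pickle.load(f)     # list of multi-label lists of ints in 0..526
with open("preprocessed/eval_video_ids.p", "rb") as f:
    video_ids = pickle.load(f)  # map these to the corrected test sets

print(len(features), features[0].shape)  # one 128x10 matrix per 10s clip
print(labels[0], video_ids[0])
```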
Recreating this preprocessed dataset from scratch
The original dataset is provided in tfrecord format. To reformat the data into python lists of numpy matrices (for correcting test sets, viewing errors, and for training), run this script: https://github.com/cgnorthcutt/label-errors/blob/main/examples/audioset_preprocessing.py
For example, using this script, you'd run:
mkdir preprocessed
cd preprocessed
python audioset_preprocessing.py --audioset-dir '/path/to/audioset/audioset_v1_embeddings/'
License
This preprocessed dataset is made available (Copyright (c) Curtis G. Northcutt) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The original AudioSet dataset is made available (Copyright (c) Google Inc.) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Amazon Reviews Dataset
The original Amazon5core dataset, which was available here (http://jmcauley.ucsd.edu/data/amazon/index_2014.html), is no longer available in its original form with the same indices for each example. For reproducibility, and to match up the indices of the label errors with the dataset, we host the Amazon5core dataset here.
We made four modifications (as compared to the original amazon5core dataset); a code sketch of this filtering follows the list:
- Removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
- Removed unhelpful reviews, i.e. we only kept reviews with more helpful votes than unhelpful votes.
- Removed reviews with zero helpful upvotes.
- Removed empty reviews.
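For illustration, a hypothetical sketch of this filtering, assuming the original 5-core JSON-lines schema (an `overall` star rating, a `helpful` [helpful_votes, total_votes] pair, and a `reviewText` field; these field names come from the original McAuley release and are assumptions here, not part of this release):

```python
# Hypothetical sketch of the four filters, assuming the original 5-core
# JSON-lines schema: {"overall": 5.0, "helpful": [up, total],
# "reviewText": "..."}. Field names are assumptions, not from this release.
import json

def keep(review):
    up, total = review.get("helpful", [0, 0])
    down = total - up
    return (
        review["overall"] not in (2.0, 4.0)  # drop ambiguous 2- and 4-star
        and up > down                        # more helpful than unhelpful votes
        and up > 0                           # at least one helpful upvote
        and review.get("reviewText", "").strip() != ""  # drop empty reviews
    )

kept = []
with open("kcore_5.json") as f:  # hypothetical input filename
    for line in f:
        review = json.loads(line)
        if keep(review):
            kept.append(review)
```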
The dataset has been prepared/formatted in fastText format, i.e., lines in the txt dataset file look like:
__label__5 I bought this for my husband who plays the piano.
__label__1 Both tutus were mailed in a flat plastic bag in a manila envelope.
__label__3 ...
The label number matches the number of stars (out of 5) associated with each review. As a reminder, we removed 2-star and 4-star reviews because of ambiguity with 1-star and 5-star reviews, respectively.
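A minimal sketch for parsing these lines back into (stars, text) pairs:

```python
# Sketch: split a fastText-formatted line into its star rating and text.
def parse_line(line):
    label, _, text = line.rstrip("\n").partition(" ")
    assert label.startswith("__label__")
    return int(label[len("__label__"):]), text

with open("amazon5core.txt") as f:  # filename from the download step below
    stars, text = parse_line(next(f))
print(stars, text[:60])  # stars is 1, 3, or 5 (2- and 4-star were removed)
```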
Download the dataset files
Make sure `pigz` and `wget` are installed:
# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install pigz wget
Download the Amazon5core reviews pre-prepared dataset files
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/amazon-reviews-dataset/amazon5core.tar.gz-partab
Decompress the tar.gz file parts into the pre-prepared amazon5core.txt dataset:
cat amazon5core.tar.gz-part?? | unpigz | tar -xvC .
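Once extracted, `amazon5core.txt` can be consumed directly by fastText-style tooling; a sketch assuming the `fasttext` Python package (not part of this release):

```python
# Sketch: train a simple star-rating classifier on the extracted file.
# Assumes `pip install fasttext`; any fastText-compatible tool works too.
import fasttext

model = fasttext.train_supervised(input="amazon5core.txt")
labels, probs = model.predict("I bought this for my husband who plays the piano.")
print(labels, probs)  # e.g. ('__label__5',) with a confidence score
```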