QuixData is a wrapper for sharded tar-datasets with multiple modalities based on the WebDataset format. Currently maintained and used by the DSB group at the University of Oslo.
We use an API similar in style to WebDataset to provide a simple way of training with sharded datasets using standard PyTorch conventions. This lets entry-level users work with locally hosted sharded datasets without much hassle, and lets us better maintain a growing zoo of dataset formats across multiple HPC resources.
quixdata contains the `QuixDataset` class, which handles locally stored datasets using `.tar` shards, e.g. the WebDataset (WDS) format. The behaviour largely follows standard torch dataset implementations, but we adopt the WDS-style API for applying mappings after dataset initialization (e.g., `map`, `map_tuple`).
It is designed to sidestep some of the implementation choices WDS makes for online hosting, such as the missing length (`__len__`) and buffered sampling, which can be an issue for locally hosted datasets. These quirks make WDS slightly less attractive for entry-level use.
`QuixDataset` allows the data to be shuffled, batched, etc. using native `DataLoader` classes. We also simplify shuffling behaviour for tasks that require high stochasticity, such as contrastive learning.
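For illustration, a minimal sketch of this workflow is given below. The constructor arguments follow the filtering example later in this README; the assumption that `map_tuple` returns the dataset (so it can be reassigned) and the specific transforms are ours, not part of the documented API:

```python
import torch
from torchvision import transforms
from quixdata import QuixDataset

# Initialize from a local root containing the shards and a config file.
dataset = QuixDataset('DatasetName', '/path/to/data/', train=True)

# Unlike WDS, the dataset exposes a well-defined length.
print(len(dataset))

# Apply one callable per modality after initialization.
dataset = dataset.map_tuple(
    transforms.ToTensor(),  # image -> tensor
    lambda cls: cls,        # class label passed through unchanged
)

# Shuffling and batching are handled by the native PyTorch DataLoader.
loader = torch.utils.data.DataLoader(
    dataset, batch_size=32, shuffle=True, num_workers=4
)
```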
Instead of iterating sequentially over shards, which is unnecessary when files are hosted locally, QuixDataset builds an index of the byte offsets of all files in the shards, which can be serialized for faster subsequent initialization. These per-shard indices are then concatenated by taking the union over elements matching the supplied extensions.
`QuixDataset` relies on a supplied config file, formatted as JSON, containing info on the training/validation folds and the default extensions. Typically it is formatted as follows (example from ImageNet1k):
```json
{
    "train": "/train_{0000..0071}.tar",
    "val": "/val_{0000..0003}.tar",
    "extensions": ["jpg", "cls"],
    "metadata": {
        "num_classes": 1000,
        "num_train": 1281167,
        "num_val": 50000,
        "website": "https://www.image-net.org/"
    }
}
```
The shards are listed using brace expansion; e.g., `/train_{0000..0071}.tar` expands to `train_0000.tar` through `train_0071.tar`. In addition, the config file includes a set of default extensions, which can be overridden at initialisation. `QuixDataset` also allows customizable decoders for different extensions, which can be provided using the `override_decoders` argument.
Currently, the config file is required to be in the dataset directory. Its filename defaults to `config.json`, but can be specified explicitly.
A number of useful encoders/decoders for different modalities are included:
- PIL for image data files,
- CLS for integer indices / classification labels,
- RLE for a simple run-length encoding,
- SEG8 for 8-bit segmentation indices,
- SEG16 for 16-bit segmentation indices,
- SEG24 for 24-bit segmentation indices,
- SEG32 for 32-bit segmentation indices,
- JSON file support for nested objects, text, or bounding box data,
- NPY for numpy array encoding/decoding,
- MAT for MATLAB array encoding/decoding,
- PKL for pickled Python objects.
The encoders are currently featured in the `quixdata.encoders` submodule, which includes a list of default encoders and decoders, as well as an `EncoderDecoder` interface for easy implementation of custom file encoders and decoders.
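As an illustration, a custom codec might look something like the sketch below. The exact `EncoderDecoder` interface is not spelled out in this README, so the `extension` attribute and the `encode`/`decode` method names are assumptions, and the `Float16Codec` name and `f16` extension are hypothetical:

```python
import io
import numpy as np
from quixdata.encoders import EncoderDecoder

class Float16Codec(EncoderDecoder):
    """Hypothetical codec storing arrays as half-precision .npy bytes."""

    extension = 'f16'  # assumed: the file extension this codec handles

    def encode(self, array):
        buf = io.BytesIO()
        np.save(buf, np.asarray(array, dtype=np.float16))
        return buf.getvalue()

    def decode(self, raw):
        return np.load(io.BytesIO(raw))
```

Such a codec could then be supplied at initialization through the `override_decoders` argument described above.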
Looking up names in a tarfile is a bit inefficient for large shards. Instead, `QuixDataset` looks up the offsets for each file in all shards and generates an index. If an index is not provided, it is generated on the fly, but it can be serialized for faster subsequent initialization. This allows uncompressed tar shards to be quickly accessed in the `__getitem__` method.
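Conceptually, the indexing amounts to recording the payload offset and size of each member in a shard, so that samples can later be read with a seek instead of a linear scan. A minimal sketch of the idea, where the function names and index layout are illustrative rather than the actual internals:

```python
import tarfile

def build_offset_index(shard_path, extensions):
    """Map sample key -> {extension: (byte offset, size)} for one shard."""
    index = {}
    with tarfile.open(shard_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition('.')
            if ext in extensions:
                # offset_data points at the member's payload within the tar
                index.setdefault(key, {})[ext] = (member.offset_data, member.size)
    return index

def read_sample(shard_path, entry):
    """Read the raw bytes of one sample using the precomputed offsets."""
    out = {}
    with open(shard_path, 'rb') as f:
        for ext, (offset, size) in entry.items():
            f.seek(offset)
            out[ext] = f.read(size)
    return out
```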
The `quixdata.writers` submodule includes a set of tools for writing sharded files. The most practical tool for writing `QuixDataset`-compatible datasets is the `QuixWriter` class. This class is initialized with a name and a root folder, and opens two separate writers for the training and validation folds. The writers take a key and a set of modalities as a dictionary, and write these as a sample to the shard, e.g.:
```python
from quixdata.writers import QuixWriter

# Initialized with a dataset name and a root folder, as described above.
writer = QuixWriter('DatasetName', '/path/to/data/')

keyname = ...    # Unique sample key
img = ...        # Input image
seg = ...        # Semantic segmentation mask
inst = ...       # Instance segmentation mask
scenelabel = ... # Scene class label

objdict = {
    '__key__': keyname,
    'jpg': img,              # JPG files handled by PIL
    'semantic.seg16': seg,   # 16-bit semantic segmentation label
    'instance.seg8': inst,   # 8-bit instance segmentation label
    'scene.cls': scenelabel, # Scene classification label
}
writer.train.write(objdict)  # Handles writing to tar, sharding, etc...
```
The modalities can vary from sample to sample in the dataset, and the `QuixDataset` class will automagically compile a dataset to match the selected modalities when initialized. This means that samples can be effectively filtered using the `override_extensions` argument. For instance, if we only wanted samples with semantic segmentation labels, we could do:
```python
train_dataset = QuixDataset(
    'DatasetName',
    '/path/to/data/',
    override_extensions=('jpg', 'semantic.seg16'),
    train=True,
)
```
This initializes the dataset with only the selected modalities.
Planned improvements:

- Improve documentation.
- Expand support for modalities in encoders/decoders.
- Remove explicit train/val folds by supporting a `__fold__` dict key.
- Better support infinite data streams from online/network sources.