Efficiently process webdatasets
Before training, you will need to select the subset of samples you wish to use. Given a set of chosen samples, we create new shards containing only those samples, which the training code then consumes.
Each sample in our pool has a unique identifier, which is present both in the metadata parquets and in the json files inside the .tar shards.
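As a quick illustration, the uids for a subset can be collected directly from the metadata parquets. This is a minimal sketch, assuming the parquets live under a local metadata/ folder and expose the identifier as a uid column:

import glob
import pandas as pd

# Hypothetical layout: metadata parquets in a local "metadata/" folder, each
# exposing a "uid" column (the 128-bit hashes mentioned above).
uids = []
for parquet_path in sorted(glob.glob("metadata/*.parquet")):
    df = pd.read_parquet(parquet_path, columns=["uid"])
    uids.extend(df["uid"].tolist())

In practice you would filter these rows by whatever selection criteria you choose before collecting the uids.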
The format describing the subset of samples should be a numpy array of dtype numpy.dtype("u8,u8") (i.e. a structured array of pairs of unsigned 64-bit integers), with shape (subset_size,), containing the list of uids (128-bit hashes from the parquet files) in lexicographic sorted order, saved to disk in either npy format or memory-mapped format.
For instance, if you have a list of uids uids = ['139e4a9b22a614771f06c700a8ebe150', '6e356964a967af455c8016b75d691203'], you can store them by running the following python code:
import numpy as np

# Pack each 32-character hex uid into a pair of unsigned 64-bit integers, sort, and save.
processed_uids = np.array([(int(uid[:16], 16), int(uid[16:32], 16)) for uid in uids], np.dtype("u8,u8"))
processed_uids.sort()
np.save(out_filename, processed_uids)
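To verify the saved file, or to read it back later, it can be loaded fully or opened as a memory map; a small sketch, reusing out_filename from above:

subset = np.load(out_filename)                       # load the whole array into memory
subset_mmap = np.load(out_filename, mmap_mode="r")   # or memory-map the npy file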
After creating a subset, you may invoke the resharder to build the subset shards in $output_dir like so:
python resharder.py -i $download_dir -o $output_dir -s $subset_file
If desired, the resharder can be run in parallel on multiple nodes. The easiest way to do so is to split the input directory into smaller subfolders with fewer shards, and run a separate resharder job for each of them, each writing to its own output directory (see the sketch below).
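For example, here is a minimal sketch of such a split, assuming each input folder only needs the .tar shards, and that the paths, the chunk naming scheme, and the number of chunks are arbitrary choices made for illustration:

import glob
import os

download_dir = "/path/to/downloaded/shards"   # hypothetical paths
split_dir = "/path/to/split/inputs"
num_chunks = 4

shards = sorted(glob.glob(os.path.join(download_dir, "*.tar")))
for i, shard in enumerate(shards):
    chunk_dir = os.path.join(split_dir, f"chunk_{i % num_chunks:02d}")
    os.makedirs(chunk_dir, exist_ok=True)
    # Symlink instead of copying so the shards are not duplicated on disk.
    os.symlink(os.path.abspath(shard), os.path.join(chunk_dir, os.path.basename(shard)))

Each chunk directory can then be passed as -i to its own resharder job, with a matching per-chunk output directory as -o.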