audios-to-dataset is a Rust CLI that turns a folder full of audio files into chunked DuckDB or Parquet datasets that mirror the layout expected by the Hugging Face datasets library. It is designed for fast local preparation of corpora before pushing the data to object storage or the Hub.
- Recursively scans your input directory (with optional MIME-type filtering) and trims the traversal depth to a configurable level.
- Batches audio into evenly sized databases with optional DuckDB or Parquet outputs and Hugging Face metadata embedded in the Parquet files.
- Pulls per-file transcripts from a companion CSV so you can ship audio + text pairs without post-processing.
- Parallelizes decoding and packing across a configurable number of threads to keep local ingestion fast.
- Populates audio duration by reading WAV headers when available; non-WAV formats stay in the dataset with a zero-duration fallback.
The binary is published as part of the repository; you can build it locally with Cargo:

```sh
cargo install --locked --path .
```

If you prefer to run it directly from source while developing:

```sh
cargo run --release -- --help
```

- Prepare an input directory that contains audio files (WAV, MP3, FLAC, OGG, AAC, …).
- (Optional) Create a CSV file with the columns `file_name` and `transcription` (plus an optional `relative_path`) if you want to attach transcripts.
- Run the CLI and point it at the input directory and an empty output folder:
```sh
audios-to-dataset \
  --input ./recordings \
  --output ./recordings-packed \
  --files-per-db 1000 \
  --format parquet
```

During execution the tool:
- creates the output directory when it is missing,
- splits the corpus into batches of `files-per-db` files,
- writes files named `0.parquet`, `1.parquet`, … (or `.duckdb`) into the output directory, and
- prints progress for each chunk.
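The batching arithmetic above is straightforward; a small sketch (the helper below is hypothetical, not part of the CLI) shows which shard names to expect for a given corpus size:

```python
import math

def shard_names(total_files: int, files_per_db: int, ext: str = "parquet") -> list[str]:
    """Names of the output shards for a corpus of total_files audio files."""
    count = math.ceil(total_files / files_per_db)
    return [f"{i}.{ext}" for i in range(count)]

# 2,300 recordings packed 1,000 per shard -> three shards.
print(shard_names(2300, 1000))  # ['0.parquet', '1.parquet', '2.parquet']
```

The last shard simply holds the remainder, so it may contain fewer than `files-per-db` files.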
Use `--metadata-file path/to/metadata.csv` to provide transcripts. The CSV is expected to have headers and contain:
| column | description |
|---|---|
| `file_name` | Base file name matching the audio files found during the scan. |
| `relative_path`* | Path (relative to `--input`) for disambiguating duplicate file names. |
| `transcription` | Text transcription that will be associated with the audio sample. |
* Optional. When you have multiple files with the same name under different subdirectories, include a `relative_path` column (use forward slashes) so each transcription maps to the correct file.
CSV rows without a matching audio file are skipped; audio files without a matching row are stored with `"-"` as a placeholder transcript.
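A minimal `metadata.csv` matching this layout can be produced with Python's standard-library `csv` module (the file names and transcripts below are made up for illustration):

```python
import csv

# Hypothetical rows: two files named a.wav in different subdirectories,
# disambiguated by relative_path (forward slashes, relative to --input).
rows = [
    {"file_name": "a.wav", "relative_path": "day1/a.wav", "transcription": "hello world"},
    {"file_name": "a.wav", "relative_path": "day2/a.wav", "transcription": "good morning"},
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "relative_path", "transcription"])
    writer.writeheader()  # header row is required by the CLI
    writer.writerows(rows)

print(open("metadata.csv").read().splitlines()[0])  # file_name,relative_path,transcription
```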
```
audios-to-dataset [OPTIONS] --input <INPUT> --output <OUTPUT>

Options:
      --input <INPUT>               Directory to scan (recursively by default)
      --format <FORMAT>             Output format [default: parquet] [duck-db, parquet]
      --files-per-db <N>            Number of audio files per output shard [default: 500]
      --max-depth-size <N>          Maximum recursion depth when scanning [default: 50]
      --check-mime-type             Skip files whose MIME type is not audio/*
      --num-threads <N>             Worker threads used for processing [default: 5]
      --output <OUTPUT>             Destination folder for `.parquet` / `.duckdb` files
      --parquet-compression <TYPE>  Compression (Snappy, Zstd, Gzip, …) [default: snappy]
      --metadata-file <CSV>         CSV with `file_name` + `transcription` columns (and optional `relative_path`)
  -h, --help                        Print help
  -V, --version                     Print version
```
- Parquet files contain a struct column named `audio` with `bytes`, `sampling_rate`, and `path` (relative to your `--input`), plus `duration` and `transcription` columns. Hugging Face-specific metadata is embedded in the Parquet schema so that the Hub Data Viewer recognizes the dataset automatically.
- DuckDB files create a `files` table with the same schema. Existing files in the output directory are replaced shard by shard.
- When `--check-mime-type` is enabled the CLI keeps a curated allow list of audio MIME types; others are skipped with a log line.
- Durations are derived from WAV headers; non-WAV files remain with `duration = 0.0`.
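The WAV-header duration computation can be sketched with Python's standard-library `wave` module — an illustration of the same frames-over-sample-rate arithmetic, not the tool's actual Rust implementation:

```python
import io
import wave

def wav_duration_seconds(data: bytes) -> float:
    """Duration in seconds derived from the WAV header: frame count / frame rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

# Build a one-second, 16 kHz, mono, 16-bit WAV in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)  # 16,000 silent frames

print(wav_duration_seconds(buf.getvalue()))  # 1.0
```

Because only the header is consulted, the duration is cheap to obtain; formats without such a header fall back to `0.0` as noted above.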
Produce DuckDB shards of 250 files and verify MIME types before packing:

```sh
audios-to-dataset \
  --format duckdb \
  --files-per-db 250 \
  --check-mime-type \
  --input ./speech-corpus \
  --output ./speech-corpus-duckdb
```

Create Parquet shards with Zstd compression and attach transcripts:
```sh
audios-to-dataset \
  --format parquet \
  --parquet-compression zstd \
  --metadata-file ./metadata-file.csv \
  --input ./speech-corpus \
  --output ./speech-corpus-parquet
```

Requirements: `cargo`, `rustc`, `cross`, `podman`, and `goreleaser`.
- Build the cross images and allocate more resources to Podman (once per machine):

  ```sh
  podman build --platform=linux/amd64 -f dockerfiles/Dockerfile.aarch64-unknown-linux-gnu -t aarch64-unknown-linux-gnu:my-edge .
  podman build --platform=linux/amd64 -f dockerfiles/Dockerfile.x86_64-unknown-linux-gnu -t x86_64-unknown-linux-gnu:my-edge .
  podman machine set --cpus 4 --memory 8192
  ```

- Produce the binaries with GoReleaser:

  ```sh
  goreleaser build --clean --snapshot --id audios-to-dataset --timeout 60m
  ```
Helpful automation is captured in the justfile:

```sh
just fmt      # rustfmt
just clippy   # cargo clippy --all-targets
just release  # cargo build --release
```

To render an inline audio player on the Hugging Face Hub Data Viewer, prepend this front matter to the dataset README:
```yaml
---
dataset_info:
  features:
    - name: audio
      dtype: audio
    - name: duration
      dtype: float64
    - name: transcription
      dtype: string
task_categories:
  - automatic-speech-recognition
tags:
  - audio
  - speech-processing
---
```
Distributed under the terms of the MIT License.