Gigaspeech2 #3
Merged: 14 commits (Dec 16, 2024)
2 changes: 2 additions & 0 deletions docs/corpus.rst
@@ -119,6 +119,8 @@ a CLI tool that creates the manifests given a corpus directory.
- :func:`lhotse.recipes.prepare_gale_mandarin`
* - GigaSpeech
- :func:`lhotse.recipes.prepare_gigaspeech`
* - GigaSpeech 2
- :func:`lhotse.recipes.prepare_gigaspeech2`
* - GigaST
- :func:`lhotse.recipes.prepare_gigast`
* - Heroico
8 changes: 4 additions & 4 deletions docs/datasets.rst
@@ -28,7 +28,7 @@ It allows for interesting collation methods - e.g. **padding the speech with noi

The items for mini-batch creation are selected by the ``Sampler``.
Lhotse defines ``Sampler`` classes that are initialized with :class:`~lhotse.cut.CutSet`'s, so that they can look up specific properties of an utterance to stratify the sampling.
For example, :class:`~lhotse.dataset.sampling.SimpleCutSampler` has a defined ``max_frames`` attribute, and it will keep sampling cuts for a batch until they do not exceed the specified number of frames.
For example, :class:`~lhotse.dataset.sampling.SimpleCutSampler` has a defined ``max_duration`` attribute, and it will keep sampling cuts for a batch until they do not exceed the specified number of seconds.
Another strategy — used in :class:`~lhotse.dataset.sampling.BucketingSampler` — will first group the cuts of similar durations into buckets, and then randomly select a bucket to draw the whole batch from.

For tasks where both input and output of the model are speech utterances, we can use the :class:`~lhotse.dataset.sampling.CutPairsSampler`, which accepts two :class:`~lhotse.cut.CutSet`'s and will match the cuts in them by their IDs.
@@ -38,11 +38,11 @@ A typical Lhotse dataset API usage might look like this:
.. code-block::

from torch.utils.data import DataLoader
from lhotse.dataset import SpeechRecognitionDataset, SimpleCutSampler
from lhotse.dataset import K2SpeechRecognitionDataset, SimpleCutSampler

cuts = CutSet(...)
dset = SpeechRecognitionDataset(cuts)
sampler = SimpleCutSampler(cuts, max_frames=50000)
dset = K2SpeechRecognitionDataset(cuts)
sampler = SimpleCutSampler(cuts, max_duration=500)
# Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
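The snippet above relies on dynamic batching: the sampler keeps drawing cuts until their total duration would exceed ``max_duration``. A minimal stdlib-only sketch of that rule (the ``Cut`` dataclass and ``sample_batch`` helper are illustrative, not lhotse's API):

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Cut:
    id: str
    duration: float  # seconds


def sample_batch(cuts: Iterator[Cut], max_duration: float) -> List[Cut]:
    """Greedily fill a batch until adding one more cut would exceed max_duration."""
    batch, total = [], 0.0
    for cut in cuts:
        if batch and total + cut.duration > max_duration:
            break  # adding this cut would exceed the duration budget
        batch.append(cut)
        total += cut.duration
    return batch


cuts = iter([Cut("a", 4.0), Cut("b", 3.0), Cut("c", 5.0)])
batch = sample_batch(cuts, max_duration=8.0)
# batch holds "a" and "b" (7.0 s total); "c" would push the total to 12.0 s
```

The real samplers add bookkeeping (epochs, shuffling, world-size sharding) on top of this core loop.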
17 changes: 17 additions & 0 deletions lhotse/audio/recording.py
@@ -168,6 +168,23 @@ def is_placeholder(self) -> bool:
def num_channels(self) -> int:
return len(self.channel_ids)

@property
def source_format(self) -> str:
"""Infer format of the audio sources.
If all sources have the same format, return it.
If sources have different formats, raise an error.
"""
source_formats = list(set([s.format for s in self.sources]))

if len(source_formats) == 1:
# if all sources have the same format, return it
return source_formats[0]
else:
# at the moment, we don't resolve different formats
raise NotImplementedError(
"Sources have different formats. Resolving to a single format not implemented."
)

@staticmethod
def from_file(
path: Pathlike,
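The new ``source_format`` property reduces the per-source formats to a single consensus value and refuses mixed recordings. A standalone sketch of that rule (``consensus_format`` is an illustrative helper, not part of lhotse):

```python
from typing import List


def consensus_format(formats: List[str]) -> str:
    """Return the single format shared by all sources, or fail on a mix."""
    unique = sorted(set(formats))
    if len(unique) == 1:
        # all sources agree, so the recording has one well-defined format
        return unique[0]
    raise NotImplementedError(
        "Sources have different formats. Resolving to a single format not implemented."
    )


assert consensus_format(["wav", "wav", "wav"]) == "wav"
```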
28 changes: 28 additions & 0 deletions lhotse/audio/source.py
@@ -1,3 +1,5 @@
import io
import os
import warnings
from dataclasses import dataclass
from io import BytesIO, FileIO
@@ -6,6 +8,7 @@
from typing import List, Optional, Tuple, Union

import numpy as np
import soundfile as sf
import torch

from lhotse.audio.backend import read_audio
@@ -64,6 +67,10 @@ class AudioSource:
def has_video(self) -> bool:
return self.video is not None

@property
def format(self) -> str:
return self._get_format()

def load_audio(
self,
offset: Seconds = 0.0,
@@ -316,3 +323,24 @@ def _prepare_for_reading(
)

return source

def _get_format(self) -> str:
"""Get format for the audio source.
If using 'file' or 'url' types, the format is inferred from the file extension, as in soundfile.
If using 'memory' type, the format is inferred from the binary data.
"""
if self.type in ("file", "url"):
# Resolve audio format based on the filename
format = os.path.splitext(self.source)[-1][1:]
return format.lower()
elif self.type == "memory":
sf_info = sf.info(io.BytesIO(self.source))
if sf_info.format == "OGG" and sf_info.subtype == "OPUS":
# soundfile describes opus as ogg container with opus coding
return "opus"
else:
return sf_info.format.lower()
else:
raise NotImplementedError(
f"Getting format not implemented for source type {self.type}"
)
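For ``'file'`` and ``'url'`` sources, ``_get_format`` simply lowercases the file extension. The ``'memory'`` branch (probing bytes with ``soundfile``) needs real encoded audio, so this stdlib sketch covers only the extension branch (``format_from_path`` is an illustrative name):

```python
import os


def format_from_path(path: str) -> str:
    """Infer an audio format from a path or URL extension, as the diff does."""
    # splitext returns ('.../utt1', '.WAV'); drop the dot and normalize case
    return os.path.splitext(path)[-1][1:].lower()


assert format_from_path("/data/audio/utt1.WAV") == "wav"
assert format_from_path("https://example.com/clip.opus") == "opus"
```

Note that a path with no extension yields an empty string, which callers would need to handle.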
1 change: 1 addition & 0 deletions lhotse/bin/modes/recipes/__init__.py
@@ -39,6 +39,7 @@
from .gale_arabic import *
from .gale_mandarin import *
from .gigaspeech import *
from .gigaspeech2 import *
from .gigast import *
from .grid import *
from .heroico import *
38 changes: 38 additions & 0 deletions lhotse/bin/modes/recipes/gigaspeech2.py
@@ -0,0 +1,38 @@
from typing import Optional, Sequence, Union

import click

from lhotse.bin.modes import prepare
from lhotse.recipes.gigaspeech2 import prepare_gigaspeech2
from lhotse.utils import Pathlike


@prepare.command(context_settings=dict(show_default=True))
@click.argument("corpus_dir", type=click.Path(exists=True, dir_okay=True))
@click.argument("output_dir", type=click.Path())
@click.option(
"-l",
"--languages",
default="auto",
help="Languages to prepare (scans CORPUS_DIR for language codes by default).",
)
@click.option(
"-j",
"--num-jobs",
type=int,
default=1,
help="How many threads to use (can give good speed-ups with slow disks).",
)
def gigaspeech2(
corpus_dir: Pathlike,
output_dir: Optional[Pathlike] = None,
languages: Union[str, Sequence[str]] = "auto",
num_jobs: int = 1,
):
"""GigaSpeech 2 data preparation."""
prepare_gigaspeech2(
corpus_dir=corpus_dir,
output_dir=output_dir,
languages=languages,
num_jobs=num_jobs,
)
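The ``--languages`` default is ``"auto"``, which the help text says scans ``CORPUS_DIR`` for language codes. One plausible way such a scan could work is to treat each top-level subdirectory name as a language code; this is an assumption for illustration, and the real ``prepare_gigaspeech2`` may resolve languages differently:

```python
import tempfile
from pathlib import Path
from typing import List


def scan_languages(corpus_dir: Path) -> List[str]:
    """Hypothetical 'auto' mode: each top-level subdirectory is a language code."""
    return sorted(p.name for p in corpus_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "th").mkdir()
    (root / "id").mkdir()
    (root / "README.md").touch()  # plain files are ignored by the scan
    langs = scan_languages(root)
# langs == ["id", "th"]
```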
4 changes: 2 additions & 2 deletions lhotse/bin/modes/shar.py
@@ -27,8 +27,8 @@ def shar():
"-a",
"--audio",
default="none",
type=click.Choice(["none", "wav", "flac", "mp3", "opus"]),
help="Format in which to export audio (disabled by default, enabling will make a copy of the data)",
type=click.Choice(["none", "wav", "flac", "mp3", "opus", "original"]),
help="Format in which to export audio. 'original' saves in the same format as the source audio (disabled by default; enabling makes a copy of the data)",
)
@click.option(
"-f",
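With the new ``original`` choice, the exporter presumably falls back to each recording's own source format instead of transcoding to a fixed codec. A sketch of that resolution rule (``resolve_audio_format`` is an illustrative helper, not lhotse's actual function):

```python
def resolve_audio_format(choice: str, source_format: str) -> str:
    """Map the CLI --audio choice to the on-disk format for one recording."""
    if choice == "original":
        # keep whatever format the recording already uses (e.g. from
        # Recording.source_format), avoiding a lossy re-encode
        return source_format
    return choice  # a concrete codec was requested; transcode to it


assert resolve_audio_format("flac", source_format="opus") == "flac"
assert resolve_audio_format("original", source_format="opus") == "opus"
```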
2 changes: 1 addition & 1 deletion lhotse/cut/data.py
@@ -723,7 +723,7 @@ def pad(
"""
Return a new MixedCut, padded with zeros in the recording, and ``pad_feat_value`` in each feature bin.

The user can choose to pad either to a specific `duration`; a specific number of frames `max_frames`;
The user can choose to pad either to a specific `duration`; a specific number of frames `num_frames`;
or a specific number of samples `num_samples`. The three arguments are mutually exclusive.

:param duration: The cut's minimal duration after padding.
2 changes: 1 addition & 1 deletion lhotse/cut/mixed.py
@@ -622,7 +622,7 @@ def pad(
"""
Return a new MixedCut, padded with zeros in the recording, and ``pad_feat_value`` in each feature bin.

The user can choose to pad either to a specific `duration`; a specific number of frames `max_frames`;
The user can choose to pad either to a specific `duration`; a specific number of frames `num_frames`;
or a specific number of samples `num_samples`. The three arguments are mutually exclusive.

:param duration: The cut's minimal duration after padding.
2 changes: 1 addition & 1 deletion lhotse/cut/padding.py
@@ -236,7 +236,7 @@ def pad(
"""
Return a new MixedCut, padded with zeros in the recording, and ``pad_feat_value`` in each feature bin.

The user can choose to pad either to a specific `duration`; a specific number of frames `max_frames`;
The user can choose to pad either to a specific `duration`; a specific number of frames `num_frames`;
or a specific number of samples `num_samples`. The three arguments are mutually exclusive.

:param duration: The cut's minimal duration after padding.
2 changes: 1 addition & 1 deletion lhotse/cut/set.py
@@ -2821,7 +2821,7 @@ def pad(
"""
Return a new MixedCut, padded with zeros in the recording, and ``pad_feat_value`` in each feature bin.

The user can choose to pad either to a specific `duration`; a specific number of frames `max_frames`;
The user can choose to pad either to a specific `duration`; a specific number of frames `num_frames`;
or a specific number of samples `num_samples`. The three arguments are mutually exclusive.

:param cut: DataCut to be padded.
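The corrected ``pad`` docstrings say ``duration``, ``num_frames``, and ``num_samples`` are mutually exclusive. A sketch of that validation, assuming exactly one must be given (``check_pad_args`` is an illustrative helper, not lhotse's implementation):

```python
def check_pad_args(duration=None, num_frames=None, num_samples=None):
    """Require exactly one of the three mutually exclusive pad targets."""
    given = [a for a in (duration, num_frames, num_samples) if a is not None]
    if len(given) != 1:
        raise ValueError(
            "Expected exactly one of duration, num_frames, or num_samples."
        )


check_pad_args(duration=10.0)  # OK: a single target is given
```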
2 changes: 1 addition & 1 deletion lhotse/dataset/audio_tagging.py
@@ -78,7 +78,7 @@ def __init__(
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
"""
Return a new batch, with the batch size automatically determined using the constraints
of max_frames and max_cuts.
of max_duration and max_cuts.
"""
self.hdf5_fix.update()

4 changes: 2 additions & 2 deletions lhotse/dataset/sampling/bucketing.py
@@ -30,7 +30,7 @@ class BucketingSampler(CutSampler):
... # BucketingSampler specific args
... sampler_type=SimpleCutSampler, num_buckets=20,
... # Args passed into SimpleCutSampler
... max_frames=20000
... max_duration=200
... )

Bucketing sampler with 20 buckets, sampling pairs of source-target cuts::
@@ -40,7 +40,7 @@
... # BucketingSampler specific args
... sampler_type=CutPairsSampler, num_buckets=20,
... # Args passed into CutPairsSampler
... max_source_frames=20000, max_target_frames=15000
... max_source_duration=200, max_target_duration=150
... )
"""

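The docstring above describes bucketing: group cuts of similar duration so batches need little padding. A sketch of the bucket-assignment idea with hand-picked boundaries (``BucketingSampler`` derives its own boundaries from the data; ``assign_bucket`` is illustrative):

```python
from typing import List


def assign_bucket(duration: float, boundaries: List[float]) -> int:
    """Return the index of the first bucket whose upper bound fits the cut."""
    for i, upper in enumerate(boundaries):
        if duration <= upper:
            return i
    return len(boundaries)  # final open-ended bucket for the longest cuts


boundaries = [2.0, 5.0, 10.0]
assert assign_bucket(1.5, boundaries) == 0
assert assign_bucket(7.0, boundaries) == 2
assert assign_bucket(30.0, boundaries) == 3
```

A batch is then drawn from a single bucket, so its cuts have similar durations.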
8 changes: 4 additions & 4 deletions lhotse/dataset/sampling/cut_pairs.py
@@ -12,10 +12,10 @@ class CutPairsSampler(CutSampler):
It expects that both CutSet's strictly consist of Cuts with corresponding IDs.
It behaves like an iterable that yields lists of strings (cut IDs).

When one of :attr:`max_frames`, :attr:`max_samples`, or :attr:`max_duration` is specified,
When one of :attr:`max_source_duration`, :attr:`max_target_duration`, or :attr:`max_cuts` is specified,
the batch size is dynamic.
Exactly zero or one of those constraints can be specified.
Padding required to collate the batch does not contribute to max frames/samples/duration.
Padding required to collate the batch does not contribute to max_source_duration/max_target_duration.
"""

def __init__(
@@ -229,7 +229,7 @@ def _next_batch(self) -> Tuple[CutSet, CutSet]:
self.source_constraints.add(next_source_cut)
self.target_constraints.add(next_target_cut)

# Did we exceed the max_source_frames and max_cuts constraints?
# Did we exceed the max_source_duration and max_cuts constraints?
if (
not self.source_constraints.exceeded()
and not self.target_constraints.exceeded()
@@ -249,7 +249,7 @@ def _next_batch(self) -> Tuple[CutSet, CutSet]:
# and return the cut anyway.
warnings.warn(
"The first cut drawn in batch collection violates one of the max_... constraints; "
"we'll return it anyway. Consider increasing max_source_frames/max_cuts/etc."
"we'll return it anyway. Consider increasing max_source_duration/max_cuts/etc."
)
source_cuts.append(next_source_cut)
target_cuts.append(next_target_cut)
2 changes: 1 addition & 1 deletion lhotse/dataset/sampling/dynamic.py
@@ -335,7 +335,7 @@ def detuplify(
else next_cut_or_tpl
)

# Did we exceed the max_frames and max_cuts constraints?
# Did we exceed the max_duration and max_cuts constraints?
if self.constraint.close_to_exceeding():
# Yes. Finish sampling this batch.
if self.constraint.exceeded() and len(cuts) == 1:
12 changes: 6 additions & 6 deletions lhotse/dataset/sampling/simple.py
@@ -11,10 +11,10 @@ class SimpleCutSampler(CutSampler):
Samples cuts from a CutSet to satisfy the input constraints.
It behaves like an iterable that yields lists of strings (cut IDs).

When one of :attr:`max_frames`, :attr:`max_samples`, or :attr:`max_duration` is specified,
When one of :attr:`max_duration` or :attr:`max_cuts` is specified,
the batch size is dynamic.
Exactly zero or one of those constraints can be specified.
Padding required to collate the batch does not contribute to max frames/samples/duration.
Padding required to collate the batch does not contribute to max duration.

Example usage::

@@ -197,10 +197,10 @@ def _next_batch(self) -> CutSet:
self.diagnostics.discard_single(next_cut)
continue

# Track the duration/frames/etc. constraints.
# Track the duration/etc. constraints.
self.time_constraint.add(next_cut)

# Did we exceed the max_frames and max_cuts constraints?
# Did we exceed the max_duration and max_cuts constraints?
if not self.time_constraint.exceeded():
# No - add the next cut to the batch, and keep trying.
cuts.append(next_cut)
@@ -215,9 +215,9 @@
# and return the cut anyway.
warnings.warn(
"The first cut drawn in batch collection violates "
"the max_frames, max_cuts, or max_duration constraints - "
"the max_duration or max_cuts constraints - "
"we'll return it anyway. "
"Consider increasing max_frames/max_cuts/max_duration."
"Consider increasing max_duration/max_cuts."
)
cuts.append(next_cut)

2 changes: 1 addition & 1 deletion lhotse/dataset/sampling/weighted_simple.py
@@ -15,7 +15,7 @@ class WeightedSimpleCutSampler(SimpleCutSampler):
When performing sampling, it avoids having duplicated cuts in the same batch.
The sampler terminates if the number of sampled cuts reaches :attr:`num_samples`.

When one of :attr:`max_frames`, :attr:`max_samples`, or :attr:`max_duration` is specified,
When one of :attr:`max_duration`, or :attr:`max_cuts` is specified,
the batch size is dynamic.

Example usage:
2 changes: 1 addition & 1 deletion lhotse/dataset/speech_recognition.py
@@ -94,7 +94,7 @@ def __init__(
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
"""
Return a new batch, with the batch size automatically determined using the constraints
of max_frames and max_cuts.
of max_duration and max_cuts.
"""
validate_for_asr(cuts)

2 changes: 1 addition & 1 deletion lhotse/dataset/speech_translation.py
@@ -97,7 +97,7 @@ def __init__(
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
"""
Return a new batch, with the batch size automatically determined using the constraints
of max_frames and max_cuts.
of max_duration and max_cuts.
"""
validate_for_asr(cuts)
self.hdf5_fix.update()
2 changes: 1 addition & 1 deletion lhotse/dataset/surt.py
@@ -170,7 +170,7 @@ def __init__(
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
"""
Return a new batch, with the batch size automatically determined using the constraints
of max_frames and max_cuts.
of max_duration and max_cuts.
"""
validate_for_asr(cuts)

2 changes: 1 addition & 1 deletion lhotse/parallel.py
@@ -88,7 +88,7 @@ class ParallelExecutor:

>>> class MyRunner:
... def __init__(self):
... self.name = name
... pass
... def __call__(self, x):
... return f'processed: {x}'
...
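The docstring fix above removes a stray ``self.name = name`` that referenced an undefined variable in the ``MyRunner`` example. The same pattern — a callable object mapped over inputs in parallel — can be reproduced with the stdlib alone (using ``concurrent.futures`` here rather than lhotse's ``ParallelExecutor``):

```python
from concurrent.futures import ThreadPoolExecutor


class MyRunner:
    def __init__(self):
        pass  # no per-instance state needed for this example

    def __call__(self, x):
        return f"processed: {x}"


# map the callable over inputs using a small thread pool
with ThreadPoolExecutor(max_workers=2) as ex:
    results = list(ex.map(MyRunner(), ["a", "b"]))

# results == ["processed: a", "processed: b"]
```

``ex.map`` preserves input order, so the results line up with the inputs even though they may be computed concurrently.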
2 changes: 2 additions & 0 deletions lhotse/recipes/__init__.py
@@ -35,6 +35,7 @@
from .gale_arabic import prepare_gale_arabic
from .gale_mandarin import prepare_gale_mandarin
from .gigaspeech import prepare_gigaspeech
from .gigaspeech2 import prepare_gigaspeech2
from .gigast import download_gigast, prepare_gigast
from .grid import download_grid, prepare_grid
from .heroico import download_heroico, prepare_heroico
@@ -152,6 +153,7 @@
"prepare_gale_arabic",
"prepare_gale_mandarin",
"prepare_gigaspeech",
"prepare_gigaspeech2",
"download_gigast",
"prepare_gigast",
"download_grid",