# Alternatives

Audiomentations isn't the only Python library that can do various types of audio data augmentation/degradation! Here's an overview:
The GitHub stars, license, last commit and GPU support columns are rendered as live badges in the online version of this table; only the library names are listed here:

* audio-degradation-toolbox
* audio_degrader
* audiomentations
* audiotools
* auglib
* AugLy
* fast-audiomentations
* kapre
* muda
* nlpaug
* pedalboard
* pydiogment
* python-audio-effects
* SpecAugment
* spec_augment
* teal
* torch-audiomentations
* torchaudio-augmentations
* WavAugment
# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased

### Added

* AddColorNoise, Aliasing and BitCrush

### Changed

* Made Normalize, Mp3Compression and Limiter slightly faster
* Changed AssertionError exceptions to ValueError

#### The Shift transform has been changed:

* Removed the fade parameter. fade_duration=0.0 now denotes disabled fading.
* Renamed min_fraction to min_shift and max_fraction to max_shift
* Added a shift_unit parameter

These are breaking changes. The following example shows how you can adapt your code when upgrading from <=v0.32.0 to >=v0.33.0:
| <= 0.32.0 | >= 0.33.0 |
|---|---|
| Shift(min_fraction=-0.5, max_fraction=0.5, fade=True, fade_duration=0.01) | Shift(min_shift=-0.5, max_shift=0.5, shift_unit="fraction", fade_duration=0.01) |
| Shift() | Shift(fade_duration=0.0) |
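A minimal sketch of the new-style call, assuming the post-v0.33.0 parameter names shown in the table above:

```python
import numpy as np
from audiomentations import Shift

# Shift by up to half the clip length in either direction, with a short fade
augment = Shift(min_shift=-0.5, max_shift=0.5, shift_unit="fraction", fade_duration=0.01, p=1.0)

samples = np.random.uniform(low=-0.2, high=0.2, size=(16000,)).astype(np.float32)
shifted = augment(samples=samples, sample_rate=16000)
```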
* RepeatPart transform
* WrongMultichannelAudioShape. This allows some rare use cases where the number of channels slightly exceeds the number of samples.
* np.array instead of np.ndarray
* PitchShift, leading to much faster execution. This requires soxr.
* scipy requirement from 1.0 to 1.3
* AdjustDuration transform
* Mp3Compression
* apply_to parameter that can be set to "only_too_loud_sounds" in Normalize
* noise_rms from "relative" to "relative_to_whole_input" in AddShortNoises
* min_snr_in_db (from 0.0 to -6.0), max_snr_in_db (from 24.0 to 18.0), min_time_between_sounds (from 4.0 to 2.0) and max_time_between_sounds (from 16.0 to 8.0) in AddShortNoises
* Limiter raised an exception when it got digital silence as input
* RoomSimulator where the value of max_order was not respected
* FrequencyMask that had been deprecated since version 0.22.0. BandStopFilter is a good alternative.
* Limiter by ~8x
* Trim and ApplyImpulseResponse according to the warnings that were added in v0.23.0
* noise_rms in AddShortNoises is not specified - the default value will change from "relative" to "relative_to_whole_input" in a future version
* Lambda. Thanks to Thanatoz-1.
* Limiter. Thanks to pzelasko.
* RoomSimulator
* Shift robust to different sample rate inputs when parameters are frozen
* RoomSimulator would treat an x value as if it was y, and vice versa
* AirAbsorption transform
* TimeMask
* FutureWarning instead of UserWarning in Trim and ApplyImpulseResponse
* ApplyImpulseResponse, AddBackgroundNoise and AddShortNoises. Previously only a path to a folder was allowed.
* noise_transform in AddBackgroundNoise where some SNR calculations were done before the noise_transform was applied. This has sometimes led to incorrect SNR in the output. This changes the behavior of AddBackgroundNoise (when noise_transform is used).
* SevenBandParametricEQ transform
* noise_transform in AddShortNoises
* top_db and/or p in Trim are not specified because their default values will change in a future version
* LowShelfFilter and HighShelfFilter
* Padding transform
* RoomSimulator transform for simulating shoebox rooms using pyroomacoustics
* signal_gain_in_db_during_noise in AddShortNoises
* leave_length_unchanged in AddImpulseResponse now emits a warning, as the default value will change from False to True in a future version
* AddImpulseResponse alias. Use ApplyImpulseResponse instead.
* min_SNR and max_SNR in AddGaussianSNR
* AddBackgroundNoise, AddShortNoises and ApplyImpulseResponse
* GainTransition
* Mp3Compression, Resample and Trim
* "relative_to_whole_input" option for noise_rms parameter in AddShortNoises
* noise_transform in AddBackgroundNoise
* PitchShift by 6-18% when the input audio is stereo
* FrequencyMask in favor of BandStopFilter
* ApplyImpulseResponse, BandPassFilter, HighPassFilter and LowPassFilter
* BandStopFilter (similar to FrequencyMask, but with overhauled defaults and parameter randomization behavior), PeakingFilter, LowShelfFilter and HighShelfFilter
* add_all_noises_with_same_level in AddShortNoises
* BandPassFilter, LowPassFilter, HighPassFilter, to use scipy's butterworth filters instead of pydub. Now they have parametrized roll-off. Filters are now steeper than before by default - set min_rolloff=6, max_rolloff=6 to get the old behavior. They also support zero-phase filtering now. And they're at least ~25x times faster than before!
* wavio dependency for audio loading
* OneOf and SomeOf for applying one of or some of many transforms. Transforms are randomly chosen every call. Inspired by augly. Thanks to Cangonin and iver56.
* apply_to_children (bool) in randomize_parameters, freeze_parameters and unfreeze_parameters in Compose and SpecCompose
* AddBackgroundNoise: noise_rms (defaults to "relative", which is the old behavior), min_absolute_rms_in_db and max_absolute_rms_in_db. This may be a breaking change if you used AddBackgroundNoise with positional arguments in earlier versions of audiomentations! Please use keyword arguments to be on the safe side - it should be backwards compatible then.
* pydub import which was accidentally introduced in v0.18.0. pydub is considered an optional dependency and is imported only on demand now.
* TanhDistortion. Thanks to atamazian and iver56.
* noise_rms parameter to AddShortNoises. It defaults to relative, which is the old behavior. absolute allows for adding loud noises to parts that are relatively silent in the input.
* BandPassFilter, HighPassFilter, LowPassFilter and Reverse. Thanks to atamazian.
* fade option in Shift for eliminating unwanted clicks
* Clip. Thanks to atamazian.
* AddGaussianNoise
* AddImpulseResponse to ApplyImpulseResponse. The former will still work for now, but give a warning.
* AddImpulseResponse, AddBackgroundNoise and AddShortNoises, follow symlinks by default.
* min_snr_in_db and max_snr_in_db in AddGaussianSNR, SNRs will be picked uniformly in the decibel scale instead of in the linear amplitude ratio scale. The new behavior aligns more with human hearing, which is not linear.
* AddImpulseResponse when input is digital silence (all zeros)
* AddGaussianSNR. It will continue working as before unless you switch to the new parameters min_snr_in_db and max_snr_in_db. If you use the old parameters, you'll get a warning.
* SpecCompose for applying a pipeline of spectrogram transforms. Thanks to omerferhatt.
* SpecChannelShuffle where it did not support more than 3 audio channels. Thanks to omerferhatt.
* leave_length_unchanged to AddImpulseResponse
* AddImpulseResponse, AddBackgroundNoise and AddShortNoises
* LoudnessNormalization
* randomize_parameters in Compose. Thanks to SolomidHero.
* AddGaussianNoise, AddGaussianSNR, ClippingDistortion, FrequencyMask, PitchShift, Shift, TimeMask and TimeStretch
* SpecChannelShuffle and SpecFrequencyMask
* Normalize
* AddBackgroundNoise, AddShortNoises and AddImpulseResponse by loading wav files with scipy or wavio instead of librosa
* Mp3Compression
* Gain and PolarityInversion
* librosa versions
* from audiomentations import calculate_rms, now you have to do from audiomentations.core.utils import calculate_rms
* Gain and PolarityInversion. Thanks to Spijkervet for the inspiration.
* AddBackgroundNoise and AddShortNoises by optimizing the implementation of calculate_rms
* Normalize. Thanks to ZFTurbo.
* AddImpulseResponse, AddBackgroundNoise and AddShortNoises now support aiff files in addition to flac, mp3, ogg and wav
* AddImpulseResponse, AddBackgroundNoise and AddShortNoises now include subfolders when searching for files. This is useful when your sound files are organized in subfolders.
* FrequencyMask. Thanks to kvilouras.
* Shift. This allows for introducing silence instead of a wrapped part of the sound.
* AddImpulseResponse
* AddBackgroundNoise transform. Useful for when you want to add background noise to all of your sound. You need to give it a folder of background noises to choose from.
* AddShortNoises. Useful for when you want to add (bursts of) short noise sounds to your input audio.
* AddImpulseResponse significantly faster.
* ClippingDistortion where the min_percentile_threshold was not respected as expected.
* Composer
* Resample transformation
* ClippingDistortion transformation
* fade parameter to TimeMask
* Thanks to askskro
* AddGaussianSNR, AddImpulseResponse, FrequencyMask, TimeMask, Trim
* Thanks to karpnv
* Shift transform
* PitchShift transform
* AddGaussianNoise
* leave_length_unchanged in TimeStretch
* TimeStretch transform
* AddGaussianNoise
* AddGaussianNoise
# CPU vs. GPU: Which to use for online data augmentation when training audio ML models?

When training an audio machine learning model that includes online data augmentation as part of the training pipeline, you can choose to run the transforms on CPU or GPU. While some libraries, such as torch-audiomentations, support GPU, audiomentations is CPU-only. So, which one is better? The answer is: it depends.
There are several advantages to using CPU-only data augmentation libraries like audiomentations:

There are also advantages to running audio augmentation transforms on GPU, for example, with the help of torch-audiomentations:

In summary, whether to use CPU-only libraries like audiomentations or GPU-accelerated libraries like torch-audiomentations depends on the specific requirements of your model and the available hardware. If your model training pipeline doesn't utilize your GPU(s) fully, running transforms on GPU might be the best choice. However, if your model's GPU utilization is already very high, running the transforms on multiple CPU threads might be the best option. It boils down to checking where your bottleneck is.
# Multichannel audio array shapes

When working with audio files in Python, you may encounter two main formats for representing the data, especially when you are dealing with stereo (or multichannel) audio. These formats correspond to the shape of the numpy ndarray that holds the audio data.

## 1. Channels-first format

This format has the shape (channels, samples)
. In the context of a stereo audio file, the number of channels would be 2 (for left and right), and samples are the individual data points in the audio file. For example, a stereo audio file with a duration of 1 second sampled at 44100 Hz would have a shape of (2, 44100)
.
This is the format expected by audiomentations when dealing with multichannel audio. If you provide multichannel audio data in a different format, a WrongMultichannelAudioShape
exception will be raised.
Note that audiomentations
also supports mono audio, i.e. shape like (1, samples)
or (samples,).
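As a quick sanity check, here is a minimal sketch (the transform choice and buffer sizes are arbitrary) that feeds channels-first stereo audio through a transform:

```python
import numpy as np
from audiomentations import Gain

# Stereo buffer in channels-first format: 2 channels, 1 second at 44100 Hz
stereo_samples = np.random.uniform(low=-0.5, high=0.5, size=(2, 44100)).astype(np.float32)

transform = Gain(p=1.0)
processed = transform(samples=stereo_samples, sample_rate=44100)
print(processed.shape)  # (2, 44100)
```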
## 2. Channels-last format

This format has the shape (samples, channels)
. Using the same stereo file example as above, the shape would be (44100, 2)
. This format is commonly returned by the soundfile
library when loading a stereo wav file, because channels last is the inherent data layout of a stereo wav file. This layout is the default in stereo wav files because it facilitates streaming audio, where data must be read and played back sequentially.
Different libraries in Python may return audio data in different formats. For instance, librosa
by default returns a mono ndarray, whereas soundfile
will return a multichannel ndarray in channels-last format when loading a stereo wav file.
Here is an example of how to load a file with each:
```python
import librosa
import soundfile as sf

# Librosa, mono
y, sr = librosa.load("stereo_audio_example.wav", sr=None, mono=True)
print(y.shape)  # (117833,)

# Librosa, multichannel
y, sr = librosa.load("stereo_audio_example.wav", sr=None, mono=False)
print(y.shape)  # (2, 117833)

# Soundfile
y, sr = sf.read("stereo_audio_example.wav")
print(y.shape)  # (117833, 2)
```
## Converting between formats

If you have audio data in the channels-last format but need it in channels-first format, you can easily convert it using the transpose operation of numpy ndarrays:
```python
import numpy as np

# Assuming y is your audio data in channels-last format
y_transposed = np.transpose(y)

# Alternative, shorter syntax:
y_transposed = y.T
```
Now, y_transposed
will be in channels-first format and can be used with audiomentations
.
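Putting the pieces together, a minimal end-to-end sketch (the file name is just an example):

```python
import numpy as np
import soundfile as sf
from audiomentations import Gain

y, sr = sf.read("stereo_audio_example.wav")  # channels-last: (num_samples, 2)
y = y.T.astype(np.float32)                   # channels-first: (2, num_samples)

augmented = Gain(p=1.0)(samples=y, sample_rate=sr)
```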
# Transform parameters

You can access the parameters
property of a transform. Code example:
```python
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
import numpy as np

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])

# Generate 2 seconds of dummy audio for the sake of example
samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)

# Augment/transform/perturb the audio data
augmented_samples = augment(samples=samples, sample_rate=16000)

for transform in augment.transforms:
    print(f"{transform.__class__.__name__}: {transform.parameters}")
```
When running the example code above, it may print something like this:

```
AddGaussianNoise: {'should_apply': True, 'amplitude': 0.0027702725003923272}
TimeStretch: {'should_apply': True, 'rate': 1.158377360016495}
PitchShift: {'should_apply': False}
Shift: {'should_apply': False}
```
## How to apply a transform with the same parameters to multiple inputs

This technique can be useful if you want to transform e.g. a target sound in the same way as an input sound. Code example:
```python
from audiomentations import Gain
import numpy as np

augment = Gain(p=1.0)

samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)
samples2 = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)

augmented_samples = augment(samples=samples, sample_rate=16000)
augment.freeze_parameters()
print(augment.parameters)
augmented_samples2 = augment(samples=samples2, sample_rate=16000)
print(augment.parameters)
augment.unfreeze_parameters()
```
When running the example code above, it may print something like this:

```
{'should_apply': True, 'amplitude_ratio': 0.9688148624484364}
{'should_apply': True, 'amplitude_ratio': 0.9688148624484364}
```
In other words, this means that both sounds (samples
and samples2
) were gained by the same amount.
# Audiomentations

A Python library for audio data augmentation. Inspired by albumentations. Useful for deep learning. Runs on CPU. Supports mono audio and multichannel audio. Can be integrated in training pipelines in e.g. Tensorflow/Keras or Pytorch. Has helped people get world-class results in Kaggle competitions. Is used by companies making next-generation audio products.

Need a Pytorch-specific alternative with GPU support? Check out torch-audiomentations!
## Setup

```
pip install audiomentations
```

Some features have extra dependencies. Extra python package dependencies can be installed by running:

```
pip install audiomentations[extras]
```
| Feature | Extra dependencies |
|---|---|
| Limiter | cylimiter |
| LoudnessNormalization | pyloudnorm |
| Mp3Compression | ffmpeg and [pydub or lameenc] |
| RoomSimulator | pyroomacoustics |
Note: ffmpeg
can be installed via e.g. conda or from the official ffmpeg download page.
## Usage examples

### Waveform

```python
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
import numpy as np

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])

# Generate 2 seconds of dummy audio for the sake of example
samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)

# Augment/transform/perturb the audio data
augmented_samples = augment(samples=samples, sample_rate=16000)
```
### Spectrogram

```python
from audiomentations import SpecCompose, SpecChannelShuffle, SpecFrequencyMask
import numpy as np

augment = SpecCompose(
    [
        SpecChannelShuffle(p=0.5),
        SpecFrequencyMask(p=0.5),
    ]
)

# Example spectrogram with 1025 frequency bins, 256 time steps and 2 audio channels
spectrogram = np.random.random((1025, 256, 2))

# Augment/transform/perturb the spectrogram
augmented_spectrogram = augment(spectrogram)
```
## Waveform transforms

For a list and explanation of all waveform transforms, see Waveform transforms in the menu.

Waveform transforms can be visualized (for understanding) by the audio-transformation-visualization GUI (made by phrasenmaeher), where you can upload your own input wav file.

## Spectrogram transforms

For a list and brief explanation of all spectrogram transforms, see Spectrogram transforms.
## Composition classes

### Compose

Compose applies the given sequence of transforms when called, optionally shuffling the sequence for every call.

### SpecCompose

Same as Compose, but for spectrogram transforms.

### OneOf

OneOf randomly picks one of the given transforms when called, and applies that transform.

### SomeOf

SomeOf randomly picks several of the given transforms when called, and applies those transforms.
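A minimal sketch of how OneOf and SomeOf can be used; the constructor signatures below are assumed to mirror the albumentations-style API (a list of transforms, plus a count or count range for SomeOf):

```python
import numpy as np
from audiomentations import OneOf, SomeOf, Gain, PolarityInversion, TimeStretch

samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)

# Apply exactly one of the given transforms, chosen randomly per call
apply_one = OneOf([Gain(p=1.0), PolarityInversion(p=1.0)])
output1 = apply_one(samples=samples, sample_rate=16000)

# Apply 1 or 2 of the given transforms, chosen randomly per call
apply_some = SomeOf((1, 2), [Gain(p=1.0), PolarityInversion(p=1.0), TimeStretch(p=1.0)])
output2 = apply_some(samples=samples, sample_rate=16000)
```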
## Known limitations

Contributions are welcome!
## Multichannel audio

As of v0.22.0, all transforms except AddBackgroundNoise and AddShortNoises support not only mono audio (1-dimensional numpy arrays), but also stereo audio, i.e. 2D arrays with shape like (num_channels, num_samples). See also the guide on multichannel audio array shapes.
## Acknowledgements

Thanks to Nomono for backing audiomentations.
Thanks to all contributors who help improve audiomentations.
"},{"location":"spectrogram_transforms/#specchannelshuffle","title":"SpecChannelShuffle
","text":"Added in v0.13.0
Shuffle the channels of a multichannel spectrogram. This can help combat positional bias.
"},{"location":"spectrogram_transforms/#specfrequencymask","title":"SpecFrequencyMask
","text":"Added in v0.13.0
Mask a set of frequencies in a spectrogram, \u00e0 la Google AI SpecAugment. This type of data augmentation has proved to make speech recognition models more robust.
The masked frequencies can be replaced with either the mean of the original values or a given constant (e.g. zero).
"},{"location":"guides/cpu_vs_gpu/","title":"CPU vs. GPU: Which to use for online data augmentation when training audio ML models?","text":"When training an audio machine learning model that includes online data augmentation as part of the training pipeline, you can choose to run the transforms on CPU or GPU. While some libraries, such as torch-audiomentations, support GPU, audiomentations is CPU-only. So, which one is better? The answer is: it depends.
"},{"location":"guides/cpu_vs_gpu/#pros-of-using-cpu-only-libraries-like-audiomentations","title":"Pros of using CPU-only libraries like audiomentations","text":"There are several advantages to using CPU-only data augmentation libraries like audiomentations:
There are also advantages to running audio augmentation transforms on GPU, for example, with the help of torch-audiomentations :
In summary, whether to use CPU-only libraries like audiomentations or GPU-accelerated libraries like torch-audiomentations depends on the specific requirements of your model and the available hardware. If your model training pipeline doesn't utilize your GPU(s) fully, running transforms on GPU might be the best choice. However, if your model's GPU utilization is already very high, running the transforms on multiple CPU threads might be the best option. It boils down to checking where your bottleneck is.
"},{"location":"guides/multichannel_audio_array_shapes/","title":"Multichannel audio array shapes","text":"When working with audio files in Python, you may encounter two main formats for representing the data, especially when you are dealing with stereo (or multichannel) audio. These formats correspond to the shape of the numpy ndarray that holds the audio data.
"},{"location":"guides/multichannel_audio_array_shapes/#1-channels-first-format","title":"1. Channels-first format","text":"This format has the shape (channels, samples)
. In the context of a stereo audio file, the number of channels would be 2 (for left and right), and samples are the individual data points in the audio file. For example, a stereo audio file with a duration of 1 second sampled at 44100 Hz would have a shape of (2, 44100)
.
This is the format expected by audiomentations when dealing with multichannel audio. If you provide multichannel audio data in a different format, a WrongMultichannelAudioShape
exception will be raised.
Note that audiomentations
also supports mono audio, i.e. shape like (1, samples)
or (samples,)
This format has the shape (samples, channels)
. Using the same stereo file example as above, the shape would be (44100, 2)
. This format is commonly returned by the soundfile
library when loading a stereo wav file, because channels last is the inherent data layout of a stereo wav file. This layout is the default in stereo wav files because it facilitates streaming audio, where data must be read and played back sequentially.
Different libraries in Python may return audio data in different formats. For instance, librosa
by default returns a mono ndarray, whereas soundfile
will return a multichannel ndarray in channels-last format when loading a stereo wav file.
Here is an example of how to load a file with each:
import librosa\nimport soundfile as sf\n\n# Librosa, mono\ny, sr = librosa.load(\"stereo_audio_example.wav\", sr=None, mono=True)\nprint(y.shape) # (117833,)\n\n# Librosa, multichannel\ny, sr = librosa.load(\"stereo_audio_example.wav\", sr=None, mono=False)\nprint(y.shape) # (2, 117833)\n\n# Soundfile\ny, sr = sf.read(\"stereo_audio_example.wav\")\nprint(y.shape) # (117833, 2)\n
"},{"location":"guides/multichannel_audio_array_shapes/#converting-between-formats","title":"Converting between formats","text":"If you have audio data in the channels-last format but need it in channels-first format, you can easily convert it using the transpose operation of numpy ndarrays:
import numpy as np\n\n# Assuming y is your audio data in channels-last format\ny_transposed = np.transpose(y)\n\n# Alternative, shorter syntax:\ny_transposed = y.T\n
Now, y_transposed
will be in channels-first format and can be used with audiomentations
.
You can access the parameters
property of a transform. Code example:
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift\nimport numpy as np\n\naugment = Compose([\n AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),\n TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),\n PitchShift(min_semitones=-4, max_semitones=4, p=0.5),\n Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),\n])\n\n# Generate 2 seconds of dummy audio for the sake of example\nsamples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)\n\n# Augment/transform/perturb the audio data\naugmented_samples = augment(samples=samples, sample_rate=16000)\n\nfor transform in augment.transforms:\n print(f\"{transform.__class__.__name__}: {transform.parameters}\")\n
When running the example code above, it may print something like this:
AddGaussianNoise: {'should_apply': True, 'amplitude': 0.0027702725003923272}\nTimeStretch: {'should_apply': True, 'rate': 1.158377360016495}\nPitchShift: {'should_apply': False}\nShift: {'should_apply': False}\n
"},{"location":"guides/transform_parameters/#how-to-use-apply-a-transform-with-the-same-parameters-to-multiple-inputs","title":"How to use apply a transform with the same parameters to multiple inputs","text":"This technique can be useful if you want to transform e.g. a target sound in the same way as an input sound. Code example:
from audiomentations import Gain\nimport numpy as np\n\naugment = Gain(p=1.0)\n\nsamples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)\nsamples2 = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)\n\naugmented_samples = augment(samples=samples, sample_rate=16000)\naugment.freeze_parameters()\nprint(augment.parameters)\naugmented_samples2 = augment(samples=samples2, sample_rate=16000)\nprint(augment.parameters)\naugment.unfreeze_parameters()\n
When running the example code above, it may print something like this:
{'should_apply': True, 'amplitude_ratio': 0.9688148624484364}\n{'should_apply': True, 'amplitude_ratio': 0.9688148624484364}\n
In other words, this means that both sounds (samples
and samples2
) were gained by the same amount
AddBackgroundNoise
","text":"Added in v0.9.0
Mix in another sound, e.g. a background noise. Useful if your original sound is clean and you want to simulate an environment where background noise is present.
Can also be used for mixup when training classification/annotation models.
A path to a file/folder with sound(s), or a list of file/folder paths, must be specified. These sounds should ideally be at least as long as the input sounds to be transformed. Otherwise, the background sound will be repeated, which may sound unnatural.
Note that in the default case (noise_rms=\"relative\"
) the gain of the added noise is relative to the amount of signal in the input. This implies that if the input is completely silent, no noise will be added.
Optionally, the added noise sound can be transformed (with noise_transform
) before it gets mixed in.
Here are some examples of datasets that can be downloaded and used as background noise:
Here we add some music to a speech recording, targeting a signal-to-noise ratio (SNR) of 5 decibels (dB), which means that the speech (signal) is 5 dB louder than the music (noise).
Input sound Transformed sound"},{"location":"waveform_transforms/add_background_noise/#usage-examples","title":"Usage examples","text":"Relative RMSAbsolute RMSfrom audiomentations import AddBackgroundNoise, PolarityInversion\n\ntransform = AddBackgroundNoise(\n sounds_path=\"/path/to/folder_with_sound_files\",\n min_snr_in_db=3.0,\n max_snr_in_db=30.0,\n noise_transform=PolarityInversion(),\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
from audiomentations import AddBackgroundNoise, PolarityInversion\n\ntransform = AddBackgroundNoise(\n sounds_path=\"/path/to/folder_with_sound_files\",\n noise_rms=\"absolute\",\n min_absolute_rms_in_db=-45.0,\n max_absolute_rms_in_db=-15.0,\n noise_transform=PolarityInversion(),\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/add_background_noise/#addbackgroundnoise-api","title":"AddBackgroundNoise API","text":"sounds_path
: Union[List[Path], List[str], Path, str]
A path or list of paths to audio file(s) and/or folder(s) with audio files. Can be str or Path instance(s). The audio files given here are supposed to be background noises. min_snr_db
: float
\u2022 unit: Decibel Default: 3.0
. Minimum signal-to-noise ratio in dB. Is only used if noise_rms
is set to \"relative\"
max_snr_db
: float
\u2022 unit: Decibel Default: 30.0
. Maximum signal-to-noise ratio in dB. Is only used if noise_rms
is set to \"relative\"
min_snr_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_snr_db
instead max_snr_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use max_snr_db
instead noise_rms
: str
\u2022 choices: \"absolute\"
, \"relative\"
Default: \"relative\"
. Defines how the background noise will be added to the audio input. If the chosen option is \"relative\"
, the root mean square (RMS) of the added noise will be proportional to the RMS of the input sound. If the chosen option is \"absolute\"
, the background noise will have an RMS independent of the rms of the input audio file min_absolute_rms_db
: float
\u2022 unit: Decibel Default: -45.0
. Is only used if noise_rms
is set to \"absolute\"
. It is the minimum RMS value in dB that the added noise can take. The lower the RMS is, the lower the added sound will be. max_absolute_rms_db
: float
\u2022 unit: Decibel Default: -15.0
. Is only used if noise_rms
is set to \"absolute\"
. It is the maximum RMS value in dB that the added noise can take. Note that this value can not exceed 0. min_absolute_rms_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_absolute_rms_db
instead max_absolute_rms_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use max_absolute_rms_in_db
instead noise_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
Default: None
. A callable waveform transform (or composition of transforms) that gets applied to the noise before it gets mixed in. The callable is expected to input audio waveform (numpy array) and sample rate (int). p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. lru_cache_size
: int
Default: 2
. Maximum size of the LRU cache for storing noise files in memory"},{"location":"waveform_transforms/add_color_noise/","title":"AddColorNoise
","text":"To be added in v0.35.0
Mix in noise with color, optionally weighted by an A-weighting curve. When f_decay=0
, this is equivalent to AddGaussianNoise
. Otherwise, see: Colors of Noise .
min_snr_db
: float
\u2022 unit: Decibel Default: 5.0
. Minimum signal-to-noise ratio in dB. A lower number means more noise. max_snr_db
: float
\u2022 unit: Decibel Default: 40.0
. Maximum signal-to-noise ratio in dB. A greater number means less noise. min_f_decay
: float
\u2022 unit: Decibels/octave Default: -6.0
. Minimum noise decay in dB per octave. max_f_decay
: float
\u2022 unit: Decibels/octave Default: 6.0
. Maximum noise decay in dB per octave. Those values can be chosen from the following table:
Colourf_decay
(db/octave) pink -3.01 brown/brownian -6.02 red -6.02 blue 3.01 azure 3.01 violet 6.02 white 0.0 See Colors of noise on Wikipedia about those values.
p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. p_apply_a_weighting
: float
\u2022 range: [0.0, 1.0] Default: 0.0
. The probability of additionally weighting the transform using an A-weighting
curve. n_fft
: int
Default: 128
. The number of points the decay curve is computed (for coloring white noise)."},{"location":"waveform_transforms/add_gaussian_noise/","title":"AddGaussianNoise
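This page has no usage example yet, so here is a minimal sketch based on the parameters documented above (my_waveform_ndarray stands for your own audio buffer, as in the other usage examples on this site):

```python
from audiomentations import AddColorNoise

transform = AddColorNoise(
    min_snr_db=5.0,
    max_snr_db=40.0,
    min_f_decay=-6.0,
    max_f_decay=6.0,
    p=1.0,
)

augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
```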
","text":"Added in v0.1.0
Add gaussian noise to the samples
"},{"location":"waveform_transforms/add_gaussian_noise/#input-output-example","title":"Input-output example","text":"Here we add some gaussian noise (with amplitude 0.01) to a speech recording.
Input sound Transformed sound"},{"location":"waveform_transforms/add_gaussian_noise/#usage-example","title":"Usage example","text":"from audiomentations import AddGaussianNoise\n\ntransform = AddGaussianNoise(\n min_amplitude=0.001,\n max_amplitude=0.015,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/add_gaussian_noise/#addgaussiannoise-api","title":"AddGaussianNoise API","text":"min_amplitude
: float
\u2022 unit: linear amplitude Default: 0.001
. Minimum noise amplification factor. max_amplitude
: float
\u2022 unit: linear amplitude Default: 0.015
. Maximum noise amplification factor. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/add_gaussian_snr/","title":"AddGaussianSNR
","text":"Added in v0.7.0
The AddGaussianSNR
transform injects Gaussian noise into an audio signal. It applies a Signal-to-Noise Ratio (SNR) that is chosen randomly from a uniform distribution on the decibel scale. This choice is consistent with the nature of human hearing, which is logarithmic rather than linear.
SNR is a common measure used in science and engineering to compare the level of a desired signal to the level of noise. In the context of audio, the signal is the meaningful sound that you're interested in, like a person's voice, music, or other audio content, while the noise is unwanted sound that can interfere with the signal.
The SNR quantifies the ratio of the power of the signal to the power of the noise. The higher the SNR, the less the noise is present in relation to the signal.
Gaussian noise, a kind of white noise, is a type of statistical noise where the amplitude of the noise signal follows a Gaussian distribution. This means that most of the samples are close to the mean (zero), and fewer of them are farther away. It's called Gaussian noise due to its characteristic bell-shaped Gaussian distribution.
Gaussian noise is similar to the sound of a radio or TV tuned to a nonexistent station: a kind of constant, uniform hiss or static.
"},{"location":"waveform_transforms/add_gaussian_snr/#input-output-example","title":"Input-output example","text":"Here we add some gaussian noise (with SNR = 16 dB) to a speech recording.
Input sound Transformed sound"},{"location":"waveform_transforms/add_gaussian_snr/#usage-example","title":"Usage example","text":"from audiomentations import AddGaussianSNR\n\ntransform = AddGaussianSNR(\n min_snr_db=5.0,\n max_snr_db=40.0,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/add_gaussian_snr/#addgaussiansnr-api","title":"AddGaussianSNR API","text":"min_snr_db
: float
\u2022 unit: Decibel Default: 5.0
. Minimum signal-to-noise ratio in dB. A lower number means more noise. max_snr_db
: float
\u2022 unit: decibel Default: 40.0
. Maximum signal-to-noise ratio in dB. A greater number means less noise. min_snr_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_snr_db
instead max_snr_in_db
: float
\u2022 unit: decibel Deprecated as of v0.31.0. Use max_snr_db
instead p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/add_short_noises/","title":"AddShortNoises
","text":"Added in v0.9.0
Mix in various (bursts of overlapping) sounds with random pauses between. Useful if your original sound is clean and you want to simulate an environment where short noises sometimes occur.
A folder of (noise) sounds to be mixed in must be specified.
"},{"location":"waveform_transforms/add_short_noises/#input-output-example","title":"Input-output example","text":"Here we add some short noise sounds to a voice recording.
Input sound Transformed sound"},{"location":"waveform_transforms/add_short_noises/#usage-examples","title":"Usage examples","text":"Noise RMS relative to whole inputAbsolute RMSfrom audiomentations import AddShortNoises, PolarityInversion\n\ntransform = AddShortNoises(\n sounds_path=\"/path/to/folder_with_sound_files\",\n min_snr_in_db=3.0,\n max_snr_in_db=30.0,\n noise_rms=\"relative_to_whole_input\",\n min_time_between_sounds=2.0,\n max_time_between_sounds=8.0,\n noise_transform=PolarityInversion(),\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
from audiomentations import AddShortNoises, PolarityInversion\n\ntransform = AddShortNoises(\n sounds_path=\"/path/to/folder_with_sound_files\",\n min_absolute_noise_rms_db=-50.0,\n max_absolute_noise_rms_db=-20.0, \n noise_rms=\"absolute\",\n min_time_between_sounds=2.0,\n max_time_between_sounds=8.0,\n noise_transform=PolarityInversion(),\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/add_short_noises/#addshortnoises-api","title":"AddShortNoises API","text":"sounds_path
: Union[List[Path], List[str], Path, str]
A path or list of paths to audio file(s) and/or folder(s) with audio files. Can be str or Path instance(s). The audio files given here are supposed to be (short) noises. min_snr_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_snr_db
instead max_snr_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use max_snr_db
instead min_snr_db
: float
\u2022 unit: Decibel Default: -6.0
. Minimum signal-to-noise ratio in dB. A lower value means the added sounds/noises will be louder. This gets ignored if noise_rms
is set to \"absolute\"
. max_snr_db
: float
\u2022 unit: Decibel Default: 18.0
. Maximum signal-to-noise ratio in dB. A lower value means the added sounds/noises will be louder. This gets ignored if noise_rms
is set to \"absolute\"
. min_time_between_sounds
: float
\u2022 unit: seconds Default: 2.0
. Minimum pause time (in seconds) between the added sounds/noises max_time_between_sounds
: float
\u2022 unit: seconds Default: 8.0
. Maximum pause time (in seconds) between the added sounds/noises noise_rms
: str
\u2022 choices: \"absolute\"
, \"relative\"
, \"relative_to_whole_input\"
Default: \"relative\"
(<=v0.27), but will be changed to \"relative_to_whole_input\"
in a future version.
This parameter defines how the noises will be added to the audio input.
\"relative\"
: the RMS value of the added noise will be proportional to the RMS value of the input sound calculated only for the region where the noise is added.\"absolute\"
: the added noises will have an RMS independent of the RMS of the input audio file.\"relative_to_whole_input\"
: the RMS of the added noises will be proportional to the RMS of the whole input sound.min_absolute_noise_rms_db
: float
\u2022 unit: Decibel Default: -50.0
. Is only used if noise_rms
is set to \"absolute\"
. It is the minimum RMS value in dB that the added noise can take. The lower the RMS is, the lower will the added sound be. max_absolute_noise_rms_db
: float
\u2022 unit: seconds Default: -20.0
. Is only used if noise_rms
is set to \"absolute\"
. It is the maximum RMS value in dB that the added noise can take. Note that this value can not exceed 0. add_all_noises_with_same_level
: bool
Default: False
. Whether to add all the short noises (within one audio snippet) with the same SNR. If noise_rms
is set to \"absolute\"
, the RMS is used instead of SNR. The target SNR (or RMS) will change every time the parameters of the transform are randomized. include_silence_in_noise_rms_estimation
: bool
Default: True
. It chooses how the RMS of the noises to be added will be calculated. If this option is set to False, the silence in the noise files will be disregarded in the RMS calculation. It is useful for non-stationary noises where silent periods occur. burst_probability
: float
Default: 0.22
. For every noise that gets added, there is a probability of adding an extra burst noise that overlaps with the noise. This parameter controls that probability. min_pause_factor_during_burst
and max_pause_factor_during_burst
control the amount of overlap. min_pause_factor_during_burst
: float
Default: 0.1
. Min value of how far into the current sound (as fraction) the burst sound should start playing. The value must be greater than 0. max_pause_factor_during_burst
: float
Default: 1.1
. Max value of how far into the current sound (as fraction) the burst sound should start playing. The value must be greater than 0. min_fade_in_time
: float
\u2022 unit: seconds Default: 0.005
. Min noise fade in time in seconds. Use a value larger than 0 to avoid a \"click\" at the start of the noise. max_fade_in_time
: float
\u2022 unit: seconds Default: 0.08
. Max noise fade in time in seconds. Use a value larger than 0 to avoid a \"click\" at the start of the noise. min_fade_out_time
: float
\u2022 unit: seconds Default: 0.01
. Min sound/noise fade out time in seconds. Use a value larger than 0 to avoid a \"click\" at the end of the sound/noise. max_fade_out_time
: float
\u2022 unit: seconds Default: 0.1
. Max sound/noise fade out time in seconds. Use a value larger than 0 to avoid a \"click\" at the end of the sound/noise. signal_gain_in_db_during_noise
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use signal_gain_db_during_noise
instead signal_gain_db_during_noise
: float
\u2022 unit: Decibel Default: 0.0
. Gain applied to the signal during a short noise. When fading the signal to the custom gain, the same fade times are used as for the noise, so it's essentially cross-fading. The default value (0.0) means the signal will not be gained. If set to a very low value, e.g. -100.0, this feature could be used for completely replacing the signal with the noise. This could be relevant in some use cases, for example:
noise_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
Default: None
. A callable waveform transform (or composition of transforms) that gets applied to noises before they get mixed in. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. lru_cache_size
: int
Default: 64
. Maximum size of the LRU cache for storing noise files in memory"},{"location":"waveform_transforms/adjust_duration/","title":"AdjustDuration
","text":"Added in v0.30.0
Trim or pad the audio to the specified length/duration in samples or seconds. If the input sound is longer than the target duration, pick a random offset and crop the sound to the target duration. If the input sound is shorter than the target duration, pad the sound so the duration matches the target duration.
This transform can be useful if you need audio with constant length, e.g. as input to a machine learning model. The reason for varying audio clip lengths can be e.g.
Here we input an audio clip and remove a part of the start and the end, so the length of the result matches the specified target length.
Input sound Transformed sound"},{"location":"waveform_transforms/adjust_duration/#usage-examples","title":"Usage examples","text":"Target length in samplesTarget duration in secondsfrom audiomentations import AdjustDuration\n\ntransform = AdjustDuration(duration_samples=60000, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
from audiomentations import AdjustDuration\n\ntransform = AdjustDuration(duration_seconds=3.75, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/adjust_duration/#adjustduration-api","title":"AdjustDuration API","text":"duration_samples
: int
\u2022 range: [0, \u221e) Target duration in number of samples. duration_seconds
: float
\u2022 range: [0.0, \u221e) Target duration in seconds. padding_mode
: str
\u2022 choices: \"silence\"
, \"wrap\"
, \"reflect\"
Default: \"silence\"
. Padding mode. Only used when audio input is shorter than the target duration. padding_position
: str
\u2022 choices: \"start\"
, \"end\"
Default: \"end\"
. The position of the inserted/added padding. Only used when audio input is shorter than the target duration. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/air_absorption/","title":"AirAbsorption
","text":"Added in v0.25.0
A lowpass-like filterbank with variable octave attenuation that simulates attenuation of high frequencies due to air absorption. This transform is parametrized by temperature, humidity, and the distance between audio source and microphone.
This is not a scientifically accurate transform but basically applies a uniform filterbank with attenuations given by:
att = exp(- distance * absorption_coefficient)
where distance
is the microphone-source assumed distance in meters and absorption_coefficient
is adapted from a lookup table by pyroomacoustics. It can also be seen as a lowpass filter with variable octave attenuation.
Note that since this transform mostly affects high frequencies, it is only suitable for audio with sufficiently high sample rate, like 32 kHz and above.
Note also that this transform only \"simulates\" the dampening of high frequencies, and does not attenuate according to the distance law. Gain augmentation needs to be done separately.
"},{"location":"waveform_transforms/air_absorption/#input-output-example","title":"Input-output example","text":"Here we input a high-quality speech recording and apply AirAbsorption
with an air temperature of 20 degrees Celsius, 70% humidity and a distance of 20 meters. One can see clearly in the spectrogram that the highs, especially above ~13 kHz, are rolled off in the output, but it may require a quiet room and some concentration to hear it clearly in the audio comparison.
from audiomentations import AirAbsorption\n\ntransform = AirAbsorption(\n min_distance=10.0,\n max_distance=50.0,\n p=1.0,\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n
"},{"location":"waveform_transforms/air_absorption/#airabsorption-api","title":"AirAbsorption API","text":"min_temperature
: float
\u2022 unit: Celsius \u2022 choices: [10.0, 20.0] Default: 10.0
. Minimum temperature in Celsius (can take a value of either 10.0 or 20.0) max_temperature
: float
\u2022 unit: Celsius \u2022 choices: [10.0, 20.0] Default: 20.0
. Maximum temperature in Celsius (can take a value of either 10.0 or 20.0) min_humidity
: float
\u2022 unit: percent \u2022 range: [30.0, 90.0] Default: 30.0
. Minimum humidity in percent (between 30.0 and 90.0) max_humidity
: float
\u2022 unit: percent \u2022 range: [30.0, 90.0] Default: 90.0
. Maximum humidity in percent (between 30.0 and 90.0) min_distance
: float
\u2022 unit: meters Default: 10.0
. Minimum microphone-source distance in meters. max_distance
: float
\u2022 unit: meters Default: 100.0
. Maximum microphone-source distance in meters. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/aliasing/","title":"Aliasing
","text":"To be added in v0.35.0
Downsample the audio to a lower sample rate by linear interpolation, without low-pass filtering it first, resulting in aliasing artifacts. You get aliasing artifacts when the input audio contains high-frequency content above the Nyquist frequency of the chosen target sample rate. Audio with frequencies above the Nyquist frequency cannot be reproduced accurately and gets \"reflected\"/mirrored to other frequencies. The aliasing artifacts \"replace\" the original high-frequency signals. The result can be described as coarse and metallic.
After the downsampling, the signal gets upsampled to the original sample rate again, so the length of the output is the same as the length of the input.
For more information, see
Here we target a sample rate of 12000 Hz. Note the vertical mirroring in the spectrogram in the transformed sound.
Input sound Transformed sound"},{"location":"waveform_transforms/aliasing/#usage-example","title":"Usage example","text":"from audiomentations import Aliasing\n\ntransform = Aliasing(min_sample_rate=8000, max_sample_rate=30000, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n
"},{"location":"waveform_transforms/aliasing/#aliasing-api","title":"Aliasing API","text":"min_sample_rate
: int
\u2022 unit: Hz \u2022 range: [2, \u221e) Minimum target sample rate to downsample to max_sample_rate
: int
\u2022 unit: Hz \u2022 range: [2, \u221e) Maximum target sample rate to downsample to p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/apply_impulse_response/","title":"ApplyImpulseResponse
","text":"Added in v0.7.0
This transform convolves the audio with a randomly selected (room) impulse response file.
ApplyImpulseResponse
is commonly used as a data augmentation technique that adds realistic-sounding reverb to recordings. This can for example make denoisers and speech recognition systems more robust to different acoustic environments and distances between the sound source and the microphone. It could also be used to generate roomy audio examples for the training of dereverberation models.
Convolution with an impulse response is a powerful technique in signal processing that can be employed to emulate the acoustic characteristics of specific environments or devices. This process can transform a dry recording, giving it the sonic signature of being played in a specific location or through a particular device.
What is an impulse response? An impulse response (IR) captures the unique acoustical signature of a space or object. It's essentially a recording of how a specific environment or system responds to an impulse (a short, sharp sound). By convolving an audio signal with an impulse response, we can simulate how that signal would sound in the captured environment.
Note that some impulse responses, especially those captured in larger spaces or from specific equipment, can introduce a noticeable delay when convolved with an audio signal. In some applications, this delay is a desirable property. However, in some other applications, the convolved audio should not have a delay compared to the original audio. If this is the case for you, you can align the audio afterwards with fast-align-audio , for example.
Impulse responses can be created using e.g. http://tulrich.com/recording/ir_capture/
Some datasets of impulse responses are publicly available:
Impulse responses are represented as audio (ideally wav) files in the given ir_path
.
It is also worth checking that your IR files have the same sample rate as your audio inputs: if they differ, the internal resampling will slow down execution, and some high frequencies may get lost.
"},{"location":"waveform_transforms/apply_impulse_response/#input-output-example","title":"Input-output example","text":"Here we make a dry speech recording quite reverbant by convolving it with a room impulse response
Input sound Transformed sound"},{"location":"waveform_transforms/apply_impulse_response/#usage-example","title":"Usage example","text":"from audiomentations import ApplyImpulseResponse\n\ntransform = ApplyImpulseResponse(ir_path=\"/path/to/sound_folder\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n
"},{"location":"waveform_transforms/apply_impulse_response/#applyimpulseresponse-api","title":"ApplyImpulseResponse API","text":"ir_path
: Union[List[Path], List[str], str, Path]
A path or list of paths to audio file(s) and/or folder(s) with audio files. Can be str
or Path
instance(s). The audio files given here are supposed to be (room) impulse responses. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. lru_cache_size
: int
Default: 128
. Maximum size of the LRU cache for storing impulse response files in memory. leave_length_unchanged
: bool
Default: True
. When set to True
, the tail of the sound (e.g. reverb at the end) will be chopped off so that the length of the output is equal to the length of the input."},{"location":"waveform_transforms/band_pass_filter/","title":"BandPassFilter
","text":"Added in v0.18.0, updated in v0.21.0
Apply band-pass filtering to the input audio. Filter steepness (6/12/18... dB / octave) is parametrized. Can also be set for zero-phase filtering (will result in a 6 dB drop at cutoffs).
"},{"location":"waveform_transforms/band_pass_filter/#input-output-example","title":"Input-output example","text":"Here we input a high-quality speech recording and apply BandPassFilter
with a center frequency of 2500 Hz and a bandwidth fraction of 0.8, which means that the bandwidth in this example is 2000 Hz, so the low frequency cutoff is 1500 Hz and the high frequency cutoff is 3500 Hz. One can see in the spectrogram that the high and the low frequencies are both attenuated in the output. If you listen to the audio example, you might notice that the transformed output almost sounds like a phone call from the time when phone audio was narrowband and mostly contained frequencies between ~300 and ~3400 Hz.
from audiomentations import BandPassFilter\n\ntransform = BandPassFilter(min_center_freq=100.0, max_center_freq=6000, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n
"},{"location":"waveform_transforms/band_pass_filter/#bandpassfilter-api","title":"BandPassFilter API","text":"min_center_freq
: float
\u2022 unit: hertz Default: 200.0
. Minimum center frequency in hertz max_center_freq
: float
\u2022 unit: hertz Default: 4000.0
. Maximum center frequency in hertz min_bandwidth_fraction
: float
\u2022 range: [0.0, 2.0] Default: 0.5
. Minimum bandwidth relative to center frequency max_bandwidth_fraction
: float
\u2022 range: [0.0, 2.0] Default: 1.99
. Maximum bandwidth relative to center frequency min_rolloff
: float
\u2022 unit: Decibels/octave Default: 12
. Minimum filter roll-off (in dB/octave). Must be a multiple of 6 max_rolloff
: float
\u2022 unit: Decibels/octave Default: 24
. Maximum filter roll-off (in dB/octave) Must be a multiple of 6 zero_phase
: bool
Default: False
. Whether filtering should be zero phase. When this is set to True
it will not affect the phase of the input signal but will sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If you absolutely want no phase distortions (e.g. want to augment an audio file with lots of transients, like a drum track), set this to True
. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/band_stop_filter/","title":"BandStopFilter
","text":"Added in v0.21.0
Apply band-stop filtering to the input audio. Also known as notch filter or band reject filter. It relates to the frequency mask idea in the SpecAugment paper . Center frequency gets picked in mel space, so it is somewhat aligned with human hearing, which is not linear. Filter steepness (6/12/18... dB / octave) is parametrized. Can also be set for zero-phase filtering (will result in a 6 dB drop at cutoffs).
Applying band-stop filtering as data augmentation during model training can aid in preventing overfitting to specific frequency relationships, helping to make the model robust to diverse audio environments and scenarios, where frequency losses can occur.
"},{"location":"waveform_transforms/band_stop_filter/#input-output-example","title":"Input-output example","text":"Here we input a speech recording and apply BandStopFilter
with a center frequency of 2500 Hz and a bandwidth fraction of 0.8, which means that the bandwidth in this example is 2000 Hz, so the low frequency cutoff is 1500 Hz and the high frequency cutoff is 3500 Hz. One can see in the spectrogram of the transformed sound that the band stop filter has attenuated this frequency range. If you listen to the audio example, you can hear that the timbre is different in the transformed sound than in the original.
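A minimal usage sketch, using only parameters documented in the API below (the values are illustrative):
from audiomentations import BandStopFilter\n\ntransform = BandStopFilter(min_center_freq=200.0, max_center_freq=4000.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n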
min_center_freq
: float
\u2022 unit: hertz Default: 200.0
. Minimum center frequency in hertz max_center_freq
: float
\u2022 unit: hertz Default: 4000.0
. Maximum center frequency in hertz min_bandwidth_fraction
: float
Default: 0.5
. Minimum bandwidth relative to center frequency max_bandwidth_fraction
: float
Default: 1.99
. Maximum bandwidth relative to center frequency min_rolloff
: float
\u2022 unit: Decibels/octave Default: 12
. Minimum filter roll-off (in dB/octave). Must be a multiple of 6 max_rolloff
: float
\u2022 unit: Decibels/octave Default: 24
. Maximum filter roll-off (in dB/octave) Must be a multiple of 6 zero_phase
: bool
Default: False
. Whether filtering should be zero phase. When this is set to True
it will not affect the phase of the input signal but will sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If you absolutely want no phase distortions (e.g. want to augment an audio file with lots of transients, like a drum track), set this to True
. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/bit_crush/","title":"BitCrush
","text":"To be added in v0.35.0
Apply a bit crush effect to the audio by reducing the bit depth. In other words, it reduces the number of bits that can be used for representing each audio sample. This adds quantization noise, and affects dynamic range. This transform does not apply dithering.
For more information, see
Here we reduce the bit depth from 16 to 6 bits per sample
Input sound Transformed sound"},{"location":"waveform_transforms/bit_crush/#usage-example","title":"Usage example","text":"from audiomentations import BitCrush\n\ntransform = BitCrush(min_bit_depth=5, max_bit_depth=14, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/bit_crush/#bitcrush-api","title":"BitCrush API","text":"min_bit_depth
: int
\u2022 unit: bits \u2022 range: [1, 32] Minimum bit depth the audio will be \"converted\" to max_bit_depth
: int
\u2022 unit: bits \u2022 range: [1, 32] Maximum bit depth the audio will be \"converted\" to p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/clip/","title":"Clip
","text":"Added in v0.17.0
Clip audio by specified values. e.g. set a_min=-1.0
and a_max=1.0
to ensure that no samples in the audio exceed that extent. This can be relevant for avoiding integer overflow or underflow (which results in unintended wrap distortion that can sound horrible) when exporting to e.g. 16-bit PCM wav.
Another way of ensuring that all values stay between -1.0 and 1.0 is to apply PeakNormalization
.
This transform is different from ClippingDistortion
in that it takes fixed values for clipping instead of clipping a random percentile of the samples. Arguably, this transform is not very useful for data augmentation. Instead, think of it as a very cheap and harsh limiter (for samples that exceed the allotted extent) that can sometimes be useful at the end of a data augmentation pipeline.
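A minimal usage sketch with the documented defaults made explicit:
from audiomentations import Clip\n\ntransform = Clip(a_min=-1.0, a_max=1.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n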
a_min
: float
Default: -1.0
. Minimum value for clipping. a_max
: float
Default: 1.0
. Maximum value for clipping. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/clipping_distortion/","title":"ClippingDistortion
","text":"Added in v0.8.0
Distort signal by clipping a random percentage of points
The percentage of points that will be clipped is drawn from a uniform distribution between the two input parameters min_percentile_threshold
and max_percentile_threshold
. If for instance 30% is drawn, the samples are clipped if they're below the 15th or above the 85th percentile.
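A minimal usage sketch (the thresholds shown are the documented defaults):
from audiomentations import ClippingDistortion\n\ntransform = ClippingDistortion(min_percentile_threshold=0, max_percentile_threshold=40, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n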
min_percentile_threshold
: int
Default: 0
. A lower bound on the total percent of samples that will be clipped max_percentile_threshold
: int
Default: 40
. An upper bound on the total percent of samples that will be clipped p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/gain/","title":"Gain
","text":"Added in v0.11.0
Multiply the audio by a random amplitude factor to reduce or increase the volume. This technique can help a model become somewhat invariant to the overall gain of the input audio.
Warning: This transform can return samples outside the [-1, 1] range, which may lead to clipping or wrap distortion, depending on what you do with the audio in a later stage. See also https://en.wikipedia.org/wiki/Clipping_(audio)#Digital_clipping
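A minimal usage sketch (the gain range in dB is illustrative):
from audiomentations import Gain\n\ntransform = Gain(min_gain_db=-12.0, max_gain_db=12.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n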
"},{"location":"waveform_transforms/gain/#gain-api","title":"Gain API","text":"min_gain_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_gain_db
instead max_gain_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use max_gain_db
instead min_gain_db
: float
\u2022 unit: Decibel Default: -12.0
. Minimum gain. max_gain_db
: float
\u2022 unit: Decibel Default: 12.0
. Maximum gain. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/gain_transition/","title":"GainTransition
","text":"Added in v0.22.0
Gradually change the volume up or down over a random time span. Also known as fade in and fade out. The fade works on a logarithmic scale, which is natural to human hearing.
It works by picking two gains, a first gain and a second gain, and then picking a time range for the transition between those two gains. Note that this transition can start before the audio starts and/or end after the audio ends, so the output audio can start or end in the middle of a transition. The gain starts at the first gain and is held constant until the transition starts; it then transitions to the second gain, which is held constant until the end of the sound.
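A minimal usage sketch, assuming the transition duration is given in seconds (the values are illustrative):
from audiomentations import GainTransition\n\ntransform = GainTransition(min_gain_db=-24.0, max_gain_db=6.0, min_duration=0.2, max_duration=6.0, duration_unit=\"seconds\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n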
"},{"location":"waveform_transforms/gain_transition/#gaintransition-api","title":"GainTransition API","text":"min_gain_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use min_gain_db
instead max_gain_in_db
: float
\u2022 unit: Decibel Deprecated as of v0.31.0. Use max_gain_db
instead min_gain_db
: float
\u2022 unit: Decibel Default: -24.0
. Minimum gain. max_gain_db
: float
\u2022 unit: Decibel Default: 6.0
. Maximum gain. min_duration
: Union[float, int]
\u2022 unit: see duration_unit
Default: 0.2
. Minimum length of transition. max_duration
: Union[float, int]
\u2022 unit: see duration_unit
Default: 6.0
. Maximum length of transition. duration_unit
: str
\u2022 choices: \"fraction\"
, \"samples\"
, \"seconds\"
Default: \"seconds\"
. Defines the unit of the value of min_duration
and max_duration
.
\"fraction\"
: Fraction of the total sound length\"samples\"
: Number of audio samples\"seconds\"
: Number of secondsp
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/high_pass_filter/","title":"HighPassFilter
","text":"Added in v0.18.0, updated in v0.21.0
Apply high-pass filtering to the input audio with parametrized filter steepness (6/12/18... dB / octave). Can also be set for zero-phase filtering (will result in a 6 dB drop at the cutoff).
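A minimal usage sketch (the cutoff range in hertz is illustrative):
from audiomentations import HighPassFilter\n\ntransform = HighPassFilter(min_cutoff_freq=20.0, max_cutoff_freq=2400.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n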
"},{"location":"waveform_transforms/high_pass_filter/#highpassfilter-api","title":"HighPassFilter API","text":"min_cutoff_freq
: float
\u2022 unit: hertz Default: 20.0
. Minimum cutoff frequency max_cutoff_freq
: float
\u2022 unit: hertz Default: 2400.0
. Maximum cutoff frequency min_rolloff
: float
\u2022 unit: Decibels/octave Default: 12
. Minimum filter roll-off (in dB/octave). Must be a multiple of 6 max_rolloff
: float
\u2022 unit: Decibels/octave Default: 24
. Maximum filter roll-off (in dB/octave). Must be a multiple of 6 zero_phase
: bool
Default: False
. Whether filtering should be zero phase. When this is set to True
it will not affect the phase of the input signal but will sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If you absolutely want no phase distortions (e.g. want to augment an audio file with lots of transients, like a drum track), set this to True
. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/high_shelf_filter/","title":"HighShelfFilter
","text":"Added in v0.21.0
A high shelf filter is a filter that either boosts (increases amplitude) or cuts (decreases amplitude) frequencies above a certain center frequency. This transform applies a high-shelf filter at a specific center frequency in hertz. The gain at the Nyquist frequency is controlled by {min,max}_gain_db
(note: can be positive or negative!). Filter coefficients are taken from the W3 Audio EQ Cookbook
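A minimal usage sketch, using the documented parameter names (the values are illustrative):
from audiomentations import HighShelfFilter\n\ntransform = HighShelfFilter(min_center_freq=300.0, max_center_freq=7500.0, min_gain_db=-18.0, max_gain_db=18.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n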
min_center_freq
: float
\u2022 unit: hertz Default: 300.0
. The minimum center frequency of the shelving filter max_center_freq
: float
\u2022 unit: hertz Default: 7500.0
. The maximum center frequency of the shelving filter min_gain_db
: float
\u2022 unit: Decibel Default: -18.0
. The minimum gain at the Nyquist frequency max_gain_db
: float
\u2022 unit: Decibel Default: 18.0
. The maximum gain at the Nyquist frequency min_q
: float
\u2022 range: (0.0, 1.0] Default: 0.1
. The minimum quality factor Q. The higher the Q, the steeper the transition band will be. max_q
: float
\u2022 range: (0.0, 1.0] Default: 0.999
. The maximum quality factor Q. The higher the Q, the steeper the transition band will be. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/lambda/","title":"Lambda
","text":"Added in v0.26.0
Apply a user-defined transform (callable) to the signal. The inspiration for this transform comes from albumentations' Lambda transform. This allows one to have a little more fine-grained control over the operations in the context of a Compose
, OneOf
or SomeOf
import random\n\nfrom audiomentations import Lambda, OneOf, Gain\n\n\ndef gain_only_left_channel(samples, sample_rate):\n samples[0, :] *= random.uniform(0.8, 1.25)\n return samples\n\n\ntransform = OneOf(\n transforms=[Lambda(transform=gain_only_left_channel, p=1.0), Gain(p=1.0)]\n)\n\naugmented_sound = transform(my_stereo_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/lambda/#lambda-api","title":"Lambda API","text":"transform
: Callable
A callable to be applied. It should input samples (ndarray), sample_rate (int) and optionally some user-defined keyword arguments. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. **kwargs
Optional extra parameters passed to the callable transform"},{"location":"waveform_transforms/limiter/","title":"Limiter
","text":"Added in v0.26.0
The Limiter
, based on cylimiter , is a straightforward audio transform that applies dynamic range compression. It is capable of limiting the audio signal based on certain parameters. Additionally, please note that this transform introduces a slight delay in the signal, equivalent to a fraction of the attack time.
In this example we apply the limiter with a threshold that is 10 dB lower than the signal peak
Input sound Transformed sound"},{"location":"waveform_transforms/limiter/#usage-examples","title":"Usage examples","text":"Threshold relative to signal peakAbsolute thresholdfrom audiomentations import Limiter\n\ntransform = Limiter(\n min_threshold_db=-16.0,\n max_threshold_db=-6.0,\n threshold_mode=\"relative_to_signal_peak\",\n p=1.0,\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
from audiomentations import Limiter\n\ntransform = Limiter(\n min_threshold_db=-16.0,\n max_threshold_db=-6.0,\n threshold_mode=\"absolute\",\n p=1.0,\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/limiter/#limiter-api","title":"Limiter API","text":"min_threshold_db
: float
\u2022 unit: Decibel Default: -24.0
. Minimum threshold max_threshold_db
: float
\u2022 unit: Decibel Default: -2.0
. Maximum threshold min_attack
: float
\u2022 unit: seconds Default: 0.0005
. Minimum attack time max_attack
: float
\u2022 unit: seconds Default: 0.025
. Maximum attack time min_release
: float
\u2022 unit: seconds Default: 0.05
. Minimum release time max_release
: float
\u2022 unit: seconds Default: 0.7
. Maximum release time threshold_mode
: str
\u2022 choices: \"relative_to_signal_peak\"
, \"absolute\"
Default: relative_to_signal_peak
. Specifies the mode for determining the threshold.
\"relative_to_signal_peak\"
means the threshold is relative to peak of the signal.\"absolute\"
means the threshold is relative to 0 dBFS, so it doesn't depend on the peak of the signal.p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/loudness_normalization/","title":"LoudnessNormalization
","text":"Added in v0.14.0
Apply a constant amount of gain to match a specific loudness (in LUFS). This is an implementation of ITU-R BS.1770-4.
For an explanation on LUFS, see https://en.wikipedia.org/wiki/LUFS
See also the following web pages for more info on audio loudness normalization:
Warning: This transform can return samples outside the [-1, 1] range, which may lead to clipping or wrap distortion, depending on what you do with the audio in a later stage. See also https://en.wikipedia.org/wiki/Clipping_(audio)#Digital_clipping
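A minimal usage sketch (the LUFS targets shown are the documented defaults):
from audiomentations import LoudnessNormalization\n\ntransform = LoudnessNormalization(min_lufs=-31.0, max_lufs=-13.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n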
"},{"location":"waveform_transforms/loudness_normalization/#loudnessnormalization-api","title":"LoudnessNormalization API","text":"min_lufs_in_db
: float
\u2022 unit: LUFS Deprecated as of v0.31.0. Use min_lufs
instead max_lufs_in_db
: float
\u2022 unit: LUFS Deprecated as of v0.31.0. Use max_lufs
instead min_lufs
: float
\u2022 unit: LUFS Default: -31.0
. Minimum loudness target max_lufs
: float
\u2022 unit: LUFS Default: -13.0
. Maximum loudness target p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/low_pass_filter/","title":"LowPassFilter
","text":"Added in v0.18.0, updated in v0.21.0
Apply low-pass filtering to the input audio with parametrized filter steepness (6/12/18... dB / octave). Can also be set for zero-phase filtering (will result in a 6 dB drop at the cutoff).
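A minimal usage sketch (the cutoff range in hertz is illustrative):
from audiomentations import LowPassFilter\n\ntransform = LowPassFilter(min_cutoff_freq=150.0, max_cutoff_freq=7500.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n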
"},{"location":"waveform_transforms/low_pass_filter/#lowpassfilter-api","title":"LowPassFilter API","text":"min_cutoff_freq
: float
\u2022 unit: hertz Default: 150.0
. Minimum cutoff frequency max_cutoff_freq
: float
\u2022 unit: hertz Default: 7500.0
. Maximum cutoff frequency min_rolloff
: float
\u2022 unit: Decibels/octave Default: 12
. Minimum filter roll-off (in dB/octave). Must be a multiple of 6 max_rolloff
: float
\u2022 unit: Decibels/octave Default: 24
. Maximum filter roll-off (in dB/octave) Must be a multiple of 6 zero_phase
: bool
Default: False
. Whether filtering should be zero phase. When this is set to True
it will not affect the phase of the input signal but will sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If you absolutely want no phase distortions (e.g. want to augment an audio file with lots of transients, like a drum track), set this to True
. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/low_shelf_filter/","title":"LowShelfFilter
","text":"Added in v0.21.0
A low shelf filter is a filter that either boosts (increases amplitude) or cuts (decreases amplitude) frequencies below a certain center frequency. This transform applies a low-shelf filter at a specific center frequency in hertz. The gain at DC frequency is controlled by {min,max}_gain_db
(note: can be positive or negative!). Filter coefficients are taken from the W3 Audio EQ Cookbook
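A minimal usage sketch, using the documented parameter names (the values are illustrative):
from audiomentations import LowShelfFilter\n\ntransform = LowShelfFilter(min_center_freq=50.0, max_center_freq=4000.0, min_gain_db=-18.0, max_gain_db=18.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n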
min_center_freq
: float
\u2022 unit: hertz Default: 50.0
. The minimum center frequency of the shelving filter max_center_freq
: float
\u2022 unit: hertz Default: 4000.0
. The maximum center frequency of the shelving filter min_gain_db
: float
\u2022 unit: Decibel Default: -18.0
. The minimum gain at DC (0 Hz) max_gain_db
: float
\u2022 unit: Decibel Default: 18.0
. The maximum gain at DC (0 Hz) min_q
: float
\u2022 range: (0.0, 1.0] Default: 0.1
. The minimum quality factor Q. The higher the Q, the steeper the transition band will be. max_q
: float
\u2022 range: (0.0, 1.0] Default: 0.999
. The maximum quality factor Q. The higher the Q, the steeper the transition band will be. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/mp3_compression/","title":"Mp3Compression
","text":"Added in v0.12.0
Compress the audio using an MP3 encoder to lower the audio quality. This may help machine learning models deal with compressed, low-quality audio.
This transform depends on either lameenc or pydub/ffmpeg.
Note that bitrates below 32 kbps are only supported for low sample rates (up to 24000 Hz).
Note: When using the \"lameenc\"
backend, the output may be slightly longer than the input because the LAME encoder inserts some silence at the beginning of the audio.
Warning: This transform writes to disk, so it may be slow.
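A minimal usage sketch (the bitrate range is illustrative; remember that bitrates below 32 kbps require low sample rates):
from audiomentations import Mp3Compression\n\ntransform = Mp3Compression(min_bitrate=32, max_bitrate=128, backend=\"pydub\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n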
"},{"location":"waveform_transforms/mp3_compression/#mp3compression-api","title":"Mp3Compression API","text":"min_bitrate
: int
\u2022 unit: kbps \u2022 range: [8, max_bitrate
] Default: 8
. Minimum bitrate in kbps max_bitrate
: int
\u2022 unit: kbps \u2022 range: [min_bitrate
, 320] Default: 64
. Maximum bitrate in kbps backend
: str
\u2022 choices: \"pydub\"
, \"lameenc\"
Default: \"pydub\"
.
\"pydub\"
: May use ffmpeg under the hood. Pro: Seems to avoid introducing latency in the output. Con: Slightly slower than \"lameenc\"
.\"lameenc\"
: Pro: With this backend you can set the quality parameter in addition to the bitrate (although this parameter is not exposed in the audiomentations API yet). Con: Seems to introduce some silence at the start of the audio.p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/normalize/","title":"Normalize
","text":"Added in v0.6.0
Apply a constant amount of gain, so that the highest signal level present in the sound becomes 0 dBFS, i.e. the loudest level allowed if all samples must be between -1 and 1. Also known as peak normalization.
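A minimal usage sketch:
from audiomentations import Normalize\n\ntransform = Normalize(p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n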
"},{"location":"waveform_transforms/normalize/#normalize-api","title":"Normalize API","text":"p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/padding/","title":"Padding
","text":"Added in v0.23.0
Apply padding to the audio signal: take a fraction of the end or the start of the audio and replace that part with padding. This can be useful when training ML models that expect a constant input length and need to be robust to padded inputs.
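A minimal usage sketch with the documented defaults made explicit:
from audiomentations import Padding\n\ntransform = Padding(mode=\"silence\", min_fraction=0.01, max_fraction=0.7, pad_section=\"end\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n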
"},{"location":"waveform_transforms/padding/#padding-api","title":"Padding API","text":"mode
: str
\u2022 choices: \"silence\"
, \"wrap\"
, \"reflect\"
Default: \"silence\"
. Padding mode. min_fraction
: float
\u2022 range: [0.0, 1.0] Default: 0.01
. Minimum fraction of the signal duration to be padded max_fraction
: float
\u2022 range: [0.0, 1.0] Default: 0.7
. Maximum fraction of the signal duration to be padded pad_section
: str
\u2022 choices: \"start\"
, \"end\"
Default: \"end\"
. Which part of the signal should be replaced with padding p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/peaking_filter/","title":"PeakingFilter
","text":"Added in v0.21.0
Apply a biquad peaking filter to the input audio, boosting or cutting a band around a randomized center frequency.
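A minimal usage sketch (the values shown are the documented defaults):
from audiomentations import PeakingFilter\n\ntransform = PeakingFilter(min_center_freq=50.0, max_center_freq=7500.0, min_gain_db=-24.0, max_gain_db=24.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n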
"},{"location":"waveform_transforms/peaking_filter/#peakingfilter-api","title":"PeakingFilter API","text":"min_center_freq
: float
\u2022 unit: hertz \u2022 range: [0.0, \u221e) Default: 50.0
. The minimum center frequency of the peaking filter max_center_freq
: float
\u2022 unit: hertz \u2022 range: [0.0, \u221e) Default: 7500.0
. The maximum center frequency of the peaking filter min_gain_db
: float
\u2022 unit: Decibel Default: -24.0
. The minimum gain at center frequency max_gain_db
: float
\u2022 unit: Decibel Default: 24.0
. The maximum gain at center frequency min_q
: float
\u2022 range: [0.0, \u221e) Default: 0.5
. The minimum quality factor Q. The higher the Q, the steeper the transition band will be. max_q
: float
\u2022 range: [0.0, \u221e) Default: 5.0
. The maximum quality factor Q. The higher the Q, the steeper the transition band will be. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/pitch_shift/","title":"PitchShift
","text":"Added in v0.4.0
Pitch shift the sound up or down without changing the tempo.
Under the hood this does time stretching (by phase vocoding) followed by resampling. Note that phase vocoding can degrade audio quality by \"smearing\" transient sounds, altering the timbre of harmonic sounds, and distorting pitch modulations. This may result in a loss of sharpness, clarity, or naturalness in the transformed audio.
If you need a better sounding pitch shifting method, consider the following alternatives:
Here we pitch down a piano recording by 4 semitones:
Input sound Transformed sound"},{"location":"waveform_transforms/pitch_shift/#usage-example","title":"Usage example","text":"from audiomentations import PitchShift\n\ntransform = PitchShift(\n min_semitones=-5.0,\n max_semitones=5.0,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n
"},{"location":"waveform_transforms/pitch_shift/#pitchshift-api","title":"PitchShift API","text":"min_semitones
: float
\u2022 unit: semitones \u2022 range: [-12.0, 12.0] Default: -4.0
. Minimum semitones to shift. Negative number means shift down. max_semitones
: float
\u2022 unit: semitones \u2022 range: [-12.0, 12.0] Default: 4.0
. Maximum semitones to shift. Positive number means shift up. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/polarity_inversion/","title":"PolarityInversion
","text":"Added in v0.11.0
Flip the audio samples upside-down, reversing their polarity. In other words, multiply the waveform by -1, so negative values become positive, and vice versa. The result will sound the same compared to the original when played back in isolation. However, when mixed with other audio sources, the result may be different. This waveform inversion technique is sometimes used for audio cancellation or obtaining the difference between two waveforms. However, in the context of audio data augmentation, this transform can be useful when training phase-aware machine learning models.
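A minimal usage sketch:
from audiomentations import PolarityInversion\n\ntransform = PolarityInversion(p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n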
"},{"location":"waveform_transforms/polarity_inversion/#polarityinversion-api","title":"PolarityInversion API","text":"p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/post_gain/","title":"PostGain
","text":"Added in v0.31.0
Gain up or down the audio after the given transform (or set of transforms) has processed the audio. There are several methods that determine how the audio should be gained. PostGain
can be useful for compensating for any gain differences introduced by a (set of) transform(s), or for preventing clipping in the output.
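A minimal usage sketch, assuming PostGain is invoked like any other waveform transform; here it wraps a Gain transform and peak-normalizes only if the result would clip:
from audiomentations import Gain, PostGain\n\ntransform = PostGain(transform=Gain(min_gain_db=-6.0, max_gain_db=18.0, p=1.0), method=\"peak_normalize_if_too_loud\")\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n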
transform
: Callable[[NDArray[np.float32], int], NDArray[np.float32]]
A callable to be applied. It should input samples (ndarray), sample_rate (int) and optionally some user-defined keyword arguments. method
: str
\u2022 choices: \"same_rms\"
, \"same_lufs\"
or \"peak_normalize_always\"
This parameter defines the method for choosing the post gain amount.
\"same_rms\"
: The sound gets post-gained so that the RMS (Root Mean Square) of the output matches the RMS of the input.\"same_lufs\"
: The sound gets post-gained so that the LUFS (Loudness Units Full Scale) of the output matches the LUFS of the input.\"peak_normalize_always\"
: The sound gets peak normalized (gained up or down so that the absolute value of the most extreme sample in the output is 1.0)\"peak_normalize_if_too_loud\"
: The sound gets peak normalized if it is too loud (max absolute value greater than 1.0). This option can be useful for avoiding clipping.RepeatPart
","text":"Added in v0.32.0
Select a subsection (or \"part\") of the audio and repeat that part a number of times. This can be useful when simulating scenarios where a short audio snippet gets repeated, for example:
Note that the length of inputs you give it must be compatible with the part duration range and crossfade duration. If you give it an input audio array that is too short, a UserWarning
will be raised and no operation is applied to the signal.
In this speech example, the audio was transformed with
SevenBandParametricEQ
part transform. This is why each repeat in the output has a different timbre.from audiomentations import RepeatPart\n\ntransform = RepeatPart(mode=\"insert\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
from audiomentations import RepeatPart\n\ntransform = RepeatPart(mode=\"replace\", p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/repeat_part/#repeatpart-api","title":"RepeatPart API","text":"min_repeats
: int
\u2022 range: [1, max_repeats
] Default: 1
. Minimum number of times a selected audio segment should be repeated in addition to the original. For instance, if the selected number of repeats is 1, the selected segment will be followed by one repeat. max_repeats
: int
\u2022 range: [min_repeats
, \u221e) Default: 3
. Maximum number of times a selected audio segment can be repeated in addition to the original min_part_duration
: float
\u2022 unit: seconds \u2022 range: [0.00025, max_part_duration
] Default: 0.25
. Minimum duration (in seconds) of the audio segment that can be selected for repetition. max_part_duration
: float
\u2022 unit: seconds \u2022 range: [min_part_duration
, \u221e) Default: 1.2
. Maximum duration (in seconds) of the audio segment that can be selected for repetition. mode
: str
\u2022 choices: \"insert\"
, \"replace\"
Default: \"insert\"
. This parameter has two options:
\"insert\"
: Insert the repeat(s), making the array longer. After the last repeat there will be the last part of the original audio, offset in time compared to the input array.\"replace\"
: Have the repeats replace (as in overwrite) the original audio. Any remaining part at the end (if not overwritten by repeats) will be left untouched without offset. The length of the output array is the same as the input array.crossfade_duration
: float
\u2022 unit: seconds \u2022 range: 0.0 or [0.00025, \u221e) Default: 0.005
. Duration for crossfading between repeated parts as well as potentially from the original audio to the repeats and back. The crossfades will be equal-energy or equal-gain depending on the audio and/or the chosen parameters of the transform. The crossfading feature can be used to smooth transitions and avoid abrupt changes, which can lead to impulses/clicks in the audio. If you know what you're doing, and impulses/clicks are desired for your use case, you can disable the crossfading by setting this value to 0.0
. part_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
An optional callable (audiomentations transform) that gets applied individually to each repeat. This can be used to make each repeat slightly different from the previous one. Note that a part_transform
that makes the part shorter is only supported if the transformed part is at least two times the crossfade duration. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/resample/","title":"Resample
","text":"Added in v0.8.0
Resample signal using librosa.core.resample
To do downsampling only, set both the minimum and the maximum sample rate lower than the original sample rate. Conversely, to do upsampling only, set both higher than the original sample rate.
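For example, this sketch does downsampling only, since both bounds are below the original 48000 Hz:
from audiomentations import Resample\n\ntransform = Resample(min_sample_rate=8000, max_sample_rate=44100, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=48000)\n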
"},{"location":"waveform_transforms/resample/#resample-api","title":"Resample API","text":"min_sample_rate
: int
\u2022 unit: Hz Default: 8000
. Minimum sample rate max_sample_rate
: int
\u2022 unit: Hz Default: 44100
. Maximum sample rate p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/reverse/","title":"Reverse
","text":"Added in v0.18.0
Reverse the audio. Also known as time inversion. Inversion of an audio track along its time axis relates to the random flip of an image, which is an augmentation technique that is widely used in the visual domain. This can be relevant in the context of audio classification. It was successfully applied in the paper AudioCLIP: Extending CLIP to Image, Text and Audio .
"},{"location":"waveform_transforms/reverse/#input-output-example","title":"Input-output example","text":"In this example, we reverse a speech recording
Input sound Transformed sound"},{"location":"waveform_transforms/reverse/#usage-example","title":"Usage example","text":"from audiomentations import Reverse\n\ntransform = Reverse(p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n
"},{"location":"waveform_transforms/reverse/#reverse-api","title":"Reverse API","text":"p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/room_simulator/","title":"RoomSimulator
","text":"Added in v0.23.0
A ShoeBox Room Simulator. Simulates a cuboid of parametrized size and average surface absorption coefficient. It also includes a source and microphones in parametrized locations.
Use it when you want a large number of synthetic room impulse responses with specific configuration characteristics, or simply to quickly add reverb for augmentation purposes.
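A minimal usage sketch that targets a reverberation time instead of absorption coefficients (the values are illustrative; see the API below):
from audiomentations import RoomSimulator\n\ntransform = RoomSimulator(calculation_mode=\"rt60\", min_target_rt60=0.25, max_target_rt60=0.6, max_order=3, leave_length_unchanged=True, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n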
"},{"location":"waveform_transforms/room_simulator/#roomsimulator-api","title":"RoomSimulator API","text":"min_size_x
: float
\u2022 unit: meters Default: 3.6
. Minimum width (x coordinate) of the room in meters max_size_x
: float
\u2022 unit: meters Default: 5.6
. Maximum width of the room in meters min_size_y
: float
\u2022 unit: meters Default: 3.6
. Minimum depth (y coordinate) of the room in meters max_size_y
: float
\u2022 unit: meters Default: 3.9
. Maximum depth of the room in meters min_size_z
: float
\u2022 unit: meters Default: 2.4
. Minimum height (z coordinate) of the room in meters max_size_z
: float
\u2022 unit: meters Default: 3.0
. Maximum height of the room in meters min_absorption_value
: float
Default: 0.075
. Minimum absorption coefficient value. When calculation_mode
is \"absorption\"
it will set the given coefficient value for the surfaces of the room (walls, ceilings, and floor). This coefficient takes values between 0 (fully reflective surface) and 1 (fully absorbing surface).
Example values (may differ!):
Environment Coefficient value Studio with acoustic panels > 0.40 Office / Library ~ 0.15 Factory ~ 0.05max_absorption_value
: float
Default: 0.4
. Maximum absorption coefficient value. See min_absorption_value
for more info. min_target_rt60
: float
\u2022 unit: seconds Default: 0.15
. Minimum target RT60. RT60 is defined as the measure of the time after the sound source ceases that it takes for the sound pressure level to reduce by 60 dB. When calculation_mode
is \"rt60\"
, it tries to set the absorption value of the surfaces of the room to achieve a target RT60 (in seconds). Note that this parameter changes only the materials (absorption coefficients) of the surfaces, not the dimension of the rooms.
Example values (may differ!):
Environment RT60 Recording studio 0.3 s Office 0.5 s Concert hall 1.5 smax_target_rt60
: float
\u2022 unit: seconds Default: 0.8
. Maximum target RT60. See min_target_rt60
for more info. min_source_x
: float
\u2022 unit: meters Default: 0.1
. Minimum x location of the source max_source_x
: float
\u2022 unit: meters Default: 3.5
. Maximum x location of the source min_source_y
: float
\u2022 unit: meters Default: 0.1
. Minimum y location of the source max_source_x
: float
\u2022 unit: meters Default: 2.7
. Maximum y location of the source min_source_z
: float
\u2022 unit: meters Default: 1.0
. Minimum z location of the source max_source_x
: float
\u2022 unit: meters Default: 2.1
. Maximum z location of the source min_mic_distance
: float
\u2022 unit: meters Default: 0.15
. Minimum distance of the microphone from the source in meters max_mic_distance
: float
\u2022 unit: meters Default: 0.35
. Maximum distance of the microphone from the source in meters min_mic_azimuth
: float
\u2022 unit: radians Default: -math.pi
. Minimum azimuth (angle around z axis) of the microphone relative to the source. max_mic_azimuth
: float
\u2022 unit: radians Default: math.pi
. Maximum azimuth (angle around z axis) of the microphone relative to the source. min_mic_elevation
: float
\u2022 unit: radians Default: -math.pi
. Minimum elevation of the microphone relative to the source, in radians. max_mic_elevation
: float
\u2022 unit: radians Default: math.pi
. Maximum elevation of the microphone relative to the source, in radians. calculation_mode
: str
\u2022 choices: \"rt60\"
, \"absorption\"
Default: \"absorption\"
. When set to \"absorption\"
, it will create the room with surfaces based on min_absorption_value
and max_absorption_value
. If set to \"rt60\"
it will try to assign surface materials that lead to a room impulse response with target rt60 given by min_target_rt60
and max_target_rt60
use_ray_tracing
: bool
Default: True
. Whether to use ray tracing or not (slower but much more accurate). Disable this if you need speed and do not mind less accurate results. max_order
: int
\u2022 range: [1, \u221e) Default: 1
. Maximum order of reflections for the Image Source Model. E.g. a value of 1 will only add first order reflections while a value of 12 will add a diffuse reverberation tail.
Warning
Placing this higher than 11-12 will result in a very slow augmentation process when calculation_mode=\"rt60\"
.
Tip
When using calculation_mode=\"rt60\"
, keep it around 3-4
.
leave_length_unchanged
: bool
Default: False
. When set to True, the tail of the sound (e.g. reverb at the end) will be chopped off so that the length of the output is equal to the length of the input. padding
: float
\u2022 unit: meters Default: 0.1
. Minimum distance in meters between source or mic and the room walls, floor or ceiling. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform. ray_tracing_options
: Optional[Dict]
Default: None
. Options for the ray tracer. See set_ray_tracing
here: https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/room.py"},{"location":"waveform_transforms/seven_band_parametric_eq/","title":"SevenBandParametricEQ
","text":"Added in v0.24.0
Adjust the volume of different frequency bands. This transform is a 7-band parametric equalizer - a combination of one low shelf filter, five peaking filters and one high shelf filter, all with randomized gains, Q values and center frequencies.
Because this transform changes the timbre, but keeps the overall \"class\" of the sound the same (depending on application), it can be used for data augmentation to make ML models more robust to various frequency spectrums. Many things can affect the spectrum, for example:
The seven bands have center frequencies picked within predefined per-band ranges.
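A minimal usage sketch (the gain range shown is the documented default):
from audiomentations import SevenBandParametricEQ\n\ntransform = SevenBandParametricEQ(min_gain_db=-12.0, max_gain_db=12.0, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=44100)\n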
min_gain_db
: float
\u2022 unit: Decibel Default: -12.0
. Minimum number of dB to cut or boost a band max_gain_db
: float
\u2022 unit: decibel Default: 12.0
. Maximum number of dB to cut or boost a band p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/shift/","title":"Shift
","text":"Added in v0.5.0
Shift the samples forwards or backwards, with or without rollover
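A minimal usage sketch, using the v0.33.0+ API described below (the values are illustrative):
from audiomentations import Shift\n\ntransform = Shift(min_shift=-0.25, max_shift=0.25, shift_unit=\"fraction\", rollover=True, p=1.0)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n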
"},{"location":"waveform_transforms/shift/#shift-api","title":"Shift API","text":"This only applies to version 0.33.0 and newer. If you are using an older version, you should consider upgrading. Or if you really want to keep using the old version, you can check the \"Old Shift API (<=v0.32.0)\" section below
min_shift
: float | int
Default: -0.5
. Minimum amount of shifting in time. See also shift_unit
. max_shift
: float | int
Default: 0.5
. Maximum amount of shifting in time. See also shift_unit
. shift_unit
: str
\u2022 choices: \"fraction\"
, \"samples\"
, \"seconds\"
Default: \"fraction\"
Defines the unit of the value of min_shift
and max_shift
.
\"fraction\"
: Fraction of the total sound length\"samples\"
: Number of audio samples\"seconds\"
: Number of secondsrollover
: bool
Default: True
. When set to True
, samples that roll beyond the first or last position are re-introduced at the last or first. When set to False
, samples that roll beyond the first or last position are discarded. In other words, rollover=False
results in an empty space (with zeroes). fade_duration
: float
\u2022 unit: seconds \u2022 range: 0.0 or [0.00025, \u221e) Default: 0.005
. If you set this to a positive number, there will be a fade in and/or out at the \"stitch\" (that was the start or the end of the audio before the shift). This can smooth out an unwanted abrupt change between two consecutive samples (which sounds like a transient/click/pop). This parameter denotes the duration of the fade in seconds. To disable the fading feature, set this parameter to 0.0. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/shift/#old-shift-api-v0320","title":"Old Shift API (<=v0.32.0)","text":"This only applies to version 0.32.0 and older
min_fraction
: float
\u2022 range: [-1, 1] Default: -0.5
. Minimum fraction of total sound length to shift. max_fraction
: float
\u2022 range: [-1, 1] Default: 0.5
. Maximum fraction of total sound length to shift. rollover
: bool
Default: True
. When set to True
, samples that roll beyond the first or last position are re-introduced at the last or first. When set to False
, samples that roll beyond the first or last position are discarded. In other words, rollover=False
results in an empty space (with zeroes). fade
: bool
Default: False
. When set to True
, there will be a short fade in and/or out at the \"stitch\" (that was the start or the end of the audio before the shift). This can smooth out an unwanted abrupt change between two consecutive samples (which sounds like a transient/click/pop). fade_duration
: float
\u2022 unit: seconds Default: 0.01
. If fade=True
, then this is the duration of the fade in seconds. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/tanh_distortion/","title":"TanhDistortion
","text":"Added in v0.19.0
Apply tanh (hyperbolic tangent) distortion to the audio. This technique is sometimes used for adding distortion to guitar recordings. The tanh() function can give a rounded \"soft clipping\" kind of distortion, and the distortion amount is proportional to the loudness of the input and the pre-gain. Tanh is symmetric, so the positive and negative parts of the signal are squashed in the same way. This transform can be useful as data augmentation because it adds harmonics. In other words, it changes the timbre of the sound.
See this page for examples: http://gdsp.hf.ntnu.no/lessons/3/17/
"},{"location":"waveform_transforms/tanh_distortion/#input-output-example","title":"Input-output example","text":"In this example we apply tanh distortion with the \"distortion amount\" (think of it as a knob that goes from 0 to 1) set to 0.25
Input sound Transformed sound"},{"location":"waveform_transforms/tanh_distortion/#usage-example","title":"Usage example","text":"from audiomentations import TanhDistortion\n\ntransform = TanhDistortion(\n min_distortion=0.01,\n max_distortion=0.7,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/tanh_distortion/#tanhdistortion-api","title":"TanhDistortion API","text":"min_distortion
: float
\u2022 range: [0.0, 1.0] Default: 0.01
. Minimum \"amount\" of distortion to apply to the signal. max_distortion
: float
\u2022 range: [0.0, 1.0] Default: 0.7
. Maximum \"amount\" of distortion to apply to the signal. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/time_mask/","title":"TimeMask
","text":"Added in v0.7.0
Make a randomly chosen part of the audio silent. Inspired by https://arxiv.org/pdf/1904.08779.pdf
"},{"location":"waveform_transforms/time_mask/#input-output-example","title":"Input-output example","text":"Here we silence a part of a speech recording.
Input sound Transformed sound"},{"location":"waveform_transforms/time_mask/#usage-example","title":"Usage example","text":"from audiomentations import TimeMask\n\ntransform = TimeMask(\n min_band_part=0.1,\n max_band_part=0.15,\n fade=True,\n p=1.0,\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/time_mask/#timemask-api","title":"TimeMask API","text":"min_band_part
: float
\u2022 range: [0.0, 1.0] Default: 0.0
. Minimum length of the silent part as a fraction of the total sound length. max_band_part
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. Maximum length of the silent part as a fraction of the total sound length. fade
: bool
Default: False
. When set to True
, add a linear fade in and fade out of the silent part. This can smooth out an unwanted abrupt change between two consecutive samples (which sounds like a transient/click/pop). p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/time_stretch/","title":"TimeStretch
","text":"Added in v0.2.0
Change the speed or duration of the signal without changing the pitch. This transform employs librosa.effects.time_stretch
under the hood to achieve the effect.
This is based on phase vocoding. Note that phase vocoding can degrade audio quality by "smearing" transient sounds, altering the timbre of harmonic sounds, and distorting pitch modulations. This may result in a loss of sharpness, clarity, or naturalness in the transformed audio, especially when the rate is set to an extreme value.
If you need a better sounding time stretch method, consider the following alternatives:
In this example we speed up a sound by 25%. This corresponds to a rate of 1.25.
Input sound Transformed sound"},{"location":"waveform_transforms/time_stretch/#usage-example","title":"Usage example","text":"from audiomentations import TimeStretch\n\ntransform = TimeStretch(\n min_rate=0.8,\n max_rate=1.25,\n leave_length_unchanged=True,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/time_stretch/#timestretch-api","title":"TimeStretch API","text":"min_rate
: float
\u2022 range: [0.1, 10.0] Default: 0.8
. Minimum rate of change of total duration of the signal. A rate below 1 means the audio is slowed down. max_rate
: float
\u2022 range: [0.1, 10.0] Default: 1.25
. Maximum rate of change of total duration of the signal. A rate greater than 1 means the audio is sped up. leave_length_unchanged
: bool
Default: True
. The rate changes the duration and affects the samples. This flag is used to keep the total length of the generated output the same as that of the input signal. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."},{"location":"waveform_transforms/trim/","title":"Trim
","text":"Added in v0.7.0
Trim leading and trailing silence from an audio signal using librosa.effects.trim
. It considers threshold (in decibels) below reference defined in parameter top_db
as silence.
In this example we remove silence from the start and end, using the default top_db parameter value
Input sound Transformed sound"},{"location":"waveform_transforms/trim/#usage-example","title":"Usage example","text":"from audiomentations import Trim\n\ntransform = Trim(\n top_db=30.0,\n p=1.0\n)\n\naugmented_sound = transform(my_waveform_ndarray, sample_rate=16000)\n
"},{"location":"waveform_transforms/trim/#trim-api","title":"Trim API","text":"top_db
: float
\u2022 unit: Decibel Default: 30.0
. The threshold value (in decibels) below which to consider silence and trim. p
: float
\u2022 range: [0.0, 1.0] Default: 0.5
. The probability of applying this transform."}]}
\ No newline at end of file
diff --git a/sitemap.xml b/sitemap.xml
new file mode 100644
index 00000000..aabc46f2
--- /dev/null
+++ b/sitemap.xml
@@ -0,0 +1,238 @@
+
+audiomentations is in a very early (read: not very useful yet) stage when it comes to spectrogram transforms. Consider applying waveform transforms before converting your waveforms to spectrograms, or check out alternative libraries
+SpecChannelShuffle
Added in v0.13.0
+Shuffle the channels of a multichannel spectrogram. This can help combat positional bias.
+SpecFrequencyMask
Added in v0.13.0
+Mask a set of frequencies in a spectrogram, à la Google AI SpecAugment. This type of data augmentation has proved to make speech recognition models more robust.
+The masked frequencies can be replaced with either the mean of the original values or a given constant (e.g. zero).
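+A minimal usage sketch for these two transforms, assuming they are applied directly to a
+magnitude spectrogram ndarray and composed with SpecCompose (the variable name below is a
+placeholder for your own spectrogram):
+from audiomentations import SpecCompose, SpecChannelShuffle, SpecFrequencyMask
+
+transform = SpecCompose(
+    [SpecChannelShuffle(p=0.5), SpecFrequencyMask(p=0.5)]
+)
+
+augmented_spectrogram = transform(my_magnitude_spectrogram_ndarray)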
Added in v0.9.0
+Mix in another sound, e.g. a background noise. Useful if your original sound is clean and +you want to simulate an environment where background noise is present.
+Can also be used for mixup when training +classification/annotation models.
+A path to a file/folder with sound(s), or a list of file/folder paths, must be +specified. These sounds should ideally be at least as long as the input sounds to be +transformed. Otherwise, the background sound will be repeated, which may sound unnatural.
+Note that in the default case (noise_rms="relative"
) the gain of the added noise is
+relative to the amount of signal in the input. This implies that if the input is
+completely silent, no noise will be added.
Optionally, the added noise sound can be transformed (with noise_transform
) before it gets mixed in.
Here are some examples of datasets that can be downloaded and used as background noise:
+Here we add some music to a speech recording, targeting a signal-to-noise ratio (SNR) of 5 decibels (dB), which means that the speech (signal) is 5 dB louder than the music (noise).
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AddBackgroundNoise, PolarityInversion
+
+transform = AddBackgroundNoise(
+ sounds_path="/path/to/folder_with_sound_files",
+    min_snr_db=3.0,
+    max_snr_db=30.0,
+ noise_transform=PolarityInversion(),
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
from audiomentations import AddBackgroundNoise, PolarityInversion
+
+transform = AddBackgroundNoise(
+ sounds_path="/path/to/folder_with_sound_files",
+ noise_rms="absolute",
+    min_absolute_rms_db=-45.0,
+    max_absolute_rms_db=-15.0,
+ noise_transform=PolarityInversion(),
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
sounds_path
: Union[List[Path], List[str], Path, str]
min_snr_db
: float
• unit: Decibel3.0
. Minimum signal-to-noise ratio in dB. Is only
+used if noise_rms
is set to "relative"
max_snr_db
: float
• unit: Decibel30.0
. Maximum signal-to-noise ratio in dB. Is
+only used if noise_rms
is set to "relative"
min_snr_in_db
: float
• unit: Decibelmin_snr_db
insteadmax_snr_in_db
: float
• unit: Decibelmax_snr_db
insteadnoise_rms
: str
• choices: "absolute"
, "relative"
"relative"
. Defines how the background noise will
+be added to the audio input. If the chosen option is "relative"
, the root mean
+square (RMS) of the added noise will be proportional to the RMS of the input sound.
+If the chosen option is "absolute"
, the background noise will have an RMS
+independent of the RMS of the input audio file.
min_absolute_rms_db
: float
• unit: Decibel-45.0
. Is only used if noise_rms
is set to
+"absolute"
. It is the minimum RMS value in dB that the added noise can take. The
+lower the RMS is, the lower the added sound will be.max_absolute_rms_db
: float
• unit: Decibel-15.0
. Is only used if noise_rms
is set to
+"absolute"
. It is the maximum RMS value in dB that the added noise can take. Note
+that this value can not exceed 0.min_absolute_rms_in_db
: float
• unit: Decibelmin_absolute_rms_db
insteadmax_absolute_rms_in_db
: float
• unit: Decibelmax_absolute_rms_in_db
insteadnoise_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
None
. A callable waveform transform (or
+composition of transforms) that gets applied to the noise before it gets mixed in.
+The callable is expected to input audio waveform (numpy array) and sample rate (int).p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.lru_cache_size
: int
2
. Maximum size of the LRU cache for storing noise files in memoryAddColorNoise
To be added in v0.35.0
+Mix in noise with color, optionally weighted by an A-weighting curve. When
+f_decay=0
, this is equivalent to AddGaussianNoise
. Otherwise, see: Colors of Noise .
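+Since this page does not yet include a usage example, here is a minimal illustrative sketch; the parameter names and values mirror the API documentation below:
from audiomentations import AddColorNoise
+
+# Illustrative values, mirroring the documented defaults
+transform = AddColorNoise(
+    min_snr_db=5.0,
+    max_snr_db=40.0,
+    min_f_decay=-6.0,
+    max_f_decay=6.0,
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+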
min_snr_db
: float
• unit: Decibel5.0
. Minimum signal-to-noise ratio in dB. A lower
+number means more noise.max_snr_db
: float
• unit: Decibel40.0
. Maximum signal-to-noise ratio in dB. A
+greater number means less noise.min_f_decay
: float
• unit: Decibels/octave-6.0
. Minimum noise decay in dB per octave.max_f_decay
: float
• unit: Decibels/octave6.0
. Maximum noise decay in dB per octave.Those values can be chosen from the following table:
+Colour | +f_decay (db/octave) |
+
---|---|
pink | +-3.01 | +
brown/brownian | +-6.02 | +
red | +-6.02 | +
blue | +3.01 | +
azure | +3.01 | +
violet | +6.02 | +
white | +0.0 | +
See Colors of noise on Wikipedia about those values.
+p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.p_apply_a_weighting
: float
• range: [0.0, 1.0]0.0
. The probability of additionally weighting the transform using an A-weighting
curve.n_fft
: int
128
. The number of points the decay curve is computed (for coloring white noise).AddGaussianNoise
Added in v0.1.0
+Add Gaussian noise to the samples.
+Here we add some gaussian noise (with amplitude 0.01) to a speech recording.
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AddGaussianNoise
+
+transform = AddGaussianNoise(
+ min_amplitude=0.001,
+ max_amplitude=0.015,
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
min_amplitude
: float
• unit: linear amplitude0.001
. Minimum noise amplification factor.max_amplitude
: float
• unit: linear amplitude0.015
. Maximum noise amplification factor.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.AddGaussianSNR
Added in v0.7.0
+The AddGaussianSNR
transform injects Gaussian noise into an audio signal. It applies
+a Signal-to-Noise Ratio (SNR) that is chosen randomly from a uniform distribution on the
+decibel scale. This choice is consistent with the nature of human hearing, which is
+logarithmic rather than linear.
SNR is a common measure used in science and engineering to compare the level of a +desired signal to the level of noise. In the context of audio, the signal is the +meaningful sound that you're interested in, like a person's voice, music, or other +audio content, while the noise is unwanted sound that can interfere with the signal.
+The SNR quantifies the ratio of the power of the signal to the power of the noise. The +higher the SNR, the less the noise is present in relation to the signal.
+Gaussian noise, a kind of white noise, is a type of statistical noise where the +amplitude of the noise signal follows a Gaussian distribution. This means that most of +the samples are close to the mean (zero), and fewer of them are farther away. It's +called Gaussian noise due to its characteristic bell-shaped Gaussian distribution.
+Gaussian noise is similar to the sound of a radio or TV tuned to a nonexistent station: +a kind of constant, uniform hiss or static.
+Here we add some gaussian noise (with SNR = 16 dB) to a speech recording.
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AddGaussianSNR
+
+transform = AddGaussianSNR(
+ min_snr_db=5.0,
+ max_snr_db=40.0,
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
min_snr_db
: float
• unit: Decibel5.0
. Minimum signal-to-noise ratio in dB. A lower
+number means more noise.max_snr_db
: float
• unit: decibel40.0
. Maximum signal-to-noise ratio in dB. A
+greater number means less noise.min_snr_in_db
: float
• unit: Decibelmin_snr_db
insteadmax_snr_in_db
: float
• unit: decibelmax_snr_db
insteadp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.AddShortNoises
Added in v0.9.0
+Mix in various (bursts of overlapping) sounds with random pauses between. Useful if your +original sound is clean and you want to simulate an environment where short noises sometimes +occur.
+A folder of (noise) sounds to be mixed in must be specified.
+Here we add some short noise sounds to a voice recording.
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AddShortNoises, PolarityInversion
+
+transform = AddShortNoises(
+ sounds_path="/path/to/folder_with_sound_files",
+    min_snr_db=3.0,
+    max_snr_db=30.0,
+ noise_rms="relative_to_whole_input",
+ min_time_between_sounds=2.0,
+ max_time_between_sounds=8.0,
+ noise_transform=PolarityInversion(),
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
from audiomentations import AddShortNoises, PolarityInversion
+
+transform = AddShortNoises(
+ sounds_path="/path/to/folder_with_sound_files",
+ min_absolute_noise_rms_db=-50.0,
+ max_absolute_noise_rms_db=-20.0,
+ noise_rms="absolute",
+ min_time_between_sounds=2.0,
+ max_time_between_sounds=8.0,
+ noise_transform=PolarityInversion(),
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
sounds_path
: Union[List[Path], List[str], Path, str]
min_snr_in_db
: float
• unit: Decibelmin_snr_db
insteadmax_snr_in_db
: float
• unit: Decibelmax_snr_db
insteadmin_snr_db
: float
• unit: Decibel-6.0
. Minimum signal-to-noise ratio in dB. A lower
+value means the added sounds/noises will be louder. This gets ignored if noise_rms
+is set to "absolute"
.max_snr_db
: float
• unit: Decibel18.0
. Maximum signal-to-noise ratio in dB. A
+lower value means the added sounds/noises will be louder. This gets ignored if
+noise_rms
is set to "absolute"
.min_time_between_sounds
: float
• unit: seconds2.0
. Minimum pause time (in seconds) between the
+added sounds/noisesmax_time_between_sounds
: float
• unit: seconds8.0
. Maximum pause time (in seconds) between the
+added sounds/noisesnoise_rms
: str
• choices: "absolute"
, "relative"
, "relative_to_whole_input"
Default: "relative"
(<=v0.27), but will be changed to
+"relative_to_whole_input"
in a future version.
This parameter defines how the noises will be added to the audio input.
+"relative"
: the RMS value of the added noise will be proportional to the RMS value of
+ the input sound calculated only for the region where the noise is added."absolute"
: the added noises will have an RMS independent of the RMS of the input audio
+ file."relative_to_whole_input"
: the RMS of the added noises will be
+ proportional to the RMS of the whole input sound.min_absolute_noise_rms_db
: float
• unit: Decibel-50.0
. Is only used if noise_rms
is set to
+"absolute"
. It is the minimum RMS value in dB that the added noise can take. The
+lower the RMS is, the lower the added sound will be.max_absolute_noise_rms_db
: float
• unit: Decibel-20.0
. Is only used if noise_rms
is set to
+"absolute"
. It is the maximum RMS value in dB that the added noise can take. Note
+that this value can not exceed 0.add_all_noises_with_same_level
: bool
False
. Whether to add all the short noises
+(within one audio snippet) with the same SNR. If noise_rms
is set to "absolute"
,
+the RMS is used instead of SNR. The target SNR (or RMS) will change every time the
+parameters of the transform are randomized.include_silence_in_noise_rms_estimation
: bool
True
. It chooses how the RMS of
+the noises to be added will be calculated. If this option is set to False, the silence
+in the noise files will be disregarded in the RMS calculation. It is useful for
+non-stationary noises where silent periods occur.burst_probability
: float
0.22
. For every noise that gets added, there
+is a probability of adding an extra burst noise that overlaps with the noise. This
+parameter controls that probability. min_pause_factor_during_burst
and
+max_pause_factor_during_burst
control the amount of overlap.min_pause_factor_during_burst
: float
0.1
. Min value of how far into the current sound (as
+fraction) the burst sound should start playing. The value must be greater than 0.max_pause_factor_during_burst
: float
1.1
. Max value of how far into the current sound (as
+fraction) the burst sound should start playing. The value must be greater than 0.min_fade_in_time
: float
• unit: seconds0.005
. Min noise fade in time in seconds. Use a
+value larger than 0 to avoid a "click" at the start of the noise.max_fade_in_time
: float
• unit: seconds0.08
. Max noise fade in time in seconds. Use a
+value larger than 0 to avoid a "click" at the start of the noise.min_fade_out_time
: float
• unit: seconds0.01
. Min sound/noise fade out time in seconds.
+Use a value larger than 0 to avoid a "click" at the end of the sound/noise.max_fade_out_time
: float
• unit: seconds0.1
. Max sound/noise fade out time in seconds.
+Use a value larger than 0 to avoid a "click" at the end of the sound/noise.signal_gain_in_db_during_noise
: float
• unit: Decibelsignal_gain_db_during_noise
insteadsignal_gain_db_during_noise
: float
• unit: Decibel Default: 0.0
. Gain applied to the signal during a short noise.
+When fading the signal to the custom gain, the same fade times are used as
+for the noise, so it's essentially cross-fading. The default value (0.0) means
+the signal will not be gained. If set to a very low value, e.g. -100.0, this
+feature could be used for completely replacing the signal with the noise.
+This could be relevant in some use cases, for example:
noise_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
None
. A callable waveform transform (or
+composition of transforms) that gets applied to noises before they get mixed in.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.lru_cache_size
: int
64
. Maximum size of the LRU cache for storing
+noise files in memoryAdjustDuration
Added in v0.30.0
+Trim or pad the audio to the specified length/duration in samples or seconds. If the +input sound is longer than the target duration, pick a random offset and crop the +sound to the target duration. If the input sound is shorter than the target +duration, pad the sound so the duration matches the target duration.
+This transform can be useful if you need audio with constant length, e.g. as input to a +machine learning model. The reason for varying audio clip lengths can be e.g.
+Here we input an audio clip and remove a part of the start and the end, so the length of the result matches the specified target length.
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AdjustDuration
+
+transform = AdjustDuration(duration_samples=60000, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
from audiomentations import AdjustDuration
+
+transform = AdjustDuration(duration_seconds=3.75, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
duration_samples
: int
• range: [0, ∞)duration_seconds
: float
• range: [0.0, ∞)padding_mode
: str
• choices: "silence"
, "wrap"
, "reflect"
"silence"
. Padding mode. Only used when audio input is shorter than the target duration.padding_position
: str
• choices: "start"
, "end"
"end"
. The position of the inserted/added padding. Only used when audio input is shorter than the target duration.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.AirAbsorption
Added in v0.25.0
+A lowpass-like filterbank with variable octave attenuation that simulates attenuation of +high frequencies due to air absorption. This transform is parametrized by temperature, +humidity, and the distance between audio source and microphone.
+This is not a scientifically accurate transform but basically applies a uniform +filterbank with attenuations given by:
+att = exp(- distance * absorption_coefficient)
where distance
is the microphone-source assumed distance in meters and absorption_coefficient
+is adapted from a lookup table by pyroomacoustics.
+It can also be seen as a lowpass filter with variable octave attenuation.
Note that since this transform mostly affects high frequencies, it is only +suitable for audio with sufficiently high sample rate, like 32 kHz and above.
+Note also that this transform only "simulates" the dampening of high frequencies, and +does not attenuate according to the distance law. Gain augmentation needs to be done +separately.
+Here we input a high-quality speech recording and apply AirAbsorption
with an air
+temperature of 20 degrees celsius, 70% humidity and a distance of 20 meters. One can see
+clearly in the spectrogram that the highs, especially above ~13 kHz, are rolled off in
+the output, but it may require a quiet room and some concentration to
+hear it clearly in the audio comparison.
Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import AirAbsorption
+
+transform = AirAbsorption(
+ min_distance=10.0,
+ max_distance=50.0,
+ p=1.0,
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+
min_temperature
: float
• unit: Celsius • choices: [10.0, 20.0]10.0
. Minimum temperature in Celsius (can take a value of either 10.0 or 20.0)max_temperature
: float
• unit: Celsius • choices: [10.0, 20.0]20.0
. Maximum temperature in Celsius (can take a value of either 10.0 or 20.0)min_humidity
: float
• unit: percent • range: [30.0, 90.0]30.0
. Minimum humidity in percent (between 30.0 and 90.0)max_humidity
: float
• unit: percent • range: [30.0, 90.0]90.0
. Maximum humidity in percent (between 30.0 and 90.0)min_distance
: float
• unit: meters10.0
. Minimum microphone-source distance in meters.max_distance
: float
• unit: meters100.0
. Maximum microphone-source distance in meters.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Aliasing
To be added in v0.35.0
+Downsample the audio to a lower sample rate by linear interpolation, without low-pass +filtering it first, resulting in aliasing artifacts. You get aliasing artifacts when +there is high-frequency audio in the input audio that falls above the nyquist frequency +of the chosen target sample rate. Audio with frequencies above the nyquist frequency +cannot be reproduced accurately and get "reflected"/mirrored to other frequencies. The +aliasing artifacts "replace" the original high frequency signals. The result can be +described as coarse and metallic.
+After the downsampling, the signal gets upsampled to the original signal again, so the +length of the output becomes the same as the length of the input.
+For more information, see
+Here we target a sample rate of 12000 Hz. Note the vertical mirroring in the spectrogram in the transformed sound.
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import Aliasing
+
+transform = Aliasing(min_sample_rate=8000, max_sample_rate=30000, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=44100)
+
min_sample_rate
: int
• unit: Hz • range: [2, ∞)max_sample_rate
: int
• unit: Hz • range: [2, ∞)p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.ApplyImpulseResponse
Added in v0.7.0
+This transform convolves the audio with a randomly selected (room) impulse response file.
+ApplyImpulseResponse
is commonly used as a data augmentation technique that adds
+realistic-sounding reverb to recordings. This can for example make denoisers and speech
+recognition systems more robust to different acoustic environments and distances between
+the sound source and the microphone. It could also be used to generate roomy audio
+examples for the training of dereverberation models.
Convolution with an impulse response is a powerful technique in signal processing that +can be employed to emulate the acoustic characteristics of specific environments or +devices. This process can transform a dry recording, giving it the sonic signature of +being played in a specific location or through a particular device.
+What is an impulse response? An impulse response (IR) captures the unique acoustical +signature of a space or object. It's essentially a recording of how a specific +environment or system responds to an impulse (a short, sharp sound). By convolving +an audio signal with an impulse response, we can simulate how that signal would sound in +the captured environment.
+Note that some impulse responses, especially those captured in larger spaces or from +specific equipment, can introduce a noticeable delay when convolved with an audio +signal. In some applications, this delay is a desirable property. However, in some other +applications, the convolved audio should not have a delay compared to the original +audio. If this is the case for you, you can align the audio afterwards with +fast-align-audio , for example.
+Impulse responses can be created using e.g. http://tulrich.com/recording/ir_capture/
+Some datasets of impulse responses are publicly available:
+Impulse responses are represented as audio (ideally wav) files in the given ir_path
.
Another thing worth checking is that your IR files have the same sample rate as your +audio inputs. Why? Because if they have different sample rates, the internal resampling +will slow down execution, and because some high frequencies may get lost.
+Here we make a dry speech recording quite reverbant by convolving it with a room impulse response
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import ApplyImpulseResponse
+
+transform = ApplyImpulseResponse(ir_path="/path/to/sound_folder", p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+
ir_path
: Union[List[Path], List[str], str, Path]
str
or Path
instance(s). The audio files given here are
+supposed to be (room) impulse responses.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.lru_cache_size
: int
128
. Maximum size of the LRU cache for storing
+impulse response files in memory.leave_length_unchanged
: bool
True
. When set to True
, the tail of the sound
+(e.g. reverb at the end) will be chopped off so that the length of the output is
+equal to the length of the input.BandPassFilter
Added in v0.18.0, updated in v0.21.0
+Apply band-pass filtering to the input audio. Filter steepness (6/12/18... dB / octave) +is parametrized. Can also be set for zero-phase filtering (will result in a 6 dB drop at +cutoffs).
+Here we input a high-quality speech recording and apply BandPassFilter
with a center
+frequency of 2500 Hz and a bandwidth fraction of 0.8, which means that the bandwidth in
+this example is 2000 Hz, so the low frequency cutoff is 1500 Hz and the high frequency
+cutoff is 3500 Hz. One can see in the spectrogram that the high and the low frequencies
+are both attenuated in the output. If you listen to the audio example, you might notice
+that the transformed output almost sounds like a phone call from the time when
+phone audio was narrowband and mostly contained frequencies between ~300 and ~3400 Hz.
Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import BandPassFilter
+
+transform = BandPassFilter(min_center_freq=100.0, max_center_freq=6000.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+
min_center_freq
: float
• unit: hertz200.0
. Minimum center frequency in hertzmax_center_freq
: float
• unit: hertz4000.0
. Maximum center frequency in hertzmin_bandwidth_fraction
: float
• range: [0.0, 2.0]0.5
. Minimum bandwidth relative to center frequencymax_bandwidth_fraction
: float
• range: [0.0, 2.0]1.99
. Maximum bandwidth relative to center frequencymin_rolloff
: float
• unit: Decibels/octave12
. Minimum filter roll-off (in dB/octave).
+Must be a multiple of 6max_rolloff
: float
• unit: Decibels/octave24
. Maximum filter roll-off (in dB/octave)
+Must be a multiple of 6zero_phase
: bool
False
. Whether filtering should be zero phase.
+When this is set to True
it will not affect the phase of the input signal but will
+sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB
+vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If
+you absolutely want no phase distortions (e.g. want to augment an audio file with
+lots of transients, like a drum track), set this to True
.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.BandStopFilter
Added in v0.21.0
+Apply band-stop filtering to the input audio. Also known as notch filter or +band reject filter. It relates to the frequency mask idea in the SpecAugment paper . +Center frequency gets picked in mel space, so it is somewhat aligned with human hearing, +which is not linear. Filter steepness (6/12/18... dB / octave) is parametrized. Can also +be set for zero-phase filtering (will result in a 6 dB drop at cutoffs).
+Applying band-stop filtering as data augmentation during model training can aid in +preventing overfitting to specific frequency relationships, helping to make the model +robust to diverse audio environments and scenarios, where frequency losses can occur.
+Here we input a speech recording and apply BandStopFilter
with a center
+frequency of 2500 Hz and a bandwidth fraction of 0.8, which means that the bandwidth in
+this example is 2000 Hz, so the low frequency cutoff is 1500 Hz and the high frequency
+cutoff is 3500 Hz. One can see in the spectrogram of the transformed sound that the band
+stop filter has attenuated this frequency range. If you listen to the audio example, you
+can hear that the timbre is different in the transformed sound than in the original.
Input sound | +Transformed sound | +
---|---|
+ | + |
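+A minimal usage sketch (illustrative only; parameter names and defaults are taken from the API documentation below):
from audiomentations import BandStopFilter
+
+transform = BandStopFilter(min_center_freq=200.0, max_center_freq=4000.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+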
min_center_freq
: float
• unit: hertz200.0
. Minimum center frequency in hertzmax_center_freq
: float
• unit: hertz4000.0
. Maximum center frequency in hertzmin_bandwidth_fraction
: float
0.5
. Minimum bandwidth relative to center frequencymax_bandwidth_fraction
: float
1.99
. Maximum bandwidth relative to center frequencymin_rolloff
: float
• unit: Decibels/octave12
. Minimum filter roll-off (in dB/octave).
+Must be a multiple of 6max_rolloff
: float
• unit: Decibels/octave24
. Maximum filter roll-off (in dB/octave)
+Must be a multiple of 6zero_phase
: bool
False
. Whether filtering should be zero phase.
+When this is set to True
it will not affect the phase of the input signal but will
+sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB
+vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If
+you absolutely want no phase distortions (e.g. want to augment an audio file with
+lots of transients, like a drum track), set this to True
.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.BitCrush
To be added in v0.35.0
+Apply a bit crush effect to the audio by reducing the bit depth. In other words, it +reduces the number of bits that can be used for representing each audio sample. +This adds quantization noise, and affects dynamic range. This transform does not apply +dithering.
+For more information, see
+Here we reduce the bit depth from 16 to 6 bits per sample
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import BitCrush
+
+transform = BitCrush(min_bit_depth=5, max_bit_depth=14, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
min_bit_depth
: int
• unit: bits • range: [1, 32]max_bit_depth
: int
• unit: bits • range: [1, 32]p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Clip
Added in v0.17.0
+Clip audio by specified values, e.g. set a_min=-1.0
and a_max=1.0
to ensure that no
+samples in the audio exceed that extent. This can be relevant for avoiding integer
+overflow or underflow (which results in unintended wrap distortion that can sound
+horrible) when exporting to e.g. 16-bit PCM wav.
Another way of ensuring that all values stay between -1.0 and 1.0 is to apply
+PeakNormalization
.
This transform is different from ClippingDistortion
in that it takes fixed values
+for clipping instead of clipping a random percentile of the samples. Arguably, this
+transform is not very useful for data augmentation. Instead, think of it as a very
+cheap and harsh limiter (for samples that exceed the allotted extent) that can
+sometimes be useful at the end of a data augmentation pipeline.
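+For illustration, here is a minimal sketch, assuming the a_min/a_max parameters mentioned above and a p parameter like in the other transforms:
from audiomentations import Clip
+
+# a_min/a_max as mentioned above; p is assumed to work as in other transforms
+transform = Clip(a_min=-1.0, a_max=1.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+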
ClippingDistortion
Added in v0.8.0
+Distort signal by clipping a random percentage of points
+The percentage of points that will be clipped is drawn from a uniform distribution between
+the two input parameters min_percentile_threshold
and max_percentile_threshold
. If for instance
+30% is drawn, the samples are clipped if they're below the 15th or above the 85th percentile.
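+An illustrative usage sketch, with values borrowed from the documented defaults below:
from audiomentations import ClippingDistortion
+
+transform = ClippingDistortion(
+    min_percentile_threshold=0,
+    max_percentile_threshold=40,
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+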
min_percentile_threshold
: int
0
. A lower bound on the total percent of samples
+that will be clippedmax_percentile_threshold
: int
40
. An upper bound on the total percent of
+samples that will be clippedp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Gain
Added in v0.11.0
+Multiply the audio by a random amplitude factor to reduce or increase the volume. This +technique can help a model become somewhat invariant to the overall gain of the input audio.
+Warning: This transform can return samples outside the [-1, 1] range, which may lead to +clipping or wrap distortion, depending on what you do with the audio in a later stage. +See also https://en.wikipedia.org/wiki/Clipping_(audio)#Digital_clipping
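+A minimal usage sketch (illustrative; the gain range mirrors the documented defaults below):
from audiomentations import Gain
+
+transform = Gain(min_gain_db=-12.0, max_gain_db=12.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+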
+min_gain_in_db
: float
• unit: Decibelmin_gain_db
insteadmax_gain_in_db
: float
• unit: Decibelmax_gain_db
insteadmin_gain_db
: float
• unit: Decibel-12.0
. Minimum gain.max_gain_db
: float
• unit: Decibel12.0
. Maximum gain.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.GainTransition
Added in v0.22.0
+Gradually change the volume up or down over a random time span. Also known as +fade in and fade out. The fade works on a logarithmic scale, which is natural to +human hearing.
+The way this works is that it picks two gains: a first gain and a second gain. +Then it picks a time range for the transition between those two gains. +Note that this transition can start before the audio starts and/or end after the +audio ends, so the output audio can start or end in the middle of a transition. +The gain starts at the first gain and is held constant until the transition start. +Then it transitions to the second gain. Then that gain is held constant until the +end of the sound.
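+As an illustration, a minimal sketch using the parameters documented below (values mirror the defaults):
from audiomentations import GainTransition
+
+transform = GainTransition(
+    min_gain_db=-24.0,
+    max_gain_db=6.0,
+    min_duration=0.2,
+    max_duration=6.0,
+    duration_unit="seconds",
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+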
+min_gain_in_db
: float
• unit: Decibelmin_gain_db
insteadmax_gain_in_db
: float
• unit: Decibelmax_gain_db
insteadmin_gain_db
: float
• unit: Decibel-24.0
. Minimum gain.max_gain_db
: float
• unit: Decibel6.0
. Maximum gain.min_duration
: Union[float, int]
• unit: see duration_unit
0.2
. Minimum length of transition.max_duration
: Union[float, int]
• unit: see duration_unit
6.0
. Maximum length of transition.duration_unit
: str
• choices: "fraction"
, "samples"
, "seconds"
Default: "seconds"
. Defines the unit of the value of min_duration
and max_duration
.
"fraction"
: Fraction of the total sound length"samples"
: Number of audio samples"seconds"
: Number of secondsp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.HighPassFilter
Added in v0.18.0, updated in v0.21.0
+Apply high-pass filtering to the input audio. The filter steepness (6/12/18... dB/octave) is parametrized. Can also be set for zero-phase filtering (will result in a 6 dB drop at the cutoff frequency).
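+A minimal illustrative sketch based on the cutoff parameters documented below:
from audiomentations import HighPassFilter
+
+transform = HighPassFilter(min_cutoff_freq=20.0, max_cutoff_freq=2400.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+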
+min_cutoff_freq
: float
• unit: hertz20.0
. Minimum cutoff frequencymax_cutoff_freq
: float
• unit: hertz2400.0
. Maximum cutoff frequencymin_rolloff
: float
• unit: Decibels/octave12
. Minimum filter roll-off (in dB/octave).
+Must be a multiple of 6max_rolloff
: float
• unit: Decibels/octave24
. Maximum filter roll-off (in dB/octave).
+Must be a multiple of 6zero_phase
: bool
False
. Whether filtering should be zero phase.
+When this is set to True
it will not affect the phase of the input signal but will
+sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB
+vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If
+you absolutely want no phase distortions (e.g. want to augment an audio file with
+lots of transients, like a drum track), set this to True
.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.HighShelfFilter
Added in v0.21.0
+A high shelf filter is a filter that either boosts (increases amplitude) or cuts
+(decreases amplitude) frequencies above a certain center frequency. This transform
+applies a high-shelf filter at a specific center frequency in hertz.
+The gain at nyquist frequency is controlled by {min,max}_gain_db
(note: can be positive or negative!).
+Filter coefficients are taken from the W3 Audio EQ Cookbook
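+For illustration, a minimal sketch; parameter names and values mirror the API documentation below:
from audiomentations import HighShelfFilter
+
+transform = HighShelfFilter(
+    min_center_freq=300.0,
+    max_center_freq=7500.0,
+    min_gain_db=-18.0,
+    max_gain_db=18.0,
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+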
min_center_freq
: float
• unit: hertz300.0
. The minimum center frequency of the shelving filtermax_center_freq
: float
• unit: hertz7500.0
. The maximum center frequency of the shelving filtermin_gain_db
: float
• unit: Decibel-18.0
. The minimum gain at the nyquist frequencymax_gain_db
: float
• unit: Decibel18.0
. The maximum gain at the nyquist frequencymin_q
: float
• range: (0.0, 1.0]0.1
. The minimum quality factor Q. The higher
+the Q, the steeper the transition band will be.max_q
: float
• range: (0.0, 1.0]0.999
. The maximum quality factor Q. The higher
+the Q, the steeper the transition band will be.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Lambda
Added in v0.26.0
+Apply a user-defined transform (callable) to the signal. The inspiration for this
+transform comes from albumentation's lambda transform. This allows one to have a little
+more fine-grained control over the operations in the context of a Compose
, OneOf
or SomeOf
import random
+
+from audiomentations import Lambda, OneOf, Gain
+
+
+def gain_only_left_channel(samples, sample_rate):
+ samples[0, :] *= random.uniform(0.8, 1.25)
+ return samples
+
+
+transform = OneOf(
+ transforms=[Lambda(transform=gain_only_left_channel, p=1.0), Gain(p=1.0)]
+)
+
+augmented_sound = transform(my_stereo_waveform_ndarray, sample_rate=16000)
+
transform
: Callable
p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.**kwargs
Limiter
Added in v0.26.0
+The Limiter
, based on cylimiter , is a straightforward audio transform that applies dynamic range compression.
+It is capable of limiting the audio signal based on certain parameters.
+Additionally, please note that this transform introduces a slight delay in the signal, equivalent to a fraction of the attack time.
In this example we apply the limiter with a threshold that is 10 dB lower than the signal peak
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import Limiter
+
+transform = Limiter(
+ min_threshold_db=-16.0,
+ max_threshold_db=-6.0,
+ threshold_mode="relative_to_signal_peak",
+ p=1.0,
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
from audiomentations import Limiter
+
+transform = Limiter(
+ min_threshold_db=-16.0,
+ max_threshold_db=-6.0,
+ threshold_mode="absolute",
+ p=1.0,
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
min_threshold_db
: float
• unit: Decibel-24.0
. Minimum thresholdmax_threshold_db
: float
• unit: Decibel-2.0
. Maximum thresholdmin_attack
: float
• unit: seconds0.0005
. Minimum attack timemax_attack
: float
• unit: seconds0.025
. Maximum attack timemin_release
: float
• unit: seconds0.05
. Minimum release timemax_release
: float
• unit: seconds0.7
. Maximum release timethreshold_mode
: str
• choices: "relative_to_signal_peak"
, "absolute"
Default: relative_to_signal_peak
. Specifies the mode for determining the threshold.
"relative_to_signal_peak"
means the threshold is relative to peak of the signal."absolute"
means the threshold is relative to 0 dBFS, so it doesn't depend
+ on the peak of the signal.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.LoudnessNormalization
Added in v0.14.0
+Apply a constant amount of gain to match a specific loudness (in LUFS). This is an +implementation of ITU-R BS.1770-4.
+For an explanation on LUFS, see https://en.wikipedia.org/wiki/LUFS
+See also the following web pages for more info on audio loudness normalization:
+ +Warning: This transform can return samples outside the [-1, 1] range, which may lead to +clipping or wrap distortion, depending on what you do with the audio in a later stage. +See also https://en.wikipedia.org/wiki/Clipping_(audio)#Digital_clipping
+min_lufs_in_db
: float
• unit: LUFSmin_lufs
insteadmax_lufs_in_db
: float
• unit: LUFSmax_lufs
insteadmin_lufs
: float
• unit: LUFS-31.0
. Minimum loudness targetmax_lufs
: float
• unit: LUFS-13.0
. Maximum loudness targetp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.LowPassFilter
Added in v0.18.0, updated in v0.21.0
+Apply low-pass filtering to the input audio of parametrized filter steepness (6/12/18... dB / octave). +Can also be set for zero-phase filtering (will result in a 6db drop at cutoff).
+min_cutoff_freq
: float
• unit: hertz150.0
. Minimum cutoff frequencymax_cutoff_freq
: float
• unit: hertz7500.0
. Maximum cutoff frequencymin_rolloff
: float
• unit: Decibels/octave12
. Minimum filter roll-off (in dB/octave).
+Must be a multiple of 6max_rolloff
: float
• unit: Decibels/octave24
. Maximum filter roll-off (in dB/octave)
+Must be a multiple of 6zero_phase
: bool
False
. Whether filtering should be zero phase.
+When this is set to True
it will not affect the phase of the input signal but will
+sound 3 dB lower at the cutoff frequency compared to the non-zero phase case (6 dB
+vs. 3 dB). Additionally, it is 2 times slower than in the non-zero phase case. If
+you absolutely want no phase distortions (e.g. want to augment an audio file with
+lots of transients, like a drum track), set this to True
.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.LowShelfFilter
Added in v0.21.0
+A low shelf filter is a filter that either boosts (increases amplitude) or cuts
+(decreases amplitude) frequencies below a certain center frequency. This transform
+applies a low-shelf filter at a specific center frequency in hertz.
+The gain at DC frequency is controlled by {min,max}_gain_db
(note: can be positive or negative!).
+Filter coefficients are taken from the W3 Audio EQ Cookbook
min_center_freq
: float
• unit: hertz50.0
. The minimum center frequency of the shelving filtermax_center_freq
: float
• unit: hertz4000.0
. The maximum center frequency of the shelving filtermin_gain_db
: float
• unit: Decibel-18.0
. The minimum gain at DC (0 Hz)max_gain_db
: float
• unit: Decibel18.0
. The maximum gain at DC (0 Hz)min_q
: float
• range: (0.0, 1.0]0.1
. The minimum quality factor Q. The higher
+the Q, the steeper the transition band will be.max_q
: float
• range: (0.0, 1.0]0.999
. The maximum quality factor Q. The higher
+the Q, the steeper the transition band will be.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Mp3Compression
Added in v0.12.0
+Compress the audio using an MP3 encoder to lower the audio quality. This may help machine +learning models deal with compressed, low-quality audio.
+This transform depends on either lameenc or pydub/ffmpeg.
+Note that bitrates below 32 kbps are only supported for low sample rates (up to 24000 Hz).
+Note: When using the "lameenc"
backend, the output may be slightly longer than the input due
+to the fact that the LAME encoder inserts some silence at the beginning of the audio.
Warning: This transform writes to disk, so it may be slow.
+min_bitrate
: int
• unit: kbps • range: [8, max_bitrate
]8
. Minimum bitrate in kbpsmax_bitrate
: int
• unit: kbps • range: [min_bitrate
, 320]64
. Maximum bitrate in kbpsbackend
: str
• choices: "pydub"
, "lameenc"
Default: "pydub"
.
"pydub"
: May use ffmpeg under the hood. Pro: Seems to avoid introducing latency in
+ the output. Con: Slightly slower than "lameenc"
."lameenc"
: Pro: With this backend you can set the quality parameter in addition
+ to the bitrate (although this parameter is not exposed in the audiomentations API
+ yet). Con: Seems to introduce some silence at the start of the audio.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Normalize
Added in v0.6.0
+Apply a constant amount of gain, so that the highest signal level present in the sound becomes 0 dBFS, i.e. the loudest level allowed if all samples must be between -1 and 1. Also known as peak normalization.
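+A minimal illustrative sketch (this transform has no parameters besides p):
from audiomentations import Normalize
+
+transform = Normalize(p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+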
+p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Padding
Added in v0.23.0
+Apply padding to the audio signal - take a fraction of the end or the start of the +audio and replace that part with padding. This can be useful for preparing ML models +with constant input length for padded inputs.
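+An illustrative usage sketch; parameter names and values mirror the documented defaults below:
from audiomentations import Padding
+
+transform = Padding(
+    mode="silence",
+    min_fraction=0.01,
+    max_fraction=0.7,
+    pad_section="end",
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+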
+mode
: str
• choices: "silence"
, "wrap"
, "reflect"
"silence"
. Padding mode.min_fraction
: float
• range: [0.0, 1.0]0.01
. Minimum fraction of the signal duration to be paddedmax_fraction
: float
• range: [0.0, 1.0]0.7
. Maximum fraction of the signal duration to be paddedpad_section
: str
• choices: "start"
, "end"
"end"
. Which part of the signal should be replaced with paddingp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.PeakingFilter
Added in v0.21.0
+Add a biquad peaking filter transform
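+For illustration, a minimal sketch based on the parameters documented below:
from audiomentations import PeakingFilter
+
+transform = PeakingFilter(
+    min_center_freq=50.0,
+    max_center_freq=7500.0,
+    min_gain_db=-24.0,
+    max_gain_db=24.0,
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=48000)
+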
+min_center_freq
: float
• unit: hertz • range: [0.0, ∞)50.0
. The minimum center frequency of the peaking filtermax_center_freq
: float
• unit: hertz • range: [0.0, ∞)7500.0
. The maximum center frequency of the peaking filtermin_gain_db
: float
• unit: Decibel-24.0
. The minimum gain at center frequencymax_gain_db
: float
• unit: Decibel24.0
. The maximum gain at center frequencymin_q
: float
• range: [0.0, ∞)0.5
. The minimum quality factor Q. The higher the
+Q, the steeper the transition band will be.max_q
: float
• range: [0.0, ∞)5.0
. The maximum quality factor Q. The higher the
+Q, the steeper the transition band will be.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.PitchShift
Added in v0.4.0
+Pitch shift the sound up or down without changing the tempo.
+Under the hood this does time stretching (by phase vocoding) followed by resampling. +Note that phase vocoding can degrade audio quality by "smearing" transient sounds, +altering the timbre of harmonic sounds, and distorting pitch modulations. This may +result in a loss of sharpness, clarity, or naturalness in the transformed audio.
+If you need a better sounding pitch shifting method, consider the following alternatives:
+Here we pitch down a piano recording by 4 semitones:
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import PitchShift
+
+transform = PitchShift(
+ min_semitones=-5.0,
+ max_semitones=5.0,
+ p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=44100)
+
min_semitones
: float
• unit: semitones • range: [-12.0, 12.0]-4.0
. Minimum semitones to shift. Negative number means shift down.max_semitones
: float
• unit: semitones • range: [-12.0, 12.0]4.0
. Maximum semitones to shift. Positive number means shift up.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.PolarityInversion
Added in v0.11.0
+Flip the audio samples upside-down, reversing their polarity. In other words, multiply the +waveform by -1, so negative values become positive, and vice versa. The result will sound +the same compared to the original when played back in isolation. However, when mixed with +other audio sources, the result may be different. This waveform inversion technique +is sometimes used for audio cancellation or obtaining the difference between two waveforms. +However, in the context of audio data augmentation, this transform can be useful when +training phase-aware machine learning models.
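+A minimal illustrative sketch (this transform has no parameters besides p):
from audiomentations import PolarityInversion
+
+transform = PolarityInversion(p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+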
+p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.PostGain
Added in v0.31.0
+Gain up or down the audio after the given transform (or set of transforms) has
+processed the audio. There are several methods that determine how the audio should
+be gained. PostGain
can be useful for compensating for any gain differences introduced
+by a (set of) transform(s), or for preventing clipping in the output.
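+For illustration, a minimal sketch that wraps another documented transform; the method value is one of the documented choices below:
from audiomentations import PostGain, Gain
+
+# Compensate the gain change introduced by the wrapped Gain transform
+transform = PostGain(
+    transform=Gain(min_gain_db=-12.0, max_gain_db=12.0, p=1.0),
+    method="same_rms",
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+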
transform
: Callable[[NDArray[np.float32], int], NDArray[np.float32]]
method
: str
• choices: "same_rms"
, "same_lufs"
or "peak_normalize_always"
This parameter defines the method for choosing the post gain amount.
+"same_rms"
: The sound gets post-gained so that the RMS (Root Mean Square) of
+ the output matches the RMS of the input."same_lufs"
: The sound gets post-gained so that the LUFS (Loudness Units Full Scale) of
+ the output matches the LUFS of the input."peak_normalize_always"
: The sound gets peak normalized (gained up or down so
+ that the absolute value of the most extreme sample in the output is 1.0)"peak_normalize_if_too_loud"
: The sound gets peak normalized if it is too
+ loud (max absolute value greater than 1.0). This option can be useful for
+ avoiding clipping.RepeatPart
Added in v0.32.0
+Select a subsection (or "part") of the audio and repeat that part a number of times. +This can be useful when simulating scenarios where a short audio snippet gets +repeated, for example:
+Note that the length of inputs you give it must be compatible with the part
+duration range and crossfade duration. If you give it an input audio array that is
+too short, a UserWarning
will be raised and no operation is applied to the signal.
In this speech example, the audio was transformed with
+SevenBandParametricEQ
as the part transform. This is why each repeat in the output
+ has a different timbre.Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import RepeatPart
+
+transform = RepeatPart(mode="insert", p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
from audiomentations import RepeatPart
+
+transform = RepeatPart(mode="replace", p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+
min_repeats
: int
• range: [1, max_repeats
]1
. Minimum number of times a selected audio
+ segment should be repeated in addition to the original. For instance, if the selected
+ number of repeats is 1, the selected segment will be followed by one repeat.max_repeats
: int
• range: [min_repeats
, ∞)3
. Maximum number of times a selected audio
+ segment can be repeated in addition to the originalmin_part_duration
: float
• unit: seconds • range: [0.00025, max_part_duration
]0.25
. Minimum duration (in seconds) of the audio
+ segment that can be selected for repetition.max_part_duration
: float
• unit: seconds • range: [min_part_duration
, ∞)1.2
. Maximum duration (in seconds) of the audio
+ segment that can be selected for repetition.mode
: str
• choices: "insert"
, "replace"
Default: "insert"
. This parameter has two options:
"insert"
: Insert the repeat(s), making the array longer. After the last
+ repeat there will be the last part of the original audio, offset in time
+ compared to the input array."replace"
: Have the repeats replace (as in overwrite) the original audio.
+ Any remaining part at the end (if not overwritten by repeats) will be
+ left untouched without offset. The length of the output array is the
+ same as the input array.crossfade_duration
: float
• unit: seconds • range: 0.0 or [0.00025, ∞)0.005
. Duration for crossfading between repeated
+ parts as well as potentially from the original audio to the repeats and back.
+ The crossfades will be equal-energy or equal-gain depending on the audio and/or the
+ chosen parameters of the transform. The crossfading feature can be used to smooth
+ transitions and avoid abrupt changes, which can lead to impulses/clicks in the audio.
+ If you know what you're doing, and impulses/clicks are desired for your use case,
+ you can disable the crossfading by setting this value to 0.0
.part_transform
: Optional[Callable[[NDArray[np.float32], int], NDArray[np.float32]]]
part_transform
+ that makes the part shorter is only supported if the transformed part is at
+ least two times the crossfade duration.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Resample
Added in v0.8.0
+Resample signal using librosa.core.resample
+To do downsampling only, set both the minimum and the maximum sample rate lower than the original sample rate; to do upsampling only, set both higher.
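+A minimal illustrative sketch; the sample rate range mirrors the documented defaults below:
from audiomentations import Resample
+
+transform = Resample(min_sample_rate=8000, max_sample_rate=44100, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=44100)
+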
+min_sample_rate
: int
• unit: Hz8000
. Minimum sample ratemax_sample_rate
: int
• unit: Hz44100
. Maximum sample ratep
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Reverse
Added in v0.18.0
+Reverse the audio. Also known as time inversion. Inversion of an audio track along its time +axis relates to the random flip of an image, which is an augmentation technique that is +widely used in the visual domain. This can be relevant in the context of audio +classification. It was successfully applied in the paper +AudioCLIP: Extending CLIP to Image, Text and Audio .
+In this example, we reverse a speech recording
+ +Input sound | +Transformed sound | +
---|---|
+ | + |
from audiomentations import Reverse
+
+transform = Reverse(p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=44100)
+
p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.RoomSimulator
Added in v0.23.0
+A ShoeBox Room Simulator. Simulates a cuboid of parametrized size and average surface absorption coefficient. It also includes a source +and microphones in parametrized locations.
+Use it when you want a ton of synthetic room impulse responses with specific configurations/characteristics, or simply to quickly add reverb for augmentation purposes.
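+A minimal illustrative sketch; calculation_mode is one of the documented choices below, and all other parameters keep their documented defaults:
from audiomentations import RoomSimulator
+
+transform = RoomSimulator(calculation_mode="absorption", p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+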
+min_size_x
: float
• unit: meters3.6
. Minimum width (x coordinate) of the room in metersmax_size_x
: float
• unit: meters5.6
. Maximum width of the room in metersmin_size_y
: float
• unit: meters3.6
. Minimum depth (y coordinate) of the room in metersmax_size_y
: float
• unit: meters3.9
. Maximum depth of the room in metersmin_size_z
: float
• unit: meters2.4
. Minimum height (z coordinate) of the room in metersmax_size_z
: float
• unit: meters3.0
. Maximum height of the room in metersmin_absorption_value
: float
Default: 0.075
. Minimum absorption coefficient value.
+When calculation_mode
is "absorption"
+it will set the given coefficient value for the surfaces of the room (walls,
+ceilings, and floor). This coefficient takes values between 0 (fully reflective
+surface) and 1 (fully absorbing surface).
Example values (may differ!):
+Environment | +Coefficient value | +
---|---|
Studio with acoustic panels | +> 0.40 | +
Office / Library | +~ 0.15 | +
Factory | +~ 0.05 | +
max_absorption_value
: float
0.4
. Maximum absorption coefficient value. See
+min_absorption_value
for more
+info.min_target_rt60
: float
• unit: seconds Default: 0.15
. Minimum target RT60. RT60 is defined as the
+measure of the time after the sound source ceases that it takes for the sound
+pressure level to reduce by 60 dB. When
+calculation_mode
is "rt60"
, it tries
+to set the absorption value of the surfaces of the room to achieve a target RT60
+(in seconds). Note that this parameter changes only the materials (absorption
+coefficients) of the surfaces, not the dimension of the rooms.
Example values (may differ!):
+Environment | +RT60 | +
---|---|
Recording studio | +0.3 s | +
Office | +0.5 s | +
Concert hall | +1.5 s | +
max_target_rt60
: float
• unit: seconds0.8
. Maximum target RT60. See
+min_target_rt60
for more info.min_source_x
: float
• unit: meters0.1
. Minimum x location of the sourcemax_source_x
: float
• unit: meters3.5
. Maximum x location of the sourcemin_source_y
: float
• unit: meters0.1
. Minimum y location of the sourcemax_source_y
: float
• unit: meters2.7
. Maximum y location of the sourcemin_source_z
: float
• unit: meters1.0
. Minimum z location of the sourcemax_source_z
: float
• unit: meters2.1
. Maximum z location of the sourcemin_mic_distance
: float
• unit: meters0.15
. Minimum distance of the microphone from the
+source in metersmax_mic_distance
: float
• unit: meters0.35
. Maximum distance of the microphone from the
+source in metersmin_mic_azimuth
: float
• unit: radians-math.pi
. Minimum azimuth (angle around z axis) of the
+microphone relative to the source.max_mic_azimuth
: float
• unit: radiansmath.pi
. Maximum azimuth (angle around z axis) of the
+microphone relative to the source.min_mic_elevation
: float
• unit: radians-math.pi
. Minimum elevation of the microphone relative
+to the source, in radians.max_mic_elevation
: float
• unit: radiansmath.pi
. Maximum elevation of the microphone relative
+to the source, in radians.calculation_mode
: str
• choices: "rt60"
, "absorption"
"absorption"
. When set to "absorption"
, it will
+create the room with surfaces based on
+min_absorption_value
and
+max_absorption_value
. If set to
+"rt60"
it will try to assign surface materials that lead to a room impulse
+response with target rt60 given by
+min_target_rt60
and
+max_target_rt60
use_ray_tracing
: bool
True
. Whether to use ray_tracing or not (slower
+but much more accurate). Disable this if you need speed but do not really care for
+incorrect results.max_order
: int
• range: [1, ∞) Default: 1
. Maximum order of reflections for the Image
+Source Model. E.g. a value of 1 will only add first order reflections while a value
+of 12 will add a diffuse reverberation tail.
Warning
+Placing this higher than 11-12 will result in a very slow augmentation process when calculation_mode="rt60"
.
Tip
+When using calculation_mode="rt60"
, keep it around 3-4
.
leave_length_unchanged
: bool
False
. When set to True, the tail of the sound
+(e.g. reverb at the end) will be chopped off so that the length of the output is
+equal to the length of the input.padding
: float
• unit: meters0.1
. Minimum distance in meters between source or
+mic and the room walls, floor or ceiling.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.ray_tracing_options
: Optional[Dict]
None
. Options for the ray tracer. See set_ray_tracing
here:SevenBandParametricEQ
Added in v0.24.0
+Adjust the volume of different frequency bands. This transform is a 7-band +parametric equalizer - a combination of one low shelf filter, five peaking filters +and one high shelf filter, all with randomized gains, Q values and center frequencies.
+Because this transform changes the timbre, but keeps the overall "class" of the +sound the same (depending on application), it can be used for data augmentation to +make ML models more robust to various frequency spectrums. Many things can affect +the spectrum, for example:
+The seven bands have center frequencies picked in the following ranges (min-max):
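+As an illustration, a minimal usage sketch based on the gain parameters documented below:
from audiomentations import SevenBandParametricEQ
+
+transform = SevenBandParametricEQ(min_gain_db=-12.0, max_gain_db=12.0, p=1.0)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=44100)
+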
+min_gain_db
: float
• unit: Decibel-12.0
. Minimum number of dB to cut or boost a bandmax_gain_db
: float
• unit: decibel12.0
. Maximum number of dB to cut or boost a bandp
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.Shift
Added in v0.5.0
+Shift the samples forwards or backwards, with or without rollover
+This only applies to version 0.33.0 and newer. If you are using an older +version, you should consider upgrading. Or if you really want to keep using the old +version, you can check the "Old Shift API (<=v0.32.0)" section below
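+Below is a minimal illustrative sketch of the new API; parameter names and values mirror the documentation that follows:
from audiomentations import Shift
+
+transform = Shift(
+    min_shift=-0.5,
+    max_shift=0.5,
+    shift_unit="fraction",
+    rollover=True,
+    fade_duration=0.005,
+    p=1.0
+)
+
+augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
+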
+min_shift
: float | int
-0.5
. Minimum amount of shifting in time. See also
+shift_unit
.max_shift
: float | int
0.5
. Maximum amount of shifting in time. See also
+shift_unit
.shift_unit
: str
• choices: "fraction"
, "samples"
, "seconds"
Default: "fraction"
Defines the unit of the value of
+min_shift
and max_shift
.
"fraction"
: Fraction of the total sound length"samples"
: Number of audio samples"seconds"
: Number of secondsrollover
: bool
True
. When set to True
, samples that roll
+beyond the first or last position are re-introduced at the last or first. When set
+to False
, samples that roll beyond the first or last position are discarded. In
+other words, rollover=False
results in an empty space (with zeroes).fade_duration
: float
• unit: seconds • range: 0.0 or [0.00025, ∞)0.005
. If you set this to a positive number,
+there will be a fade in and/or out at the "stitch" (that was the start or the end
+of the audio before the shift). This can smooth out an unwanted abrupt
+change between two consecutive samples (which sounds like a
+transient/click/pop). This parameter denotes the duration of the fade in
+seconds. To disable the fading feature, set this parameter to 0.0.p
: float
• range: [0.0, 1.0]0.5
. The probability of applying this transform.This only applies to version 0.32.0 and older
min_fraction: float • range: [-1, 1]
Default: -0.5. Minimum fraction of total sound length to shift.

max_fraction: float • range: [-1, 1]
Default: 0.5. Maximum fraction of total sound length to shift.

rollover: bool
Default: True. When set to True, samples that roll beyond the first or last position are re-introduced at the last or first. When set to False, samples that roll beyond the first or last position are discarded. In other words, rollover=False results in an empty space (with zeroes).

fade: bool
Default: False. When set to True, there will be a short fade in and/or out at the "stitch" (that was the start or the end of the audio before the shift). This can smooth out an unwanted abrupt change between two consecutive samples (which sounds like a transient/click/pop).

fade_duration: float • unit: seconds
Default: 0.01. If fade=True, then this is the duration of the fade in seconds.

p: float • range: [0.0, 1.0]
Default: 0.5. The probability of applying this transform.

TanhDistortion
Added in v0.19.0
Apply tanh (hyperbolic tangent) distortion to the audio. This technique is sometimes used for adding distortion to guitar recordings. The tanh() function can give a rounded "soft clipping" kind of distortion, and the distortion amount is proportional to the loudness of the input and the pre-gain. Tanh is symmetric, so the positive and negative parts of the signal are squashed in the same way. This transform can be useful as data augmentation because it adds harmonics. In other words, it changes the timbre of the sound.
See this page for examples: http://gdsp.hf.ntnu.no/lessons/3/17/
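To build intuition for the technique, here is a small NumPy sketch of symmetric tanh soft clipping. This is an illustrative simplification, not the transform's exact implementation; the function name and the pre-gain handling are assumptions:

```python
import numpy as np

def tanh_soft_clip(samples: np.ndarray, pre_gain_db: float) -> np.ndarray:
    """Boost the signal by pre_gain_db, then squash it with tanh (soft clipping)."""
    pre_gain = 10.0 ** (pre_gain_db / 20.0)  # convert decibels to a linear gain factor
    return np.tanh(pre_gain * samples)

# A louder input or a higher pre-gain drives the signal further into the curved
# region of tanh, which squashes the peaks harder and adds more harmonics.
sine = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000)  # 1 second of 440 Hz at 16 kHz
distorted = tanh_soft_clip(sine, pre_gain_db=12.0)
```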
In this example we apply tanh distortion with the "distortion amount" (think of it as a knob that goes from 0 to 1) set to 0.25.
[Audio examples: Input sound | Transformed sound]
```python
from audiomentations import TanhDistortion

transform = TanhDistortion(
    min_distortion=0.01,
    max_distortion=0.7,
    p=1.0
)

augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
```
min_distortion: float • range: [0.0, 1.0]
Default: 0.01. Minimum "amount" of distortion to apply to the signal.

max_distortion: float • range: [0.0, 1.0]
Default: 0.7. Maximum "amount" of distortion to apply to the signal.

p: float • range: [0.0, 1.0]
Default: 0.5. The probability of applying this transform.

TimeMask
Added in v0.7.0
Make a randomly chosen part of the audio silent. Inspired by the SpecAugment paper: https://arxiv.org/pdf/1904.08779.pdf
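Conceptually, the transform zeroes out one randomly placed, randomly sized chunk of the waveform. Here is a small NumPy sketch of that idea; it is an illustrative simplification (no fade, and not the library's exact implementation):

```python
import numpy as np

rng = np.random.default_rng()

def apply_time_mask(samples: np.ndarray, min_band_part: float, max_band_part: float) -> np.ndarray:
    """Silence one random chunk whose length is a random fraction of the total length."""
    num_samples = samples.shape[-1]
    mask_len = int(rng.uniform(min_band_part, max_band_part) * num_samples)
    mask_start = rng.integers(0, num_samples - mask_len + 1)
    masked = samples.copy()
    masked[..., mask_start : mask_start + mask_len] = 0.0
    return masked
```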
Here we silence a part of a speech recording.
[Audio examples: Input sound | Transformed sound]
```python
from audiomentations import TimeMask

transform = TimeMask(
    min_band_part=0.1,
    max_band_part=0.15,
    fade=True,
    p=1.0,
)

augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
```
min_band_part: float • range: [0.0, 1.0]
Default: 0.0. Minimum length of the silent part as a fraction of the total sound length.

max_band_part: float • range: [0.0, 1.0]
Default: 0.5. Maximum length of the silent part as a fraction of the total sound length.

fade: bool
Default: False. When set to True, add a linear fade in and fade out of the silent part. This can smooth out an unwanted abrupt change between two consecutive samples (which sounds like a transient/click/pop).

p: float • range: [0.0, 1.0]
Default: 0.5. The probability of applying this transform.

TimeStretch
Added in v0.2.0
Change the speed or duration of the signal without changing the pitch. This transform employs librosa.effects.time_stretch under the hood, which performs phase vocoding.

Note that phase vocoding can degrade audio quality by "smearing" transient sounds, altering the timbre of harmonic sounds, and distorting pitch modulations. This may result in a loss of sharpness, clarity, or naturalness in the transformed audio, especially when the rate is set to an extreme value.

If you need a better sounding time stretch method, consider a dedicated, higher-quality time stretching algorithm instead.
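For reference, the underlying call looks roughly like this (a minimal sketch; my_waveform_ndarray as in the other examples):

```python
import librosa

# rate > 1.0 speeds the audio up (shorter output); rate < 1.0 slows it down
# (longer output). The pitch is preserved either way.
stretched = librosa.effects.time_stretch(my_waveform_ndarray, rate=1.25)
```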
In this example we speed up a sound by 25%. This corresponds to a rate of 1.25, which makes the output 1 / 1.25 = 0.8 times as long as the input.
[Audio examples: Input sound | Transformed sound]
```python
from audiomentations import TimeStretch

transform = TimeStretch(
    min_rate=0.8,
    max_rate=1.25,
    leave_length_unchanged=True,
    p=1.0
)

augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
```
min_rate: float • range: [0.1, 10.0]
Default: 0.8. Minimum rate of change of the total duration of the signal. A rate below 1 means the audio is slowed down.

max_rate: float • range: [0.1, 10.0]
Default: 1.25. Maximum rate of change of the total duration of the signal. A rate greater than 1 means the audio is sped up.

leave_length_unchanged: bool
Default: True. The rate changes the duration and affects the samples. When this flag is set to True, the output is padded or trimmed so its total length matches that of the input signal.

p: float • range: [0.0, 1.0]
Default: 0.5. The probability of applying this transform.

Trim
Added in v0.7.0
Trim leading and trailing silence from an audio signal using librosa.effects.trim. Audio that is more than top_db decibels below the reference level (the signal's peak) is considered silence.
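For intuition, the underlying librosa function can also be called directly; a minimal sketch (my_waveform_ndarray as in the other examples):

```python
import librosa

# Returns the trimmed signal and the (start, end) sample indices of the
# non-silent region within the original signal.
trimmed, index = librosa.effects.trim(my_waveform_ndarray, top_db=30.0)
```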
In this example we remove silence from the start and end, using the default top_db parameter value.
[Audio examples: Input sound | Transformed sound]
```python
from audiomentations import Trim

transform = Trim(
    top_db=30.0,
    p=1.0
)

augmented_sound = transform(my_waveform_ndarray, sample_rate=16000)
```