This library is designed to augment audio data for machine learning purposes. It combines several tools and libraries for audio data augmentation and provides a unified interface through which a large set of audio augmentations can be applied in one place.
The library is designed to be used with the PyTorch machine learning framework, but it can also operate directly on plain audio waveforms or augment entire local audio datasets.
This library combines the following libraries and tools:
The table below shows which library is used to apply each audio augmentation/codec.
For a more complex example, see the example Colab notebook above, or the Jupyter notebook AudioAugmentor_Usage_Example.ipynb in the examples directory of this repository.
Note: AudioAugmentor was mainly tested with Python 3.11.8 on Fedora 38 (Google Colab uses Python 3.10 and Ubuntu).
0. You need to install the library and necessary packages first
!!!You may need to run the following commands with sudo!!!
If so, install these packages manually in a terminal.
pip install -U pip
pip install AudioAugmentor
dnf install -y sox # FEDORA
dnf install -y sox-devel # FEDORA
dnf install -y ffmpeg # FEDORA
# apt-get install -y sox # UBUNTU
# apt-get install -y libsox-dev # UBUNTU
# apt-get install -y ffmpeg # UBUNTU
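To quickly check that the system tools are actually on your PATH before continuing, you can run a minimal sketch like this (standard library only):
import shutil

# Verify that the system tools used by the SoX effects and codecs are installed
for tool in ('sox', 'ffmpeg'):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")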
1. Import necessary libraries
import torch
import torchaudio
import numpy as np
import audiomentations as AA
from IPython.display import Audio, display
from AudioAugmentor import transf_gen
from AudioAugmentor import sox_parser
from AudioAugmentor import core
from AudioAugmentor import rir_setup
from AudioAugmentor import torchaudio_transf_wrapper as TTW
2. Define the augmentations you want to apply to your audio data.
You have three options for defining the augmentations:
a) Use the transf_gen.transf_gen function to generate a list of transformations.
See the supported-transformations table and the examples of every augmentation so you know which parameters each augmentation method requires.
You can enter augmentation parameters as a string or as a dictionary.
PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'
PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}
sampling_rate = 16000  # sample rate of the audio being augmented

transformations = transf_gen.transf_gen(verbose=True,
    PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0',
    Speed={'orig_freq': 16000, 'factor': [0.9, 1.5, 0.1], 'p': 1},
    LowPassFilter={'min_cutoff_freq': 700, 'max_cutoff_freq': 800, 'sample_rate': sampling_rate, 'p': 1},
)
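The list parameters above (e.g. n_steps=[1, 1.5, 0.1]) follow a [min, max, step] convention. The snippet below only illustrates that convention; it is an assumption based on the examples in this README, not the library's internal code:
import random

# Illustration of the assumed [min, max, step] range convention:
# one value from the grid would be used per augmentation call.
low, high, step = [1, 1.5, 0.1]
grid = np.arange(low, high + step / 2, step)  # [1.0, 1.1, ..., 1.5]
print(random.choice(list(grid)))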
b) Use a pseudo SoX command. The SoX command must be in this format:
--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
(when you do not want to apply a codec after the SoX effects)
OR
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
(when you want to apply a codec after the SoX effects; the codec is entered as codec_name codec_parameter_name codec_parameter_value directly after the SoX effects command)
example_sox = '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k'
This command string is later passed as the sox_effects argument (see step 3).
c) Use a file with multiple pseudo SoX commands. A random SoX command from this file will be chosen and applied to your data.
The file must be loaded using the sox_parser.load_sox_file function.
sox_file_content_to_write = '''--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
#--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
'''
with open('sox_file_example.txt', 'w') as f:
f.write(sox_file_content_to_write)
sox_file_content = sox_parser.load_sox_file('sox_file_example.txt')
print('SOX FILE LOADED:', sox_file_content, type(sox_file_content))
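When the loaded file content is passed to the library, one command is chosen at random per audio sample. Conceptually (a sketch of the behavior, assuming load_sox_file returns a list of command strings; this is not the library's internal implementation):
import random

# Conceptual sketch: one pseudo SoX command is drawn per sample
print('Would apply:', random.choice(sox_file_content))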
3. Apply augmentations
a) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the Collator class.
Use the initialized class as the collate_fn argument of PyTorch's DataLoader.
collate_fn = core.Collator(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=sampling_rate, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=False,
)
dataset = torchaudio.datasets.LIBRISPEECH("../data", url="train-clean-100", download=True)
aug_dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=1,
num_workers=0,
collate_fn=collate_fn,
)
augmented_record_from_dataset = next(iter(aug_dataloader))
display(Audio(augmented_record_from_dataset[0].squeeze(0).squeeze(0).squeeze(0).cpu(), rate=sampling_rate))
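To keep a few augmented samples for listening tests, you can iterate the dataloader and write the waveforms out with torchaudio.save. A minimal sketch, assuming the first element of each batch is the waveform tensor as in the LIBRISPEECH example above:
import itertools

# Save the first three augmented batches to disk for inspection
for i, batch in enumerate(itertools.islice(aug_dataloader, 3)):
    waveform = batch[0].reshape(1, -1).cpu()  # torchaudio.save expects [channels, frames]
    torchaudio.save(f'augmented_{i}.wav', waveform, sampling_rate)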
OR
b) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentWaveform class, then apply the augmentations to an audio signal.
augment = core.AugmentWaveform(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
# Load test wav file
signal, fs = torchaudio.load('../data/test.wav')
# Apply transformations
waveform = augment(signal.numpy()[0])
display(Audio(waveform, rate=fs))
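If you want to keep the result, convert it back to a tensor and save it (assuming augment returns a 1-D NumPy array, as the display call above suggests):
# Save the augmented waveform back to disk
torchaudio.save('../data/test_augmented.wav',
                torch.from_numpy(waveform).unsqueeze(0),  # shape [1, frames]
                fs)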
c) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentLocalAudioDataset class, then apply the augmentations to a local audio dataset.
augment = core.AugmentLocalAudioDataset(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
augment(input_dir='../data/test-input-folder', output_dir='../data/test-output-folder')
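A quick way to confirm that the augmented files were written (the folder name is just the example one used above):
from pathlib import Path

# List the files produced in the output folder
for f in sorted(Path('../data/test-output-folder').iterdir()):
    print(f.name)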
!!!Pass the following examples as arguments to the transf_gen.transf_gen function to generate a list of transformations!!!
Like this:
transformations = transf_gen.transf_gen(verbose=True,
AddBackgroundNoise=f'background_paths="../data/musan/noise/free-sound", min_snr_in_db=10, max_snr_in_db=20, p=1, sample_rate={sampling_rate}',
AddColoredNoise=f'min_snr_in_db=9, max_snr_in_db=10, p=1, sample_rate={sampling_rate}',
)
You can enter augmentation parameters as a string or as a dictionary.
PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'
PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}
AddBackgroundNoise=f'''background_paths="../data/musan/noise/free-sound",
min_snr_in_db=10,
max_snr_in_db=20,
p=1,
sample_rate={sampling_rate}''',
AddColoredNoise=f'''min_snr_in_db=9,
max_snr_in_db=10,
p=1,
sample_rate={sampling_rate}''',
AddGaussianNoise={'min_amplitude': 0.001,
'max_amplitude': 0.015,
'p': 1},
AddShortNoises={'sounds_path': "../data/musan/noise/free-sound",
'min_snr_in_db': 3.0,
'max_snr_in_db': 30.0,
'noise_rms': "relative_to_whole_input",
'min_time_between_sounds': 2.0,
'max_time_between_sounds': 8.0,
'noise_transform': AA.PolarityInversion(),
'p': 1.0},
AdjustDuration={'duration_seconds': 3.5,
'padding_mode': 'silence',
'p': 1},
AirAbsorption={'min_distance': 10.0,
'max_distance': 50.0,
'min_humidity': 80.0,
'max_humidity': 90.0,
'min_temperature': 10.0,
'max_temperature': 20.0,
'p': 1.0},
ApplyImpulseResponse=f'''ir_paths="../data/Rir.wav",
p=1,
sample_rate={sampling_rate}''',
BandPassFilter=f'''min_center_frequency=200,
max_center_frequency=4000,
min_bandwidth_fraction=0.5,
max_bandwidth_fraction=1.99,
sample_rate={sampling_rate},
p=1''',
BandStopFilter=f'''min_center_frequency=200,
max_center_frequency=4000,
min_bandwidth_fraction=0.5,
max_bandwidth_fraction=1.99,
sample_rate={sampling_rate},
p=1''',
ClippingDistortion={'min_percentile_threshold': 10,
'max_percentile_threshold': 30,
'p': 1},
FrequencyMasking={'freq_mask_param': 80},
Vol={'gain': [2.5, 3, 0.1],
'p': 1.0},
GainTransition={'min_gain_db': 30,
'max_gain_db': 40,
'min_duration': 5,
'max_duration': 16,
'duration_unit': 'seconds',
'p': 1},
HighPassFilter=f'''min_cutoff_freq=700,
max_cutoff_freq=800,
sample_rate={sampling_rate},
p=1''',
HighShelfFilter={'min_center_freq': 2000,
'max_center_freq': 5000,
'min_gain_db': 10.0,
'max_gain_db': 16.0,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
Limiter='''min_threshold_db=-24,
max_threshold_db=-2,
min_attack=0.0005,
max_attack=0.025,
min_release=0.05,
max_release=0.7,
threshold_mode="relative_to_signal_peak",
p=1''',
LoudnessNormalization={'min_lufs': -31,
'max_lufs': -13,
'p': 1},
LowPassFilter={'min_cutoff_freq': 700,
'max_cutoff_freq': 800,
'sample_rate': sampling_rate,
'p': 1},
LowShelfFilter={'min_center_freq': 20,
'max_center_freq': 600,
'min_gain_db': -16.0,
'max_gain_db': 16.0,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
Mp3Compression={'min_bitrate': 8,
'max_bitrate': 8,
'backend': 'pydub',
'p': 1},
MelSpectrogram={'sample_rate': 16000},
Normalize={'p': 1},
Padding={'mode': 'silence',
'min_fraction': 0.02,
'max_fraction': 0.8,
'pad_section': 'start',
'p': 1},
PeakNormalization={'p': 1,
'sample_rate': sampling_rate},
PeakingFilter={'min_center_freq': 51,
'max_center_freq': 7400,
'min_gain_db': -22,
'max_gain_db': 22,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
PitchShift={'sample_rate': 16000,
'n_steps': [1, 1.5, 0.1],
'bins_per_octave': 12,
'n_fft': 512,
'win_length': 512,
'hop_length': 512//4,
'p': 1.0},
PolarityInversion={'p': 1,
'sample_rate': sampling_rate},
TimeInversion={'p': 1,
'sample_rate': sampling_rate},
ApplyRIR
# Use this to see available materials you can use as walls_mat, floor_mat and ceiling_mat argument
# from AudioAugmentor import rir_setup
# rir_setup.get_all_materials_info()
# Use these parameters when you want to generate a room with random shape and size
rir_kwargs = {
'audio_sample_rate': 16000,
'x_range': (0, 100),
'y_range': (0, 100),
'num_vertices_range': (3, 6),
'mic_height': 1.5,
'source_height': 1.5,
'walls_mat': 'curtains_cotton_0.5',
'room_height': 2.0,
'max_order': 3,
'floor_mat': 'carpet_cotton',
'ceiling_mat': 'hard_surface',
'ray_tracing': True,
'air_absorption': True,
}
# Use these parameters when you want to generate a specific room
# (this dict overwrites the random-room one above; define only the one you need)
rir_kwargs = {
'audio_sample_rate': 16000,
'corners_coord': [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]],
'walls_mat': 'curtains_cotton_0.5',
'room_height': 2.0,
'max_order': 3,
'floor_mat': 'carpet_cotton',
'ceiling_mat': 'hard_surface',
'ray_tracing': True,
'air_absorption': True,
'source_coord': [[1.0], [1.0], [0.5]],
'microphones_coord': [[3.5], [2.0], [0.5]],
}
transformations = transf_gen.transf_gen(verbose=True,
ApplyRIR=rir_kwargs,
)
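The generated RIR transformation is then applied like any other, e.g. through AugmentWaveform exactly as in step 3b (the file path is the example one used earlier):
# Apply the room impulse response like any other transformation
augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None,
    sample_rate=16000, verbose=False,
)
signal, fs = torchaudio.load('../data/test.wav')
display(Audio(augment(signal.numpy()[0]), rate=fs))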
SevenBandParametricEQ={'min_gain_db': -10,
'max_gain_db': 10,
'p': 1},
Shift={'min_shift': 1,
'max_shift': 2,
'p': 1,
'sample_rate': sampling_rate},
Speed={'orig_freq': 16000,
'factor': [0.9, 1.5, 0.1],
'p': 1},
Spectrogram={'sample_rate': 16000},
TanhDistortion={'min_distortion': 0.1,
'max_distortion': 0.8,
'p': 1},
TimeMasking={'time_mask_param': 80},
TimeStretch='''min_rate=0.9,
max_rate=1.1,
p=0.2,
leave_length_unchanged=False''',
Codecs using torchaudio
You can select just one. No need to use them all. :)
transformations = transf_gen.transf_gen(verbose=True,
ac3=True,
adpcm_ima_wav=True,
adpcm_ms=True,
adpcm_yamaha=True,
eac3=True,
flac=True,
libmp3lame=True,
mp2=True,
pcm_alaw=True,
pcm_f32le=True,
pcm_mulaw=True,
pcm_s16le=True,
pcm_s24le=True,
pcm_s32le=True,
pcm_u8=True,
wmav1=True,
wmav2=True,
)
g726
g726={'audio_bitrate': '40k'},
gsm
gsm=True,
amr
amr={'audio_bitrate': '4.75k'},
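Codec entries are passed to transf_gen.transf_gen just like the augmentations above, and the result is applied the same way. A short sketch reusing the AugmentWaveform pattern from step 3b:
# Simulate a low-bitrate AMR phone channel on a single waveform
transformations = transf_gen.transf_gen(verbose=True,
    amr={'audio_bitrate': '4.75k'},
)
augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None,
    sample_rate=16000, verbose=False,
)
signal, fs = torchaudio.load('../data/test.wav')
display(Audio(augment(signal.numpy()[0]), rate=fs))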