This library is designed to augment audio data for machine learning purposes. It combines several tools and libraries for audio data augmentation and provides a unified interface through which a large set of audio augmentations can be applied in one place.
The library is designed to be used with the PyTorch machine learning framework, but it can also operate directly on plain audio waveforms or augment entire local audio datasets.
This library combines the following libraries and tools:
The table below shows which library is used to apply each audio augmentation/codec.
For a more complex example, see the example Colab notebook above, or the Jupyter notebook AudioAugmentor_Usage_Example.ipynb in the examples directory of this repository.
Note: AudioAugmentor was mainly tested with Python 3.11.8 on Fedora 38 (Google Colab uses Python 3.10 and Ubuntu).
0. You need to install the library and necessary packages first
!!!You may need to run the following commands with sudo!!!
If so, install these packages manually in a terminal.
pip install -U pip
pip install AudioAugmentor
dnf install -y sox # FEDORA
dnf install -y sox-devel # FEDORA
dnf install -y ffmpeg # FEDORA
# apt-get install -y sox # UBUNTU
# apt-get install -y libsox-dev # UBUNTU
# apt-get install -y ffmpeg # UBUNTU
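To quickly check that the system tools are actually on your PATH before continuing, you can run a minimal sketch like this (standard library only):
import shutil

# Verify that the system tools used by the SoX effects and codecs are installed
for tool in ('sox', 'ffmpeg'):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")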
1. Import necessary libraries
import torch
import torchaudio
import numpy as np
import audiomentations as AA
from IPython.display import Audio, display
from AudioAugmentor import transf_gen
from AudioAugmentor import sox_parser
from AudioAugmentor import core
from AudioAugmentor import rir_setup
from AudioAugmentor import torchaudio_transf_wrapper as TTW
2. Define the augmentations you want to apply to your audio data.
You have three options for defining the augmentations:
a) Use the transf_gen.transf_gen function to generate a list of transformations.
See the supported-transformations table and the examples of every augmentation so you know which parameters each augmentation method requires.
You can enter augmentation parameters as a string or as a dictionary.
PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'
PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}
sampling_rate = 16000  # sample rate of the audio being augmented

transformations = transf_gen.transf_gen(verbose=True,
    PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0',
    Speed={'orig_freq': 16000, 'factor': [0.9, 1.5, 0.1], 'p': 1},
    LowPassFilter={'min_cutoff_freq': 700, 'max_cutoff_freq': 800, 'sample_rate': sampling_rate, 'p': 1},
)
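The list parameters above (e.g. n_steps=[1, 1.5, 0.1]) follow a [min, max, step] convention. The snippet below only illustrates that convention; it is an assumption based on the examples in this README, not the library's internal code:
import random

# Illustration of the assumed [min, max, step] range convention:
# one value from the grid would be used per augmentation call.
low, high, step = [1, 1.5, 0.1]
grid = np.arange(low, high + step / 2, step)  # [1.0, 1.1, ..., 1.5]
print(random.choice(list(grid)))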
b) Use a pseudo SoX command. The SoX command must be in this format:
--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
(when you do not want to apply a codec after the SoX effects)
OR
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
(when you want to apply a codec after the SoX effects; the codec is entered as codec_name codec_parameter_name codec_parameter_value directly after the SoX effects command)
example_sox = '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k'
This command string is later passed as the sox_effects argument (see step 3).
c) Use a file with multiple pseudo SoX commands. A random SoX command from this file will be chosen and applied to your data.
The file must be loaded using the sox_parser.load_sox_file function.
sox_file_content_to_write = '''--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
#--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
'''
with open('sox_file_example.txt', 'w') as f:
f.write(sox_file_content_to_write)
sox_file_content = sox_parser.load_sox_file('sox_file_example.txt')
print('SOX FILE LOADED:', sox_file_content, type(sox_file_content))
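When the loaded file content is passed to the library, one command is chosen at random per audio sample. Conceptually (a sketch of the behavior, assuming load_sox_file returns a list of command strings; this is not the library's internal implementation):
import random

# Conceptual sketch: one pseudo SoX command is drawn per sample
print('Would apply:', random.choice(sox_file_content))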
3. Apply augmentations
a) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the Collator class.
Use the initialized class as the collate_fn argument of PyTorch's DataLoader.
collate_fn = core.Collator(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=sampling_rate, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=False,
)
dataset = torchaudio.datasets.LIBRISPEECH("../data", url="train-clean-100", download=True)
aug_dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=1,
num_workers=0,
collate_fn=collate_fn,
)
augmented_record_from_dataset = next(iter(aug_dataloader))
display(Audio(augmented_record_from_dataset[0].squeeze(0).squeeze(0).squeeze(0).cpu(), rate=sampling_rate))
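To keep a few augmented samples for listening tests, you can iterate the dataloader and write the waveforms out with torchaudio.save. A minimal sketch, assuming the first element of each batch is the waveform tensor as in the LIBRISPEECH example above:
import itertools

# Save the first three augmented batches to disk for inspection
for i, batch in enumerate(itertools.islice(aug_dataloader, 3)):
    waveform = batch[0].reshape(1, -1).cpu()  # torchaudio.save expects [channels, frames]
    torchaudio.save(f'augmented_{i}.wav', waveform, sampling_rate)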
OR
b) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentWaveform class, then apply the augmentations to an audio signal.
augment = core.AugmentWaveform(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
# Load test wav file
signal, fs = torchaudio.load('../data/test.wav')
# Apply transformations
waveform = augment(signal.numpy()[0])
display(Audio(waveform, rate=fs))
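If you want to keep the result, convert it back to a tensor and save it (assuming augment returns a 1-D NumPy array, as the display call above suggests):
# Save the augmented waveform back to disk
torchaudio.save('../data/test_augmented.wav',
                torch.from_numpy(waveform).unsqueeze(0),  # shape [1, frames]
                fs)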
c) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentLocalAudioDataset class, then apply the augmentations to a local audio dataset.
augment = core.AugmentLocalAudioDataset(
transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
#transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
augment(input_dir='../data/test-input-folder', output_dir='../data/test-output-folder')
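A quick way to confirm that the augmented files were written (the folder name is just the example one used above):
from pathlib import Path

# List the files produced in the output folder
for f in sorted(Path('../data/test-output-folder').iterdir()):
    print(f.name)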
!!!Pass the following examples as arguments to the transf_gen.transf_gen function to generate a list of transformations!!!
Like this:
transformations = transf_gen.transf_gen(verbose=True,
AddBackgroundNoise=f'background_paths="../data/musan/noise/free-sound", min_snr_in_db=10, max_snr_in_db=20, p=1, sample_rate={sampling_rate}',
AddColoredNoise=f'min_snr_in_db=9, max_snr_in_db=10, p=1, sample_rate={sampling_rate}',
)
You can enter augmentation parameters as a string or as a dictionary.
PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'
PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}
AddBackgroundNoise=f'''background_paths="../data/musan/noise/free-sound",
min_snr_in_db=10,
max_snr_in_db=20,
p=1,
sample_rate={sampling_rate}''',
AddColoredNoise=f'''min_snr_in_db=9,
max_snr_in_db=10,
p=1,
sample_rate={sampling_rate}''',
AddGaussianNoise={'min_amplitude': 0.001,
'max_amplitude': 0.015,
'p': 1},
AddShortNoises={'sounds_path': "../data/musan/noise/free-sound",
'min_snr_in_db': 3.0,
'max_snr_in_db': 30.0,
'noise_rms': "relative_to_whole_input",
'min_time_between_sounds': 2.0,
'max_time_between_sounds': 8.0,
'noise_transform': AA.PolarityInversion(),
'p': 1.0},
AdjustDuration={'duration_seconds': 3.5,
'padding_mode': 'silence',
'p': 1},
AirAbsorption={'min_distance': 10.0,
'max_distance': 50.0,
'min_humidity': 80.0,
'max_humidity': 90.0,
'min_temperature': 10.0,
'max_temperature': 20.0,
'p': 1.0},
ApplyImpulseResponse=f'''ir_paths="../data/Rir.wav",
p=1,
sample_rate={sampling_rate}''',
BandPassFilter=f'''min_center_frequency=200,
max_center_frequency=4000,
min_bandwidth_fraction=0.5,
max_bandwidth_fraction=1.99,
sample_rate={sampling_rate},
p=1''',
BandStopFilter=f'''min_center_frequency=200,
max_center_frequency=4000,
min_bandwidth_fraction=0.5,
max_bandwidth_fraction=1.99,
sample_rate={sampling_rate},
p=1''',
ClippingDistortion={'min_percentile_threshold': 10,
'max_percentile_threshold': 30,
'p': 1},
FrequencyMasking={'freq_mask_param': 80},
Vol={'gain': [2.5, 3, 0.1],
'p': 1.0},
GainTransition={'min_gain_db': 30,
'max_gain_db': 40,
'min_duration': 5,
'max_duration': 16,
'duration_unit': 'seconds',
'p': 1},
HighPassFilter=f'''min_cutoff_freq=700,
max_cutoff_freq=800,
sample_rate={sampling_rate},
p=1''',
HighShelfFilter={'min_center_freq': 2000,
'max_center_freq': 5000,
'min_gain_db': 10.0,
'max_gain_db': 16.0,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
Limiter='''min_threshold_db=-24,
max_threshold_db=-2,
min_attack=0.0005,
max_attack=0.025,
min_release=0.05,
max_release=0.7,
threshold_mode="relative_to_signal_peak",
p=1''',
LoudnessNormalization={'min_lufs': -31,
'max_lufs': -13,
'p': 1},
LowPassFilter={'min_cutoff_freq': 700,
'max_cutoff_freq': 800,
'sample_rate': sampling_rate,
'p': 1},
LowShelfFilter={'min_center_freq': 20,
'max_center_freq': 600,
'min_gain_db': -16.0,
'max_gain_db': 16.0,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
Mp3Compression={'min_bitrate': 8,
'max_bitrate': 8,
'backend': 'pydub',
'p': 1},
MelSpectrogram={'sample_rate': 16000},
Normalize={'p': 1},
Padding={'mode': 'silence',
'min_fraction': 0.02,
'max_fraction': 0.8,
'pad_section': 'start',
'p': 1},
PeakNormalization={'p': 1,
'sample_rate': sampling_rate},
PeakingFilter={'min_center_freq': 51,
'max_center_freq': 7400,
'min_gain_db': -22,
'max_gain_db': 22,
'min_q': 0.5,
'max_q': 1.0,
'p': 1},
PitchShift={'sample_rate': 16000,
'n_steps': [1, 1.5, 0.1],
'bins_per_octave': 12,
'n_fft': 512,
'win_length': 512,
'hop_length': 512//4,
'p': 1.0},
PolarityInversion={'p': 1,
'sample_rate': sampling_rate},
TimeInversion={'p': 1,
'sample_rate': sampling_rate},
ApplyRIR
# Use this to see available materials you can use as walls_mat, floor_mat and ceiling_mat argument
# from AudioAugmentor import rir_setup
# rir_setup.get_all_materials_info()
# Use these parameters when you want to generate a room with random shape and size
rir_kwargs = {
'audio_sample_rate': 16000,
'x_range': (0, 100),
'y_range': (0, 100),
'num_vertices_range': (3, 6),
'mic_height': 1.5,
'source_height': 1.5,
'walls_mat': 'curtains_cotton_0.5',
'room_height': 2.0,
'max_order': 3,
'floor_mat': 'carpet_cotton',
'ceiling_mat': 'hard_surface',
'ray_tracing': True,
'air_absorption': True,
}
# Use these parameters when you want to generate a specific room
# (this dict overwrites the random-room one above; define only the one you need)
rir_kwargs = {
'audio_sample_rate': 16000,
'corners_coord': [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]],
'walls_mat': 'curtains_cotton_0.5',
'room_height': 2.0,
'max_order': 3,
'floor_mat': 'carpet_cotton',
'ceiling_mat': 'hard_surface',
'ray_tracing': True,
'air_absorption': True,
'source_coord': [[1.0], [1.0], [0.5]],
'microphones_coord': [[3.5], [2.0], [0.5]],
}
transformations = transf_gen.transf_gen(verbose=True,
ApplyRIR=rir_kwargs,
)
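The generated RIR transformation is then applied like any other, e.g. through AugmentWaveform exactly as in step 3b (the file path is the example one used earlier):
# Apply the room impulse response like any other transformation
augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None,
    sample_rate=16000, verbose=False,
)
signal, fs = torchaudio.load('../data/test.wav')
display(Audio(augment(signal.numpy()[0]), rate=fs))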
SevenBandParametricEQ={'min_gain_db': -10,
'max_gain_db': 10,
'p': 1},
Shift={'min_shift': 1,
'max_shift': 2,
'p': 1,
'sample_rate': sampling_rate},
Speed={'orig_freq': 16000,
'factor': [0.9, 1.5, 0.1],
'p': 1},
Spectrogram={'sample_rate': 16000},
TanhDistortion={'min_distortion': 0.1,
'max_distortion': 0.8,
'p': 1},
TimeMasking={'time_mask_param': 80},
TimeStretch='''min_rate=0.9,
max_rate=1.1,
p=0.2,
leave_length_unchanged=False''',
Codecs using torchaudio
You can select just one. No need to use them all. :)
transformations = transf_gen.transf_gen(verbose=True,
ac3=True,
adpcm_ima_wav=True,
adpcm_ms=True,
adpcm_yamaha=True,
eac3=True,
flac=True,
libmp3lame=True,
mp2=True,
pcm_alaw=True,
pcm_f32le=True,
pcm_mulaw=True,
pcm_s16le=True,
pcm_s24le=True,
pcm_s32le=True,
pcm_u8=True,
wmav1=True,
wmav2=True,
)
g726
g726={'audio_bitrate': '40k'},
gsm
gsm=True,
amr
amr={'audio_bitrate': '4.75k'},
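Codec entries are passed to transf_gen.transf_gen just like the augmentations above, and the result is applied the same way. A short sketch reusing the AugmentWaveform pattern from step 3b:
# Simulate a low-bitrate AMR phone channel on a single waveform
transformations = transf_gen.transf_gen(verbose=True,
    amr={'audio_bitrate': '4.75k'},
)
augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None,
    sample_rate=16000, verbose=False,
)
signal, fs = torchaudio.load('../data/test.wav')
display(Audio(augment(signal.numpy()[0]), rate=fs))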