A novel Deep Learning approach adapting Image Segmentation architectures for robust Audio Classification.
UNetropolis represents a paradigm shift in audio classification. Instead of traditional 1D signal processing, this project treats sound as a visual texture. By converting audio into Mel-Spectrograms, we leverage the powerful spatial feature extraction capabilities of Computer Vision architectures.
Specifically, we utilize a Double-UNet architecture, stacking two U-Net (Encoder-Decoder) models sequentially, to capture both high-level temporal patterns and fine-grained spectral details. The output is refined through a Global Average Pooling classification head to distinguish between 10 complex urban sound categories.
- Hybrid Double-UNet Architecture: Two stacked U-Nets providing deep feature refinement, transitioning from segmentation-style maps to classification vectors.
- Visual Audio Processing: Converts raw waveforms into 128-band Mel-Spectrograms (Log-Scale).
- Dynamic Data Augmentation: Implements Mask Overlay Augmentation, mixing random noise samples with the target audio on the fly to simulate real-world overlapping sounds (e.g., a dog barking over traffic); see the sketch after this list.
- Robust Training Pipeline: Includes Learning Rate Schedulers, Early Stopping, and Model Checkpointing.
- Self-Contained Executable Script: The main script handles everything from directory setup and custom data generator creation (`src/datagen.py`) to training and evaluation.
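As referenced above, the preprocessing and augmentation steps can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names (`load_clip`, `overlay`, `log_mel`), the sample rate, and the mixing ratio are assumptions; only the 128 Mel bands and the 4-second clip length come from the description above.

```python
import numpy as np
import librosa

SR = 22050          # sample rate (assumed; not specified in the script)
N_MELS = 128        # Mel bands, matching the 128-band spectrograms described above
CLIP_SECONDS = 4.0  # UrbanSound8K excerpts are at most 4 s long

def load_clip(path, sr=SR, duration=CLIP_SECONDS):
    """Load an audio file and pad/trim it to a fixed length in samples."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    return librosa.util.fix_length(y, size=int(sr * duration))

def overlay(target, noise, alpha=0.3):
    """Mask-overlay style augmentation: mix a noise clip into the target waveform."""
    return (1.0 - alpha) * target + alpha * noise

def log_mel(y, sr=SR, n_mels=N_MELS):
    """Convert a waveform into a log-scaled (dB) Mel-spectrogram of shape (n_mels, time)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# Hypothetical usage: a dog bark overlaid on street music.
# bark = load_clip("dog_bark.wav")
# music = load_clip("street_music.wav")
# spec = log_mel(overlay(bark, music, alpha=np.random.uniform(0.1, 0.5)))
```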
The model treats the spectrogram as a 2D image.
graph LR
Input["Mel-Spectogram<br>(128 x Time x 1)"] --> U1["UNet Block 1<br>(Encoder-Decoder)"]
U1 --> U2["UNet Block 2<br>(Refinement)"]
U2 --> GAP[Global Avg Pooling]
GAP --> Dense["Dense Prediction Layer<br>(Softmax)"]
Dense --> Output[10 Class Probabilities]
- Input: Log-Mel Spectrograms.
- UNet block 1: Extracts coarse spectral features.
- UNet block 2: Refines features, focusing on subtle differences between similar sounds (e.g., Siren vs Street Music).
- Head: Global Average Pooling condenses the feature map into a vector, and a Dense Softmax layer maps it to the 10 classes.
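The diagram and components above can be approximated with the following Keras sketch. It is a rough illustration under assumed filter counts, depths, and a fixed 128-frame time axis; the actual model in the script may differ in layer sizes, skip-connection handling, and regularization.

```python
from tensorflow.keras import layers, Model

def unet_block(x, filters=(32, 64)):
    """A compact U-Net (encoder-decoder with skip connections) over 2D feature maps."""
    skips = []
    # Encoder: convolve, remember the skip, downsample.
    for f in filters:
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Bottleneck.
    x = layers.Conv2D(filters[-1] * 2, 3, padding="same", activation="relu")(x)
    # Decoder: upsample, concatenate the matching skip, convolve.
    for f, skip in zip(reversed(filters), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
    return x

def build_unetropolis(input_shape=(128, 128, 1), num_classes=10):
    """Two stacked U-Net blocks followed by a Global Average Pooling classification head."""
    inputs = layers.Input(shape=input_shape)
    x = unet_block(inputs)   # UNet Block 1: coarse spectral features
    x = unet_block(x)        # UNet Block 2: refinement of subtle spectral differences
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="unetropolis_sketch")

model = build_unetropolis()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```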
The model is designed for the UrbanSound8K dataset, consisting of 8,732 labeled sound excerpts (<= 4s) across 10 classes:
| ID | Class Name | ID | Class Name |
|---|---|---|---|
| 0 | Air Conditioner | 5 | Engine Idling |
| 1 | Car Horn | 6 | Gun Shot |
| 2 | Children Playing | 7 | Jackhammer |
| 3 | Dog Bark | 8 | Siren |
| 4 | Drilling | 9 | Street Music |
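The dataset's metadata CSV maps each excerpt to the `classID` values in the table above. Below is a minimal loading sketch, assuming the official `metadata/UrbanSound8K.csv` layout; the path shown is illustrative and should be adjusted to your local copy (see the setup notes further down).

```python
import pandas as pd

# Illustrative path; point this at wherever your UrbanSound8K metadata CSV lives.
metadata = pd.read_csv("../input/urbansound8k/UrbanSound8K.csv")

# Each row records the clip file name, its fold (1-10), and its numeric/string class.
print(metadata[["slice_file_name", "fold", "classID", "class"]].head())
print(metadata["class"].value_counts())
```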
Ensure you have Python 3.7+ and the following libraries installed:
`pip install tensorflow numpy pandas librosa matplotlib scikit-learn ipython`

- TensorFlow/Keras: Deep Learning backend.
- Librosa: Audio processing and spectral conversion.
- Pandas/NumPy: Data manipulation.
- Matplotlib: Visualization of waveforms and training history.
This project is designed to be plug-and-play. The main script automatically sets up its own environment structure.
- Clone the Repository
- Prepare Data: Ensure the `UrbanSound8K` dataset is available.
  - Note: The script defaults to the Kaggle input path `../input/urbansound8k`. You may need to adjust the `audio_dir` variable in the script if running locally.
- Run the Script:
  `python UNetropolis_Improved_Hybrid_UNet_for_Urban_Sound_with_improvement.py`
What happens next?
- The script creates a `src/` directory and generates `datagen.py`.
- It loads metadata and visualizes random samples.
- Training begins (default 20 epochs); see the training sketch after this list.
- Training history (Accuracy/Loss) and a Confusion Matrix are plotted.
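The training pipeline features listed earlier (learning rate scheduling, early stopping, checkpointing) and the 20-epoch default could be wired up roughly as follows; the callback thresholds and checkpoint file name are illustrative assumptions, not values taken from the script.

```python
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

callbacks = [
    # Halve the learning rate when validation loss plateaus.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6),
    # Stop early if validation loss stops improving, restoring the best weights.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Keep the best model on disk (file name is illustrative).
    ModelCheckpoint("unetropolis_best.keras", monitor="val_accuracy", save_best_only=True),
]

# Assuming `model` is the compiled network from the architecture sketch above and
# `train_gen` / `val_gen` are the generators produced by src/datagen.py:
# history = model.fit(train_gen, validation_data=val_gen, epochs=20, callbacks=callbacks)
```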
The script utilizes matplotlib to provide rich insights during execution:
- Waveform Analysis: Raw amplitude visualization of inputs.
- Spectrograms: Heatmaps showing frequency intensity over time.
- Confusion Matrix: A detailed breakdown of classification performance (True Label vs. Predicted Label).
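The confusion matrix, for example, can be rendered with scikit-learn and matplotlib along these lines; `y_true` and `y_pred` are placeholders for the integer test labels and the argmax of the model's softmax outputs.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

CLASS_NAMES = ["air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
               "engine_idling", "gun_shot", "jackhammer", "siren", "street_music"]

# y_true: integer class IDs from the evaluation set; y_pred: argmax of the softmax predictions.
# ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=CLASS_NAMES,
#                                         xticks_rotation=45, cmap="Blues")
# plt.tight_layout()
# plt.show()
```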
Sachin Paunikar
This project serves as a Proof of Concept (PoC) for applying advanced Computer Vision techniques to the domain of Audio Signal Processing.