Shouting Voice Detection

Frame-level shouting detection for music or speech audio using librosa for feature extraction and PyTorch Lightning for training. The project targets Apple Silicon (MPS) but works on CPU-only setups as well.

Features

  • 16 kHz resampling and mel-spectrogram features via ShoutingVoiceFrameDataset.
  • Lightweight CNN LightningModule (ShoutingVoiceFrameCNN) with BCE loss and accuracy logging.
  • Training CLI (app/train.py) with configurable frame settings, hyperparameters, and deterministic splits.
  • Inference CLI (app/predict.py) that loads checkpoints and exports [time, probability] arrays.
  • Visualization CLI (app/visualize.py) overlaying shouting spans on the waveform.
  • Unit tests covering dataset behavior and CNN forward/training steps.
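
The dataset and CNN above are ordinary PyTorch/Lightning components, so they can also be driven from a script or notebook. Below is a minimal sketch of wiring them together, assuming the module paths shown in Project Layout and constructor arguments that mirror the CLI flags documented later; these are illustrative assumptions, not the verbatim signatures in app/model/.

import pytorch_lightning as pl
from torch.utils.data import DataLoader

from app.model.dataset import ShoutingVoiceFrameDataset
from app.model.model import ShoutingVoiceFrameCNN

# Frame settings must match between training and later inference.
dataset = ShoutingVoiceFrameDataset(
    labels_csv="app/data/labels.csv",  # interval labels, see Data Requirements
    audio_dir="app/data/audio",        # WAV clips, resampled to 16 kHz internally
    frame_duration=1.0,                # seconds of audio per frame
    hop_duration=0.5,                  # stride between consecutive frames
)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

model = ShoutingVoiceFrameCNN()        # logs BCE loss and accuracy during training
trainer = pl.Trainer(max_epochs=5, accelerator="auto")
trainer.fit(model, train_dataloaders=loader)

app/train.py performs the same steps, with deterministic train/validation splits and checkpointing handled by Lightning.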

Project Layout

.
├─ app/
│  ├─ data/                # sample WAV + labels
│  ├─ model/
│  │  ├─ dataset.py        # ShoutingVoiceFrameDataset implementation
│  │  └─ model.py          # ShoutingVoiceFrameCNN LightningModule
│  ├─ train.py             # training entry point
│  ├─ predict.py           # checkpoint-driven inference
│  └─ visualize.py         # waveform + prediction overlay
├─ tests/                  # pytest suite
├─ docker/                 # Dockerfile for the CPU-only image
├─ environment.yml         # conda environment (Apple Silicon-friendly)
├─ requirements.txt        # pip-based dependency lock
├─ Makefile                # setup/format/lint/test helpers
├─ IMPLEMENTATION_PLAN.md  # progress checklist
└─ AGENTS.md               # contributor guidelines

Environment Setup

Option 1: Conda (Recommended for Apple Silicon)

git clone <repo-url>
cd ShoutingVoiceDetection
conda env create -f environment.yml
conda activate svd
python -m pip install --upgrade pip

Option 2: Virtualenv via Makefile

git clone <repo-url>
cd ShoutingVoiceDetection
make setup
source .venv/bin/activate  # after setup completes

Both paths install PyTorch, PyTorch Lightning, librosa, matplotlib, pytest, black, and ruff. The Makefile targets:

  • make format runs black app tests
  • make lint runs ruff check app tests
  • make test runs pytest (PYTHONPATH configured via pytest.ini)

Option 3: Docker (Reproducible Everywhere)

The repo ships with docker/Dockerfile, which creates a slim CPU-only image that already contains Python, system audio libraries, and every dependency from requirements.txt. Use it when you want a guaranteed-clean environment or to run training in CI without managing conda.

Build once from the repo root:

docker build -t svd:cpu -f docker/Dockerfile .

Kick off training inside the container (all flags pass through to app.train):

docker run --rm svd:cpu --max_epochs=5 --batch_size=8

Need an interactive shell for debugging? Override the entrypoint:

docker run -it --entrypoint /bin/bash svd:cpu

Data Requirements

Place short WAV clips under app/data/audio/ and create app/data/labels.csv with:

file,start,end,label
example.wav,0.0,2.0,shouting
example.wav,2.0,4.0,non_vocal

Times are in seconds; audio is resampled to 16 kHz during loading. Labels accept string values such as shouting and non_vocal, or numeric 0/1.
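
If you assemble labels.csv by hand or from another tool, a quick standard-library sanity check like the sketch below can catch malformed rows before training. The label-to-class mapping shown is an assumption; the authoritative normalization lives in ShoutingVoiceFrameDataset.

import csv

POSITIVE = {"shouting", "1"}           # string or numeric labels mapped to class 1

with open("app/data/labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        start, end = float(row["start"]), float(row["end"])
        assert 0.0 <= start < end, f"bad interval in {row['file']}: {start}-{end}"
        is_shouting = row["label"].strip().lower() in POSITIVE
        print(f"{row['file']}: {start:.1f}-{end:.1f}s -> {int(is_shouting)}")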

Dataset Diagnostics

  • Inspect class balance with the same frame settings you use for training:
    python -m app.utils.report_class_balance \
      --labels_csv app/data/labels.csv \
      --audio_dir app/data/audio \
      --frame_duration 1.0 \
      --hop_duration 0.5
    This prints how many positive vs. negative frames exist overall and per file.
  • Visualize the ground-truth spans from labels.csv on top of the waveform:
    python -m app.utils.plot_labels \
      --audio app/data/audio/example.wav \
      --labels_csv app/data/labels.csv \
      --output outputs/example_labels.png
    The plot highlights shouting intervals (red) and non-vocal intervals (green).

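Both utilities walk the same frame grid used for training. Conceptually, a frame counts as positive when its span overlaps a labeled shouting interval; the sketch below illustrates that idea, although the exact overlap rule inside app.utils.report_class_balance may differ (for example, majority overlap instead of any overlap).

def frame_starts(total_duration, frame_duration=1.0, hop_duration=0.5):
    """Start times of all frames that fit fully inside the clip."""
    starts, t = [], 0.0
    while t + frame_duration <= total_duration + 1e-9:
        starts.append(t)
        t += hop_duration
    return starts

def overlaps_shouting(start, frame_duration, shouting_spans):
    end = start + frame_duration
    return any(s < end and e > start for s, e in shouting_spans)

shouting_spans = [(0.0, 2.0)]                    # shouting intervals from labels.csv
starts = frame_starts(total_duration=4.0)        # 4-second clip
positives = sum(overlaps_shouting(t, 1.0, shouting_spans) for t in starts)
print(f"{positives} positive / {len(starts) - positives} negative frames")
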
Training

Run on CPU or Apple MPS:

python -m app.train \
  --labels_csv app/data/labels.csv \
  --audio_dir app/data/audio \
  --batch_size 4 \
  --max_epochs 5 \
  --frame_duration 1.0 \
  --hop_duration 0.5

Key flags:

  • --default_root_dir <dir> (Lightning) if you want checkpoints somewhere other than lightning_logs/svd/.
  • --sample_rate, --n_mels, --n_fft, --spec_hop_length to tweak the feature extractor.
  • --num_workers for DataLoaders (set >0 when running outside notebooks).
  • --log_dir to relocate TensorBoard events and checkpoints (default lightning_logs).

Lightning checkpoints land under lightning_logs/svd/.../checkpoints/epoch=*-step=*.ckpt. Copy or symlink a checkpoint to a stable location (e.g., checkpoints/last.ckpt) for inference.
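
The following convenience sketch (not part of the repo) locates the newest checkpoint under the default log directory and copies it to checkpoints/last.ckpt; adjust the glob if you changed --log_dir or --default_root_dir.

import shutil
from pathlib import Path

ckpts = sorted(
    Path("lightning_logs/svd").glob("**/checkpoints/*.ckpt"),
    key=lambda p: p.stat().st_mtime,    # newest checkpoint last
)
if ckpts:
    Path("checkpoints").mkdir(exist_ok=True)
    shutil.copy(ckpts[-1], "checkpoints/last.ckpt")
    print(f"copied {ckpts[-1]} -> checkpoints/last.ckpt")
else:
    print("no checkpoints found under lightning_logs/svd/")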

Visualize Training with TensorBoard

TensorBoard is included in requirements.txt/environment.yml. After any training run, Lightning writes logs under lightning_logs/svd/. Launch TensorBoard from the repo root:

tensorboard --logdir lightning_logs --port 6006

Open http://localhost:6006 to inspect loss curves, metrics, and learning-rate schedules across runs.

Inference

Generate frame probabilities for any WAV:

python -m app.predict \
  checkpoints/last.ckpt \
  app/data/audio/example.wav \
  --output outputs/example_preds.npy \
  --frame_duration 1.0 \
  --hop_duration 0.5 \
  --threshold 0.6

Outputs a NumPy array with shape (num_frames, 2) containing [start_time_sec, probability].
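
To turn the per-frame probabilities into readable shouting spans, you can post-process the saved array as in the sketch below. The thresholding and merge rule here are illustrative assumptions, not necessarily what app.visualize does internally.

import numpy as np

preds = np.load("outputs/example_preds.npy")     # columns: [start_time_sec, probability]
threshold, frame_duration = 0.6, 1.0             # must match the prediction settings

# One candidate interval per above-threshold frame ...
hits = [[t, t + frame_duration] for t, p in preds if p >= threshold]

# ... merged into contiguous spans wherever frames touch or overlap.
spans = []
for start, end in hits:
    if spans and start <= spans[-1][1]:
        spans[-1][1] = max(spans[-1][1], end)
    else:
        spans.append([start, end])

for s, e in spans:
    print(f"shouting from {s:.2f}s to {e:.2f}s")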

Visualization

Overlay shouting spans on the waveform using the saved predictions:

python -m app.visualize \
  --audio app/data/audio/example.wav \
  --predictions outputs/example_preds.npy \
  --threshold 0.6 \
  --output outputs/example_plot.png

If --output is omitted, the plot displays interactively.
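
The same overlay can be reproduced in a few lines of matplotlib if you need a custom plot. In the sketch below, the colors, figure size, and the fixed 1.0 s frame duration are assumptions chosen to match the commands above.

import librosa
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("app/data/audio/example.wav", sr=16000)  # 16 kHz, mono
preds = np.load("outputs/example_preds.npy")

t = np.arange(len(y)) / sr
plt.figure(figsize=(10, 3))
plt.plot(t, y, linewidth=0.5)
for start, prob in preds:
    if prob >= 0.6:                                  # same threshold as above
        plt.axvspan(start, start + 1.0, color="red", alpha=0.2)
plt.xlabel("time (s)")
plt.title("example.wav with predicted shouting frames")
plt.tight_layout()
plt.savefig("outputs/example_plot_sketch.png")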

Example Output

[Shouting voice visualization: waveform with predicted shouting spans highlighted, as produced by app.visualize]

Testing & Quality

  • pytest tests/model -q validates dataset and model components.
  • make format / make lint keep code style consistent (black + ruff).
  • For coverage-oriented runs: pytest --cov=app --cov-report=term-missing.

Implementation Progress

Track ongoing work in IMPLEMENTATION_PLAN.md. Major milestones already complete:

  1. Environment setup (conda + Makefile).
  2. Repository skeleton and sample data.
  3. Dataset/model implementations with unit tests.
  4. Training CLI and smoke test.
  5. Inference + visualization pipeline.

Remaining tasks include README screenshots/examples and CI hooks.

Contributing

See AGENTS.md for contributor expectations:

  • Use 4-space indentation, snake_case, PascalCase classes.
  • Run make format lint test before opening a PR.
  • Keep commits scoped (feat:, fix:, etc.) and link issues with Closes #<id>.
  • Do not commit large audio datasets or secrets; store them outside git-tracked paths.

Troubleshooting

  • ModuleNotFoundError: app → ensure you run commands via python -m app.train or set PYTHONPATH=$(pwd).
  • MPS/Metal errors → rerun with --accelerator cpu or set PYTORCH_ENABLE_MPS_FALLBACK=1.
  • librosa import issues → confirm the active environment is the one you created via conda/Make.

For more background, refer to shouting_voice_detection_tutorial.md, which mirrors the end-to-end workflow described above. Happy experimenting!
