Frame-level shouting detection for music or speech audio using librosa for feature extraction and PyTorch Lightning for training. The project targets Apple Silicon (MPS) but works on CPU-only setups as well.
- 16 kHz resampling and mel-spectrogram features via `ShoutingVoiceFrameDataset` (see the feature-extraction sketch after this list).
- Lightweight CNN LightningModule (`ShoutingVoiceFrameCNN`) with BCE loss and accuracy logging.
- Training CLI (`app/train.py`) with configurable frame settings, hyperparameters, and deterministic splits.
- Inference CLI (`app/predict.py`) that loads checkpoints and exports `[time, probability]` arrays.
- Visualization CLI (`app/visualize.py`) overlaying shouting spans on the waveform.
- Unit tests covering dataset behavior and CNN forward/training steps.
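The actual preprocessing lives in `app/model/dataset.py`; as a rough illustration of the kind of per-frame features involved, here is a minimal librosa sketch (the parameter values are assumptions, not the repo's defaults):

```python
import librosa
import numpy as np

# Illustrative parameters only; the real defaults live in app/model/dataset.py.
SAMPLE_RATE = 16_000
FRAME_DURATION = 1.0  # seconds per frame
N_MELS = 64

# Load and resample a clip to 16 kHz.
audio, _ = librosa.load("app/data/audio/example.wav", sr=SAMPLE_RATE)

# Slice one frame and compute its log-mel spectrogram (one CNN input).
frame = audio[: int(FRAME_DURATION * SAMPLE_RATE)]
mel = librosa.feature.melspectrogram(y=frame, sr=SAMPLE_RATE, n_mels=N_MELS)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, time_steps)
```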
```
.
├─ app/
│  ├─ data/               # sample WAV + labels
│  ├─ model/
│  │  ├─ dataset.py       # ShoutingVoiceFrameDataset implementation
│  │  └─ model.py         # ShoutingVoiceFrameCNN LightningModule
│  ├─ train.py            # training entry point
│  ├─ predict.py          # checkpoint-driven inference
│  └─ visualize.py        # waveform + prediction overlay
├─ tests/                 # pytest suite
├─ docker/                # placeholder for container assets
├─ environment.yml        # conda environment (Apple Silicon-friendly)
├─ requirements.txt       # pip-based dependency lock
├─ Makefile               # setup/format/lint/test helpers
├─ IMPLEMENTATION_PLAN.md # progress checklist
└─ AGENTS.md              # contributor guidelines
```
Clone the repository and create an environment with conda:

```bash
git clone <repo-url>
cd ShoutingVoiceDetection
conda env create -f environment.yml
conda activate svd
python -m pip install --upgrade pip
```

Or use the Makefile-managed virtualenv:

```bash
git clone <repo-url>
cd ShoutingVoiceDetection
make setup
source .venv/bin/activate  # after setup completes
```

Both paths install PyTorch, PyTorch Lightning, librosa, matplotlib, pytest, black, and ruff. The Makefile targets:

- `make format` → `black app tests`
- `make lint` → `ruff check app tests`
- `make test` → `pytest` (PYTHONPATH configured via `pytest.ini`)
The repo ships with docker/Dockerfile, which creates a slim CPU-only image that already contains Python, system audio libraries, and every dependency from requirements.txt. Use it when you want a guaranteed-clean environment or to run training in CI without managing conda.
Build once from the repo root:

```bash
docker build -t svd:cpu -f docker/Dockerfile .
```

Kick off training inside the container (all flags pass through to `app.train`):

```bash
docker run --rm svd:cpu --max_epochs=5 --batch_size=8
```

Need an interactive shell for debugging? Override the entrypoint:

```bash
docker run -it --entrypoint /bin/bash svd:cpu
```

Place short WAV clips under `app/data/audio/` and create `app/data/labels.csv` with:
```
file,start,end,label
example.wav,0.0,2.0,shouting
example.wav,2.0,4.0,non_vocal
```
Times are in seconds; audio is resampled to 16 kHz before feature extraction. Labels accept string values such as `shouting` and `non_vocal`, or numeric 0/1.
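How overlapping frames inherit labels from these spans is decided inside `ShoutingVoiceFrameDataset`; one plausible convention, shown purely as an assumption rather than the repo's actual rule, is to mark a frame positive when at least half of it overlaps a shouting span:

```python
import numpy as np

def spans_to_frame_labels(spans, total_duration, frame_duration=1.0, hop_duration=0.5):
    """Hypothetical helper: label a frame 1 if >=50% of it overlaps a shouting span."""
    starts = np.arange(0.0, total_duration - frame_duration + 1e-9, hop_duration)
    labels = np.zeros(len(starts), dtype=np.int64)
    for i, start in enumerate(starts):
        end = start + frame_duration
        overlap = sum(max(0.0, min(end, e) - max(start, s)) for s, e in spans)
        labels[i] = int(overlap >= 0.5 * frame_duration)
    return starts, labels

# Shouting span taken from the labels.csv example above (0.0 s to 2.0 s).
starts, labels = spans_to_frame_labels([(0.0, 2.0)], total_duration=4.0)
print(list(zip(starts.tolist(), labels.tolist())))
```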
- Inspect class balance with the same frame settings you use for training:

```bash
python -m app.utils.report_class_balance \
  --labels_csv app/data/labels.csv \
  --audio_dir app/data/audio \
  --frame_duration 1.0 \
  --hop_duration 0.5
```

This prints how many positive vs. negative frames exist overall and per file.
- Visualize the ground-truth spans from `labels.csv` on top of the waveform:

```bash
python -m app.utils.plot_labels \
  --audio app/data/audio/example.wav \
  --labels_csv app/data/labels.csv \
  --output outputs/example_labels.png
```

The plot highlights shouting intervals (red) and non-vocal intervals (green).
Run on CPU or Apple MPS:
```bash
python -m app.train \
  --labels_csv app/data/labels.csv \
  --audio_dir app/data/audio \
  --batch_size 4 \
  --max_epochs 5 \
  --frame_duration 1.0 \
  --hop_duration 0.5
```

Key flags:

- `--default_root_dir <dir>` (Lightning) if you want checkpoints somewhere other than `lightning_logs/svd/`.
- `--sample_rate`, `--n_mels`, `--n_fft`, `--spec_hop_length` to tweak the feature extractor.
- `--num_workers` for DataLoaders (set >0 when running outside notebooks).
- `--log_dir` to relocate TensorBoard events and checkpoints (default `lightning_logs`).
Lightning checkpoints land under lightning_logs/svd/.../checkpoints/epoch=*-step=*.ckpt. Copy or symlink a checkpoint to a stable location (e.g., checkpoints/last.ckpt) for inference.
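If you want to run the model outside `app/predict.py`, Lightning's standard `load_from_checkpoint` works on these files. A minimal sketch; the input shape is an assumption about the module's expected log-mel frame, and the forward pass is assumed to return logits:

```python
import torch

from app.model.model import ShoutingVoiceFrameCNN

# Restore weights and hyperparameters saved during training.
model = ShoutingVoiceFrameCNN.load_from_checkpoint("checkpoints/last.ckpt")
model.eval()

# Dummy input; real shapes come from the dataset's mel settings.
# (batch, channel, n_mels, time_steps) is assumed here.
dummy = torch.randn(1, 1, 64, 32)
with torch.no_grad():
    prob = torch.sigmoid(model(dummy))  # assumes the forward pass returns logits
print(prob)
```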
TensorBoard is included in requirements.txt/environment.yml. After any training run, Lightning writes logs under lightning_logs/svd/. Launch TensorBoard from the repo root:
```bash
tensorboard --logdir lightning_logs --port 6006
```

Open http://localhost:6006 to inspect loss curves, metrics, and learning-rate schedules across runs.
Generate frame probabilities for any WAV:
```bash
python -m app.predict \
  checkpoints/last.ckpt \
  app/data/audio/example.wav \
  --output outputs/example_preds.npy \
  --frame_duration 1.0 \
  --hop_duration 0.5 \
  --threshold 0.6
```

Outputs a NumPy array with shape (num_frames, 2) containing [start_time_sec, probability].
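A sketch of consuming that array downstream, merging consecutive above-threshold frames into (start, end) spans; the frame duration and threshold simply mirror the flags used above:

```python
import numpy as np

preds = np.load("outputs/example_preds.npy")  # shape (num_frames, 2)
times, probs = preds[:, 0], preds[:, 1]

threshold = 0.6
frame_duration = 1.0  # should match the --frame_duration used at prediction time

# Merge consecutive above-threshold frames into continuous (start, end) spans.
spans, current = [], None
for start, prob in zip(times, probs):
    if prob >= threshold:
        if current is None:
            current = [start, start + frame_duration]
        else:
            current[1] = start + frame_duration
    elif current is not None:
        spans.append(tuple(current))
        current = None
if current is not None:
    spans.append(tuple(current))

print(spans)  # e.g. [(0.0, 2.0)]
```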
Overlay shouting spans on the waveform using the saved predictions:
```bash
python -m app.visualize \
  --audio app/data/audio/example.wav \
  --predictions outputs/example_preds.npy \
  --threshold 0.6 \
  --output outputs/example_plot.png
```

If `--output` is omitted, the plot displays interactively.
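For a quick manual overlay without the CLI, the same idea fits in a few lines of librosa + matplotlib (the output path and shading style are illustrative, not the script's actual behavior):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sr = librosa.load("app/data/audio/example.wav", sr=16_000)
preds = np.load("outputs/example_preds.npy")

fig, ax = plt.subplots(figsize=(10, 3))
librosa.display.waveshow(audio, sr=sr, ax=ax)

# Shade every frame whose probability clears the threshold.
frame_duration, threshold = 1.0, 0.6
for start, prob in preds:
    if prob >= threshold:
        ax.axvspan(start, start + frame_duration, color="red", alpha=0.3)

fig.savefig("outputs/example_manual_plot.png", dpi=150)
```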
- `pytest tests/model -q` validates dataset and model components.
- `make format` / `make lint` keep code style consistent (black + ruff).
- For coverage-oriented runs: `pytest --cov=app --cov-report=term-missing`.
Track ongoing work in IMPLEMENTATION_PLAN.md. Major milestones already complete:
- Environment setup (conda + Makefile).
- Repository skeleton and sample data.
- Dataset/model implementations with unit tests.
- Training CLI and smoke test.
- Inference + visualization pipeline.
Remaining tasks include Dockerization, README screenshots/examples, and CI hooks.
See AGENTS.md for contributor expectations:
- Use 4-space indentation, snake_case naming, and PascalCase class names.
- Run `make format lint test` before opening a PR.
- Keep commits scoped (`feat:`, `fix:`, etc.) and link issues with `Closes #<id>`.
- Do not commit large audio datasets or secrets; store them outside git-tracked paths.
- `ModuleNotFoundError: app` → ensure you run commands via `python -m app.train` or set `PYTHONPATH=$(pwd)`.
- MPS/Metal errors → rerun with `--accelerator cpu` or set `PYTORCH_ENABLE_MPS_FALLBACK=1` (see the device-selection sketch after this list).
- librosa import issues → confirm the active environment is the one you created via conda/Make.
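As background for that fallback, a small device-selection guard in plain PyTorch (independent of this repo's CLIs) shows the behavior that `--accelerator cpu` forces:

```python
import torch

# Prefer Apple Silicon's Metal backend when it is both built and available,
# otherwise fall back to the CPU.
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
```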
For more background, refer to shouting_voice_detection_tutorial.md, which mirrors the end-to-end workflow described above. Happy experimenting!
