Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification
Sangmin Bae*,
June-Woo Kim*,
Won-Yang Cho,
Hyerim Baek,
Soyoun Son,
Byungjo Lee,
Changwan Ha,
Kyongpil Tae,
Sungnyun Kim,
Se-Young Yun
* equal contribution
- We demonstrate that models pretrained on large-scale visual and audio datasets generalize well to the respiratory sound classification task.
- We introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with the Audio Spectrogram Transformer (AST).
- To handle the label hierarchy in lung sound datasets, we propose an effective Patch-Mix Contrastive Learning that distinguishes the mixed representations in the latent space.
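The patch-level mixing idea can be sketched without any framework. This is a minimal illustration, not the repo's implementation: in the paper, mixing operates on AST's spectrogram patch embeddings, whereas here patches are plain Python lists and the function name is made up for illustration.

```python
import random

def patch_mix(patches_a, patches_b, ratio, seed=None):
    """Replace a random fraction `ratio` of sample A's patches with the
    patches at the same positions from sample B.

    Returns the mixed patch sequence and the realized mix ratio
    (the fraction of patches actually taken from B).
    """
    rng = random.Random(seed)
    n = len(patches_a)
    n_mix = round(ratio * n)
    mix_idx = set(rng.sample(range(n), n_mix))  # positions taken from B
    mixed = [patches_b[i] if i in mix_idx else patches_a[i] for i in range(n)]
    return mixed, n_mix / n
```

The realized ratio is returned because, with a discrete number of patches, it is what the mixed label or contrastive target should be based on, rather than the requested `ratio`.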
Install the necessary packages with:
$ pip install torch torchvision torchaudio
$ pip install -r requirements.txt
Download the ICBHI dataset files from official_page.
$ wget https://bhichallenge.med.auth.gr/sites/default/files/ICBHI_final_database/ICBHI_final_database.zip
All `*.wav` and `*.txt` files should be saved in `data/icbhi_dataset/audio_test_data`.
Note that the ICBHI dataset consists of a total of 6,898 respiratory cycles, of which 1,864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, in 920 annotated audio recordings from 126 subjects.
To train a model, simply run the shell scripts in `scripts/`:
- `scripts/icbhi_ce.sh`: Cross-Entropy loss with the AST model.
- `scripts/icbhi_patchmix_ce.sh`: Patch-Mix loss with the AST model, where the label depends on the interpolation ratio.
- `scripts/icbhi_patchmix_cl.sh`: Patch-Mix contrastive loss with the AST model.
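For the Patch-Mix CE variant, the target of a mixed sample is the interpolation of the two source labels by the mix ratio. A minimal sketch (the function name and plain-list label vectors are illustrative, not the repo's API):

```python
def interpolate_labels(y_a, y_b, ratio):
    """Mixed-sample target for Patch-Mix CE: interpolate the two
    one-hot (or soft) label vectors by the mix ratio `ratio`,
    i.e. the fraction of patches taken from sample B."""
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(y_a, y_b)]
```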
Important arguments for different data settings:
- `--dataset`: other lung sound datasets or heart sound datasets can be implemented
- `--class_split`: "lungsound" or "diagnosis" classification
- `--n_cls`: number of classes, 4 or 2 (normal / abnormal), for lungsound classification
- `--test_fold`: "official" denotes the 60/40% train/test split, and "0"~"4" denote 80/20% splits
Important arguments for models:
- `--model`: network architecture, see `models`
- `--from_sl_official`: load the ImageNet-pretrained checkpoint
- `--audioset_pretrained`: load the AudioSet-pretrained checkpoint (only supported for AST and SSAST)
Important arguments for evaluation:
- `--eval`: switch to evaluation mode without any training
- `--pretrained`: load a pretrained checkpoint; requires the `--pretrained_ckpt` argument
- `--pretrained_ckpt`: path to the pretrained checkpoint
The pretrained model checkpoints will be saved at `save/[EXP_NAME]/best.pth`.
Patch-Mix Contrastive Learning achieves a state-of-the-art Score of 62.37%, which is +4.08% higher than the previous best.
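For reference, the ICBHI Score is conventionally the arithmetic mean of sensitivity (recall on the abnormal classes) and specificity (recall on the normal class). A small helper, with the per-run Se/Sp values assumed to come from the evaluation logs:

```python
def icbhi_score(sensitivity, specificity):
    """ICBHI Score: arithmetic mean of sensitivity and specificity,
    both given in percent."""
    return (sensitivity + specificity) / 2.0
```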
If you find this repo useful for your research, please consider citing our paper:
@inproceedings{bae23b_interspeech,
title = {Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification},
author = {Sangmin Bae and June-Woo Kim and Won-Yang Cho and Hyerim Baek and Soyoun Son and Byungjo Lee and Changwan Ha and Kyongpil Tae and Sungnyun Kim and Se-Young Yun},
year = {2023},
booktitle = {INTERSPEECH 2023},
pages = {5436--5440},
doi = {10.21437/Interspeech.2023-1426},
issn = {2958-1796},
}
- Sangmin Bae: bsmn0223@kaist.ac.kr
- June-Woo Kim: kaen2891@gmail.com