A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.
- Contributions
-
Self-supervised learning of spatial acoustic representation (SSL-SAR)
- first self-supervised learning method in spatial acoustic representation learning and multi-channel audio signal processing
- designs cross-channel signal reconstruction pretext task to learn the spatial acoustic and the spectral pattern information
- learns useful knowledge that can be transferred to the spatial acoustics-related tasks
-
Multi-channel audio Conformer (MC-Conformer)
- unified architecture for both the pretext and downstream tasks
- learns the local and global properties of spatial acoustics present in the time-frequency domain
- boosts the performance of both pretext and downstream tasks
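The following is a minimal NumPy sketch of what a cross-channel reconstruction pretext task can look like, assuming a masked-frame formulation: some time frames of one channel are hidden, and a model has to reconstruct them from the remaining frames and the other channel. The function names, the zero-masking, the mask ratio and the squared-error loss are illustrative assumptions, not the repository's code.

```python
import numpy as np


def mask_frames(stft, channel, mask_ratio, rng):
    """Zero a random subset of time frames in one channel (illustrative only).

    stft: complex array of shape (n_channels, n_freq, n_frames).
    Returns a masked copy and a boolean vector marking the masked frames.
    """
    n_frames = stft.shape[-1]
    n_masked = int(round(mask_ratio * n_frames))
    masked_idx = rng.choice(n_frames, size=n_masked, replace=False)
    frame_mask = np.zeros(n_frames, dtype=bool)
    frame_mask[masked_idx] = True
    masked = stft.copy()
    # masked[channel] is a view, so this writes into the copy, not the input.
    masked[channel][:, frame_mask] = 0
    return masked, frame_mask


def masked_reconstruction_error(pred, target, channel, frame_mask):
    """Mean squared error restricted to the masked frames of one channel."""
    diff = pred[channel][:, frame_mask] - target[channel][:, frame_mask]
    return float(np.mean(np.abs(diff) ** 2))


rng = np.random.default_rng(0)
clean = np.ones((2, 4, 10), dtype=complex)   # toy 2-channel "STFT"
masked, frame_mask = mask_frames(clean, channel=0, mask_ratio=0.5, rng=rng)
# Using the masked input as a trivial "prediction" gives the error a model must beat.
baseline = masked_reconstruction_error(masked, clean, 0, frame_mask)
```

In an actual training loop the prediction would come from the network (here, the MC-Conformer) rather than from the masked input itself.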
-
- Source signals: from WSJ0 database
- Simulated RIRs: generated by gpuRIR toolbox
- Simulated noise: generated by arbitrary noise field generator
- Real-world RIRs or microphone signals: from MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting, RealMAN databases
| Datasets | #Room | Microphone Array | #Mic. Pair | #Room x #Source position x #Array position | Noise Type |
|---|---|---|---|---|---|
| MIR | 3 | Three 8-channel linear arrays | 60 | 3 x 26 x 1 | W/o |
| MeshRIR | 1 | 441 microphones | 8874 | 1 x 32 x 1 | W/o |
| DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
| dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 x 3 x 1 | Ambience, babble, white |
| BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
| ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 x 1 x 2 | Ambience, babble, fan |
| LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
| MC-WSJ-AV | 3 | Two 8-channel linear arrays | | | |
| LibriCSS | 1 | A 7-channel circular array | | | |
| AMIMeeting | 3 | An 8-channel circular array | | | |
| AISHELL-4 | 10 | An 8-channel circular array | | | |
| AliMeeting | 21 | An 8-channel circular array | | | |
| RealMAN | 32 | A 32-channel high-precision array | | | |
- code: 202407, the results are still being tested (to be updated).
- code_v1: 202402, the results are the same as in the paper.
1. Download the datasets into folders according to the following directory structure
.-SAR-SSL
| .-code
| .-data
| .-exp
.-data
.-SrcSig
| .-wsj0
| .-dt
| .-et
| .-tr
.-RIR
| .-Mesh
| | .-S32-M441_npy
| .-MIRDB
| | .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
| .-DCASE
| | .-TAU-SRIR_DB
| | .-TAU-SNoise_DB
| .-dEchorate
| | .-dEchorate_database.csv
| | .-dEchorate_rir.h5
| | .-dEchorate_annotations.h5
| | .-dEchorate_noise_gzip7.hdf5
| | .-dEchorate_babble_gzip7.hdf5
| | .-dEchorate_silence_gzip7.hdf5
| .-BUTReverb
| | .-RIRs
| .-ACE
| .-RIRN
| .-Data
.-MicSig
.-LOCATA
.-dev
.-eval
.- MC_WSJ_AV
.- LibriCSS
.- AMIMeeting
.- AISHELL-4
.- AliMeeting
.- RealMAN
2. Generate room impulse responses or microphone signals
-
Data for simulated experiments
- pre-training
python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
- some test instances
python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
- downstream training
python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
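The range arguments in the test-instance command above (`--room_sz_range`, `--T60_range`, `--snr_range`) describe intervals from which each simulated scene is drawn. A minimal sketch of that idea follows, assuming independent uniform sampling within each interval; `sample_scene` is a hypothetical name, and `gen_simu.py` may sample differently.

```python
import random


def sample_scene(room_sz_range, t60_range, snr_range, rng=random):
    """Draw one acoustic scene; uniform sampling is an assumption."""
    # One [low, high] interval per room dimension (length, width, height).
    room_size = [rng.uniform(lo, hi) for lo, hi in room_sz_range]
    t60 = rng.uniform(*t60_range)
    snr = rng.uniform(*snr_range)
    return {"room_size": room_size, "T60": t60, "SNR": snr}


# Values copied from the pretest_ins_T1000 command: a degenerate interval
# such as [1.0, 1.0] pins the parameter to a single value.
scene = sample_scene([[5, 10], [3, 6], [2.5, 3]], [1.0, 1.0], [20, 20])
```

With these arguments every generated instance has the same T60 and SNR, and only the room dimensions vary.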
-
Data for real-world experiments
- real-world RIR and noise signals
python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
- microphone signals for pre-training with selected RIRs and noise signals
python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real
python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real
- LOCATA microphone signals for downstream training (TDOA estimation)
python gen_LOCATA.py --stage train --save-to ../../data/MicSig/real_ds_locata
python gen_LOCATA.py --stage val --save-to ../../data/MicSig/real_ds_locata
python gen_LOCATA.py --stage test --save-to ../../data/MicSig/real_ds_locata
- additional RIRs for downstream training
python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu
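Generating microphone signals from RIRs and noise recordings, as in the steps above, amounts to convolving a source signal with one RIR per microphone and adding noise at a chosen SNR. The NumPy sketch below illustrates that general procedure. The function name, the truncation to the source length and the SNR definition (power averaged over all channels) are assumptions, and the repository's scripts may differ in these details.

```python
import numpy as np


def simulate_mic_signals(src, rirs, noise, snr_db):
    """Convolve a source with per-microphone RIRs and add noise at snr_db.

    src:   (n_samples,) source signal
    rirs:  (n_mics, rir_len) room impulse responses
    noise: (n_mics, >= n_samples) noise signals
    """
    # Reverberant signal per microphone, truncated to the source length.
    reverberant = np.stack([np.convolve(src, h)[: len(src)] for h in rirs])
    noise = noise[:, : reverberant.shape[1]]
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(sig_power / scaled_noise_power) == snr_db.
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return reverberant + scaled_noise, reverberant, scaled_noise


rng = np.random.default_rng(0)
src = rng.standard_normal(1000)
rirs = rng.standard_normal((2, 50))
noise = rng.standard_normal((2, 1000))
mixed, reverberant, scaled_noise = simulate_mic_signals(src, rirs, noise, snr_db=20)
achieved_snr = 10 * np.log10(np.mean(reverberant ** 2) / np.mean(scaled_noise ** 2))
```

The random arrays stand in for a WSJ0 utterance, a measured RIR pair and a recorded noise segment.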
1. Preparation
- Install: numpy, scipy, soundfile, gpuRIR, etc.
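A quick way to see which of the listed packages are still missing is to probe for them without importing them. The helper below is a hypothetical convenience, not part of the repository; the package list is only what the line above names, and the "etc." there means further dependencies may be needed.

```python
import importlib.util

# Only the packages named explicitly above; the full list may be longer.
REQUIRED = ["numpy", "scipy", "soundfile", "gpuRIR"]


def missing_packages(names=REQUIRED):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_packages()` returns an empty list once all four modules are importable in the active environment.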
2. Training
-
Simulated experiments
-
Pretext task: pre-training
python run_pretrain.py --pretrain --simu-exp --gpu-id 0,
-
Pretext task: evaluation
# * denotes the time version of pre-training model
# --test-mode all: all or ins
python run_pretrain.py --test --simu-exp --time * --test-mode all --gpu-id 0,
-
Downstream task: training
# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50 or ABS
# --ds-trainmode: finetune, scratchLOW or lineareval
python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
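| Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
|---|---|---|---|---|---|
| train | x16 | 2 | 50 | 2 | 200 |
| | x8 | 4 | 50 | 2 | 400 |
| | x4 | 8 | 50 | 2 | 800 |
| | x2 | 16 | 50 | 2 | 1600 |
| | x1 | 32 | 50 | 2 | 3200 |
| | x1 | 64 | 50 | 2 | 6400 |
| | x1 | 128 | 50 | 2 | 12800 |
| | x1 | 256 | 50 | 2 | 25600 |
| val | - | 20 | 50 | 1 | 1000 |
| test | - | 20 | 50 | 4 | 4000 |
-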
Downstream task: evaluation
# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50, or ABS
# --ds-trainmode: finetune, scratchLOW or lineareval
# --test_mode: cal_metric, cal_metric_wo_info or vis_embed
python run_downstream.py --ds-test --test_mode cal_metric --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
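The nMicSig column of the downstream data table is the product of the three columns before it (nRooms x nRIRs/Room x nSrcSig/RIR). The short check below reproduces the table's values; the function and variable names are illustrative.

```python
def n_mic_signals(n_rooms, n_rirs_per_room, n_src_per_rir):
    """Number of microphone signals for one dataset split."""
    return n_rooms * n_rirs_per_room * n_src_per_rir


# Room counts accepted by --ds-nsimroom; 50 RIRs per room, 2 source signals per RIR.
TRAIN_ROOM_COUNTS = [2, 4, 8, 16, 32, 64, 128, 256]
train_sizes = {n: n_mic_signals(n, 50, 2) for n in TRAIN_ROOM_COUNTS}

val_size = n_mic_signals(20, 50, 1)    # 20 rooms, 1 source signal per RIR
test_size = n_mic_signals(20, 50, 4)   # 20 rooms, 4 source signals per RIR
```

So `--ds-nsimroom 8`, as used in the example commands, corresponds to 800 training signals.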
-
-
Real-world experiments
-
Pretext task: pre-training
When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized to 0.001), and then fine-tune on real-world data with a learning rate of 0.0001.
python run_pretrain.py --pretrain --gpu-id 0,
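For reference, a cosine-decay schedule of the kind mentioned above can be written in a few lines. This is a generic sketch: only the initial value 0.001 and the fine-tuning rate 0.0001 come from the text, while the function name, the total step count and the final value of 0 are assumptions; the repository's scheduler may use different settings.

```python
import math


def cosine_decay_lr(step, total_steps, lr_init=0.001, lr_min=0.0):
    """Learning rate after `step` of `total_steps` under cosine decay."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Stage 1 (simulated data): decays from 0.001 towards lr_min.
stage1_lrs = [cosine_decay_lr(s, total_steps=100) for s in range(101)]

# Stage 2 (real-world data): the text specifies a learning rate of 0.0001.
FINETUNE_LR = 0.0001
```

The fine-tuning rate is one tenth of the initial pre-training rate, which keeps the real-world stage from moving far from the weights learned on simulated data.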
-
Downstream task: training
# --ds-task: TDOA, DRR, T60, C50 or ABS
# --ds-trainmode: finetune, scratchLOW or lineareval
# --ds-real-sim-ratio = 1 1, 1 0 or 0 1
python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
-
Downstream task: read downstream results (MAEs of TDOA, DRR, T60, C50, SNR, ABS estimation) from saved mat files
python read_result_from_downstream_matfile.py --time *
python read_lossmetric_simdata_from_logfile.py
python read_lossmetric_realdata_from_logfile.py
-
-
Trained models
- pretext task
- best_model.tar
- downstream task
- ensemble_model.tar
If OSError: [Errno 24] Too many open files occurs, enter the following at the command line
ulimit -n 2048
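The `ulimit` command changes the limit for the current shell session. On Unix systems the same soft limit can also be raised from inside a Python process with the standard-library `resource` module, as sketched below. `raise_open_file_limit` is a hypothetical helper, not part of the repository, and it only affects the process that calls it and that process's children.

```python
import resource


def raise_open_file_limit(target=2048):
    """Raise the soft open-file limit to `target` where the hard limit allows."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Nothing to do if the limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    # The soft limit may not exceed the hard limit for unprivileged users.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


new_limit = raise_open_file_limit(2048)
```

If the hard limit is below 2048, the helper stops at the hard limit, and raising it further requires administrator rights.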
If you find our work useful in your research, please consider citing:
@Article{yang2024sarssl,
Author = "Bing Yang and Xiaofei Li",
Title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
Journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)",
Volume = "32",
Number = "",
Pages = "4211-4225",
Year = "2024"}
MIT