SAR-SSL

A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.

  • Contributions
    • Self-supervised learning of spatial acoustic representation (SSL-SAR)

      • the first self-supervised learning method for spatial acoustic representation learning and multi-channel audio signal processing
      • designs a cross-channel signal reconstruction pretext task to learn both spatial acoustic and spectral pattern information (a toy illustration follows this list)
      • learns knowledge that transfers to spatial-acoustics-related downstream tasks
    • Multi-channel audio Conformer (MC-Conformer)

      • a unified architecture for both the pretext and downstream tasks
      • learns the local and global properties of spatial acoustics present in the time-frequency domain
      • boosts the performance of both the pretext and downstream tasks
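
For intuition, below is a toy sketch of the cross-channel reconstruction idea. It is a hypothetical, minimal illustration only: the ToyCCSR module, its GRU encoder, and the random-bin masking are assumptions made for this example, not the repository's MC-Conformer or its actual masking strategy. It demonstrates the idea described above: hide time-frequency bins of one channel's STFT and train a network to reconstruct them from the two-channel input.

    # Toy sketch of cross-channel signal reconstruction (illustrative only).
    import torch
    import torch.nn as nn

    class ToyCCSR(nn.Module):
        def __init__(self, n_freq=256, hidden=128):
            super().__init__()
            # 2 channels x (real, imag) per frequency bin -> 4 * n_freq features per frame
            self.encoder = nn.GRU(input_size=4 * n_freq, hidden_size=hidden, batch_first=True)
            self.decoder = nn.Linear(hidden, 2 * n_freq)  # real/imag of the masked channel

        def forward(self, stft_2ch, mask):
            # stft_2ch: (B, 2, T, F) complex STFT of a microphone pair
            # mask:     (B, T, F) boolean, True where channel 0 is hidden from the model
            x = torch.view_as_real(stft_2ch).clone()        # (B, 2, T, F, 2)
            x[:, 0][mask] = 0.0                             # zero the masked bins of channel 0
            B, _, T, F, _ = x.shape
            h, _ = self.encoder(x.permute(0, 2, 1, 3, 4).reshape(B, T, -1))
            rec = self.decoder(h).reshape(B, T, F, 2)
            return torch.view_as_complex(rec.contiguous())  # reconstructed channel 0

    def ccsr_loss(model, stft_2ch, mask):
        # pre-training objective: reconstruction error on the masked bins only
        rec = model(stft_2ch, mask)
        diff = torch.view_as_real(rec - stft_2ch[:, 0])
        return (diff[mask] ** 2).mean()

    # usage: 4 utterances, 100 frames, 256 frequency bins, ~30% of bins masked
    stft = torch.randn(4, 2, 100, 256, dtype=torch.complex64)
    mask = torch.rand(4, 100, 256) < 0.3
    print(ccsr_loss(ToyCCSR(), stft, mask))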

Datasets

  • Source signals: from WSJ0 database
  • Simulated RIRs: generated by the gpuRIR toolbox (a minimal usage sketch follows the table below)
  • Simulated noise: generated by arbitrary noise field generator
  • Real-world RIRs or microphone signals: from MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting, RealMAN databases
    | Datasets | #Room | Microphone Array | #Mic. Pair | #Room x #Source position x #Array position | Noise Type |
    | --- | --- | --- | --- | --- | --- |
    | MIR | 3 | Three 8-channel linear arrays | 60 | 3 x 26 x 1 | W/o |
    | MeshRIR | 1 | 441 microphones | 8874 | 1 x 32 x 1 | W/o |
    | DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
    | dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 x 3 x 1 | Ambience, babble, white |
    | BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
    | ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 x 1 x 2 | Ambience, babble, fan |
    | LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
    | MC-WSJ-AV | 3 | Two 8-channel linear arrays | | | |
    | LibriCSS | 1 | A 7-channel circular array | | | |
    | AMIMeeting | 3 | An 8-channel circular array | | | |
    | AISHELL-4 | 10 | An 8-channel circular array | | | |
    | AliMeeting | 21 | An 8-channel circular array | | | |
    | RealMAN | 32 | A 32-channel high-precision array | | | |
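
The simulated RIRs listed above are generated with the gpuRIR toolbox. As a rough, standalone illustration of a single simulation (the room geometry, T60, and positions are arbitrary assumptions, the call pattern follows gpuRIR's public API, and a CUDA-capable GPU is required; the repository's gen_simu*.py scripts handle this internally):

    # Minimal gpuRIR sketch (illustrative values; not the repository's generation code).
    import numpy as np
    import gpuRIR

    fs = 16000
    room_sz = [6.0, 4.0, 3.0]                       # room dimensions in metres
    T60 = 0.6                                       # reverberation time in seconds
    pos_src = np.array([[2.0, 1.5, 1.6]])           # one source position
    pos_rcv = np.array([[3.0, 2.0, 1.5],            # a two-microphone pair
                        [3.1, 2.0, 1.5]])

    beta = gpuRIR.beta_SabineEstimation(room_sz, T60)      # wall reflection coefficients
    Tdiff = gpuRIR.att2t_SabineEstimator(15, T60)          # switch to diffuse model after 15 dB decay
    Tmax = gpuRIR.att2t_SabineEstimator(60, T60)           # RIR length: 60 dB decay
    nb_img = gpuRIR.t2n(Tdiff, room_sz)                    # number of image sources
    rirs = gpuRIR.simulateRIR(room_sz, beta, pos_src, pos_rcv, nb_img, Tmax, fs, Tdiff=Tdiff)
    print(rirs.shape)                                      # (num sources, num mics, num samples)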

Quick start

Version update

  • code: 202407; results are still being tested (to be updated).
  • code_v1: 202402; results match those reported in the paper.

Data generation

1. Download the datasets and organize them according to the following directory tree (an optional sanity-check script is sketched after the tree)

.-SAR-SSL
| .-code
| .-data
| .-exp
.-data
  .-SrcSig
  | .-wsj0
  |   .-dt
  |   .-et
  |   .-tr
  .-RIR
  | .-Mesh
  | | .-S32-M441_npy
  | .-MIRDB
  | | .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
  | .-DCASE
  | | .-TAU-SRIR_DB
  | | .-TAU-SNoise_DB
  | .-dEchorate
  | | .-dEchorate_database.csv
  | | .-dEchorate_rir.h5
  | | .-dEchorate_annotations.h5
  | | .-dEchorate_noise_gzip7.hdf5
  | | .-dEchorate_babble_gzip7.hdf5
  | | .-dEchorate_silence_gzip7.hdf5
  | .-BUTReverb
  | | .-RIRs
  | .-ACE
  |   .-RIRN
  |   .-Data
  .-MicSig
    .-LOCATA
      .-dev
      .-eval
    .-MC_WSJ_AV
    .-LibriCSS
    .-AMIMeeting
    .-AISHELL-4
    .-AliMeeting
    .-RealMAN
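
Before running the generation scripts, it can help to confirm the folders are in place. A small optional check (not part of the repository; DATA_ROOT and the subset of paths checked here are assumptions based on the tree above):

    # Optional sanity check: verify the expected dataset folders exist.
    from pathlib import Path

    DATA_ROOT = Path("../data")   # adjust to point at the top-level "data" folder next to "SAR-SSL"
    expected = [
        "SrcSig/wsj0/tr", "SrcSig/wsj0/dt", "SrcSig/wsj0/et",
        "RIR/Mesh/S32-M441_npy", "RIR/MIRDB", "RIR/DCASE/TAU-SRIR_DB",
        "RIR/dEchorate", "RIR/BUTReverb/RIRs", "RIR/ACE",
        "MicSig/LOCATA/dev", "MicSig/LOCATA/eval", "MicSig/RealMAN",
    ]
    missing = [p for p in expected if not (DATA_ROOT / p).is_dir()]
    print("All expected folders found." if not missing else f"Missing: {missing}")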

2. Generate room impulse responses or microphone signals

  • Data for simulated experiments (an optional output check is sketched after the commands)

    • pre-training
      python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
      python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • some test instances
      python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • downstream training
      python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
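
    After generation, a quick file count can catch path mistakes early. This optional helper is not part of the repository, and the per-stage subfolder layout it assumes may differ from the scripts' actual output structure:

      # Optional: count generated items per stage (adjust save_dir and the assumed subfolders).
      from pathlib import Path

      save_dir = Path("../../data/MicSig/simu")   # matches --save_to above
      for stage in ["pretrain", "preval", "pretest"]:
          stage_dir = save_dir / stage
          n_files = sum(1 for p in stage_dir.rglob("*") if p.is_file()) if stage_dir.exists() else 0
          print(f"{stage}: {n_files} files")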
      
  • Data for real-world experiments

    • real-world RIR and noise signals
      python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      
    • microphone signals for pre-training with selected RIRs and noise signals
      python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real 
      python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real  
      
    • LOCATA microphone signals for downstream training (TDOA estimation)
      python gen_LOCATA.py --stage train --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage val --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage test --save-to ../../data/MicSig/real_ds_locata
      
    • additional RIRs for downstream training
      python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu 
      

Pretext Task

1. Preparation

  • Install: numpy, scipy, soundfile, gpuRIR, etc.

2. Training

  • Simulated experiments

    • Pretext task: pre-training

      python run_pretrain.py --pretrain --simu-exp --gpu-id 0,
      
    • Pretext task: evaluation

      # * denotes the time stamp of the pre-training model
      # --test-mode: all or ins
      python run_pretrain.py --test --simu-exp --time * --test-mode all --gpu-id 0, 
      
    • Downstream task: training

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
      
      | Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
      | --- | --- | --- | --- | --- | --- |
      | train | x16 | 2 | 50 | 2 | 200 |
      | train | x8 | 4 | 50 | 2 | 400 |
      | train | x4 | 8 | 50 | 2 | 800 |
      | train | x2 | 16 | 50 | 2 | 1600 |
      | train | x1 | 32 | 50 | 2 | 3200 |
      | train | x1 | 64 | 50 | 2 | 6400 |
      | train | x1 | 128 | 50 | 2 | 12800 |
      | train | x1 | 256 | 50 | 2 | 25600 |
      | val | - | 20 | 50 | 1 | 1000 |
      | test | - | 20 | 50 | 4 | 4000 |

      (nMicSig = nRooms x nRIRs/Room x nSrcSig/RIR.)
    • Downstream task: evaluation

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50, or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --test_mode: cal_metric, cal_metric_wo_info or vis_embed
      python run_downstream.py --ds-test --test_mode cal_metric --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
      
  • Real-world experiments

    • Pretext task: pre-training

      When using real-world data, first pre-train on simulated data with the default cosine-decay learning rate (initialized to 0.001), and then fine-tune on real-world data with a learning rate of 0.0001 (see the sketch after the command below).

      python run_pretrain.py --pretrain --gpu-id 0, 
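
      A minimal PyTorch-style sketch of that two-stage schedule (the Linear model and random loss are placeholders so the snippet runs standalone; run_pretrain.py contains the actual training loop):

      # Hypothetical two-stage schedule: cosine decay from 0.001 on simulated data,
      # then a fixed 0.0001 for fine-tuning on real-world data.
      import torch

      model = torch.nn.Linear(10, 10)                     # placeholder for the pre-training model
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      num_epochs = 100
      scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

      for epoch in range(num_epochs):                     # stage 1: simulated data
          optimizer.zero_grad()
          loss = model(torch.randn(8, 10)).pow(2).mean()  # placeholder for the CCSR loss
          loss.backward()
          optimizer.step()
          scheduler.step()

      for group in optimizer.param_groups:                # stage 2: real-world data, fixed smaller lr
          group["lr"] = 1e-4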
      
    • Downstream task: training

      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --ds-real-sim-ratio: 1 1, 1 0 or 0 1
      python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      
    • Downstream task: read downstream results (MAEs of TDOA, DRR, T60, C50, SNR and ABS estimation) from the saved mat files (a minimal mat-file inspection sketch follows the commands)

      python read_result_from_downstream_matfile.py --time *
      python read_lossmetric_simdata_from_logfile.py
      python read_lossmetric_realdata_from_logfile.py
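
      To inspect the saved mat files directly, scipy.io.loadmat works; the file path and variable names below are placeholders (the scripts above define the actual names):

      # Inspect a downstream result file (hypothetical path; see
      # read_result_from_downstream_matfile.py for the authoritative parsing).
      from scipy.io import loadmat

      results = loadmat("path/to/downstream_result.mat")    # replace with an actual saved mat file
      for key, value in results.items():
          if not key.startswith("__"):                      # skip the MATLAB header entries
              print(key, getattr(value, "shape", value))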
      
  • Trained models

    • pretext task
      • best_model.tar
    • downstream task
      • ensemble_model.tar

Others

If OSError: [Errno 24] Too many open files occurs, raise the open-file limit at the command line:

ulimit -n 2048

Citation

If you find our work useful in your research, please consider citing:

@Article{yang2024sarssl,
    Author = "Bing Yang and Xiaofei Li",
    Title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
    Journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)",
    Volume = "32",
    Pages = "4211-4225",
    Year = "2024"}

Licence

MIT