- The paper presenting SpeakerGuard has been accepted by IEEE Transactions on Dependable and Secure Computing (TDSC), 2022.
- Impact: SpeakerGuard is used by at least eight third-party research projects.
- Impact: SpeakerGuard has been included as part of an industry-academia collaboration project with Ant Group.
This repository contains the code for SpeakerGuard, a PyTorch library for security research on speaker recognition that addresses the lack of benchmarks in the field.
Paper: SpeakerGuard Paper
Website: SpeakerGuard Website
Feel free to use SpeakerGuard for academic purpose 😄. For commercial purpose, please contact us 📫.
Cite our paper as follows:
@article{SpeakerGuard,
author = {Guangke Chen and
Zhe Zhao and
Fu Song and
Sen Chen and
Lingling Fan and
Feng Wang and
Jiashui Wang},
title = {Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition},
journal = {IEEE Transactions on Dependable and Secure Computing},
year = {2022}
}
This platform has the following features:
- fully developed with PyTorch, a user-friendly deep learning framework.
- modular design with six key components, i.e., Model, Dataset, Attack (including 7 attack types), Defense (featuring 24 defenses), Adaptive Attack, and Metric (with 8 evaluation metrics), for rich functionality, usability, and extensibility.
- unified APIs and abstract classes for different objects within the same module, and step-by-step guidance with several detailed examples, for easy use.
- each component implemented as a separate Python module with its own directory and accessible via import, allowing seamless integration of new elements.
pytorch=1.6.0, torchaudio=0.6.0, numpy=1.19.2, scipy=1.4.1, libKMCUDA=6.2.3, kmeans-pytorch=0.3, torch-lfilter=0.0.3, pesq=0.0.2, pystoi=0.3.3, librosa=0.8.0, kaldi-io=0.9.4
Note: libKMCUDA and kmeans-pytorch=0.3 are used by our proposed feature-level defense Feature Compression (FeCo), for the GPU and CPU versions, respectively. If you don't have a GPU, you can skip libKMCUDA. If you have problems installing libKMCUDA, see my instructions.
If you want to use the speech_compression methods in `defense/speech_compression.py`, you should also install ffmpeg and the required de/en-coders. See these instructions.
We provide five datasets, namely, Spk10_enroll, Spk10_test, Spk10_imposter, Spk251_train and Spk251_test. They cover all the recognition tasks (i.e., CSI-E, CSI-NE, SV and OSI). The code in `./dataset/Dataset.py` will download them automatically when they are used. You can also manually download them using the following links:
Spk10_enroll.tar.gz, 18MB, MD5:0e90fb00b69989c0dde252a585cead85
Spk10_test.tar.gz, 114MB, MD5:b0f8eb0db3d2eca567810151acf13f16
Spk10_imposter.tar.gz, 212MB, MD5:42abd80e27b78983a13b74e44a67be65
Spk251_train.tar.gz, 10GB, MD5:02bee7caf460072a6fc22e3666ac2187 or Spk251_train.tar.gz (Tencent Weiyun mirror)
Spk251_test.tar.gz, 1GB, MD5:182dd6b17f8bcfed7a998e1597828ed6
After downloading, untar them inside the `./data` directory.
- Download pre-trained-models.tar.gz, 340MB, MD5:b011ead1e6663d557afa9e037f30a866 and untar it inside the repository directory (i.e., `./`). It contains the pre-trained ivector-PLDA and xvector-PLDA background models.
- Run `python enroll.py iv_plda` and `python enroll.py xv_plda` to enroll the speakers in Spk10_enroll for the ivector-PLDA and xvector-PLDA systems. Multiple-speaker models for the CSI-E and OSI tasks are stored as `speaker_model_iv_plda` and `speaker_model_xv_plda` inside `./model_file`. Single-speaker models for the SV task are stored as `speaker_model_iv_plda_{ID}` and `speaker_model_xv_plda_{ID}` inside `./model_file`.
- Run `python set_threshold.py iv_plda` and `python set_threshold.py xv_plda` to set the thresholds of the SV/OSI tasks (this also tests the EER of the SV/OSI tasks and the accuracy of the CSI-E task).
- Sole natural training:
python natural_train.py -num_epoches 30 -batch_size 128 -model_ckpt ./model_file/natural-audionet -log ./model_file/natural-audionet-log
- Natural training with QT (q=512):
python natural_train.py -defense QT -defense_param 512 -defense_flag 0 -model_ckpt ./model_file/QT-512-natural-audionet -log ./model_file/QT-512-natural-audionet-log
Note: `-defense_flag 0` means QT operates at the waveform level.
- Sole FGSM adversarial training:
python adver_train.py -attacker FGSM -epsilon 0.002 -model_ckpt ./model_file/fgsm-adver-audionet -log ./model_file/fgsm-adver-audionet-log -evaluate_adver
- Sole PGD adversarial training:
python adver_train.py -attacker PGD -epsilon 0.002 -max_iter 10 -model_ckpt ./model_file/pgd-adver-audionet -log ./model_file/pgd-adver-audionet-log
- Combining adversarial training with the input transformation AT (AT is randomized, so EOT should be used during training):
python adver_train.py -defense AT -defense_param 16 -defense_flag 0 -attacker PGD -epsilon 0.002 -max_iter 10 -EOT_size 10 -EOT_batch_size 5 -model_ckpt ./model_file/AT-16-pgd-adver-audionet -log ./model_file/AT-16-pgd-adver-audionet-log
- Example 1: FAKEBOB attack on the naturally-trained audionet model with QT (q=512)
python attackMain.py -task CSI -root ./data -name Spk251_test -des ./adver-audio/QT-512-audionet-fakebob audionet_csine -extractor ./model_file/QT-512-natural-audionet FAKEBOB -epsilon 0.002
- Example 2: PGD targeted attack on the FeCo-defended ivector-PLDA model for the CSI task (FeCo is randomized, so EOT is used)
python attackMain.py -defense FeCo -defense_param "kmeans 0.2 L2" -defense_flag 1 -root ./data -name Spk10_test -des ./adver-audio/iv-pgd -task CSI -EOT_size 5 -EOT_batch_size 5 -targeted iv_plda -model_file ./model_file/iv_plda/speaker_model_iv_plda -gmm_frame_bs 50 PGD -epsilon 0.002 -max_iter 5 -loss Margin
Note: `-defense_flag 1` means we want FeCo to operate at the raw acoustic feature level. Set `-defense_flag 2` or `-defense_flag 3` for the delta or cmvn acoustic feature level. For the iv_plda model, consider reducing the `-gmm_frame_bs` parameter if you encounter an OOM error.
- Example 1: Testing a non-adaptive attack
python test_attack.py -defense QT -defense_param 512 -defense_flag 0 -root ./adver-audio -name QT-512-audionet-fakebob -root_ori ./data -name_ori Spk251_test audionet_csine -extractor ./model_file/QT-512-natural-audionet
- Example 2: Testing an adaptive attack
python test_attack.py -defense FeCo -defense_param "kmeans 0.2 L2" -defense_flag 1 -root ./adver-audio -name iv-pgd iv_plda -model_file ./model_file/iv_plda/speaker_model_iv_plda
In Example 1, the adversarial examples are generated on the undefended audionet model but tested on the QT-defended audionet model, so it is a non-adaptive attack.
In Example 2, the adversarial examples are generated on the FeCo-defended ivector-PLDA model using EOT (to overcome the randomness of FeCo) and also tested on the FeCo-defended ivector-PLDA model, so it is an adaptive attack. In this example, the adaptive attack may not be strong enough. You can strengthen it by setting a larger max_iter or a larger EOT_size, at the cost of increased attack overhead.
By default, a targeted attack randomly selects the target label. If you want to control the target label, you can run `specify_target_label.py` and feed the generated target label file to `attackMain.py` and `test_attack.py`.
`test_attack.py` can also be used to test the benign accuracy of systems. Just let `-root` and `-name` point to the benign dataset.
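For instance, the benign accuracy of the naturally-trained audionet model could be tested along the following lines (this command is illustrative and mirrors the flags of Example 1 above; adjust paths and flags to your setup):
python test_attack.py -root ./data -name Spk251_test audionet_csine -extractor ./model_file/natural-audionet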
You can also try combinations of different transformation-based defenses, e.g.,
-defense QT AT FeCo -defense_param 512 16 "kmeans 0.5 L2" -defense_flag 0 0 1 -defense_order sequential
where `-defense_order` specifies how the defenses are combined (sequential or average).
If you would like to incorporate your attacks/defenses/models/datasets into our official repository so that everyone can access them (and as a way to publicize your work), feel free to make a pull request or contact us.
MC contains three state-of-the-art embedding-based speaker recognition models, i.e., ivector-PLDA, xvector-PLDA and AudioNet. Xvector-PLDA and AudioNet are based on neural networks, while ivector-PLDA is based on a statistical model (i.e., the Gaussian Mixture Model).
The flexibility and extensibility of SpeakerGuard make it easy to add new models. To add a new model, one can define a new subclass of the `torch.nn.Module` class and implement three methods: `forward`, `score`, and `make_decision`; the model can then be evaluated using the different attacks.
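As a rough illustration, a new model might look like the following sketch. The class name, layer sizes, and the return conventions of `score` and `make_decision` are assumptions made for illustration; consult the existing models in the repository for the exact conventions SpeakerGuard expects.

```python
import torch
import torch.nn as nn

class ToyAudioModel(nn.Module):
    """Illustrative sketch of a new model; shapes and return
    conventions are assumptions, not the library's exact API."""

    def __init__(self, n_speakers=10):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(16, n_speakers)

    def forward(self, x):
        # x: (batch, 1, num_samples) raw waveform -> logits (batch, n_speakers)
        h = torch.relu(self.conv(x))
        return self.fc(self.pool(h).squeeze(-1))

    def score(self, x):
        # scores consumed by score-based attacks; here, log-probabilities
        return torch.log_softmax(self.forward(x), dim=1)

    def make_decision(self, x):
        # predicted speaker index plus the scores it was based on
        scores = self.score(x)
        return scores.argmax(dim=1), scores
```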
We provide five datasets, namely, Spk10_enroll, Spk10_test, Spk10_imposter, Spk251_train and Spk251_test. They cover all the recognition tasks (i.e., CSI-E, CSI-NE, SV and OSI).
All our datasets are subclasses of the class `torch.utils.data.Dataset`. Hence, to add a new dataset, one just needs to define a new subclass of `torch.utils.data.Dataset` and implement two methods: `__len__` and `__getitem__`, which define the length of the dataset and how its elements are loaded.
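A minimal sketch of such a subclass follows. The `(waveform, label)` return convention and the use of `torchaudio.load` are illustrative assumptions; see `./dataset/Dataset.py` for the conventions our datasets actually follow.

```python
import torchaudio
from torch.utils.data import Dataset

class MyAudioDataset(Dataset):
    """Illustrative dataset over a list of wav files with speaker labels."""

    def __init__(self, wav_paths, labels):
        assert len(wav_paths) == len(labels)
        self.wav_paths = wav_paths
        self.labels = labels

    def __len__(self):
        # number of utterances in the dataset
        return len(self.wav_paths)

    def __getitem__(self, idx):
        # load one utterance; returns (waveform, label)
        waveform, _sample_rate = torchaudio.load(self.wav_paths[idx])
        return waveform, self.labels[idx]
```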
SpeakerGuard currently incorporates four white-box attacks (FGSM, PGD, CW$_\infty$ and CW$_2$) and two black-box attacks (FAKEBOB and SirenAttack).
To add a new attack, one can define a new subclass of the abstract class `Attack` and implement the `attack` method. This design ensures that the `attack` methods of the different concrete `Attack` classes share the same signature, i.e., a unified API.
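A rough sketch of what a new attack could look like is given below. The import path of `Attack`, the constructor arguments, and the `attack(x, y)` signature are illustrative assumptions; check the existing attacks in the repository for the real interface.

```python
import torch
import torch.nn.functional as F

from attack.Attack import Attack  # hypothetical import path

class MyOneStepAttack(Attack):
    """Illustrative FGSM-style one-step attack; not part of SpeakerGuard."""

    def __init__(self, model, epsilon=0.002):
        self.model = model
        self.epsilon = epsilon

    def attack(self, x, y):
        # take one sign-gradient step away from the true labels y
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(self.model(x), y)
        loss.backward()
        return (x + self.epsilon * x.grad.sign()).detach()
```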
To secure SRSs from adversarial attacks, SpeakerGuard provides 2 robust training methods (FGSM and PGD adversarial training) and 22 speech/speaker-dedicated input transformation methods, including our feature-level approach Feature Compression (FeCo).
Since all our defenses are standalone functions, adding a new defense is straightforward: one just needs to implement it as a Python function that accepts the input audios or features as one of its arguments.
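For example, a QT-like waveform quantization defense could be written as the following sketch (the function name and the assumption that audios are float tensors in [-1, 1] are illustrative):

```python
import torch

def my_quantization(audios, q=512):
    """Illustrative waveform-level defense in the spirit of QT:
    snap each sample to one of q uniformly spaced levels.
    Assumes `audios` is a float tensor with values in [-1, 1]."""
    return torch.round(audios * q) / q
```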
All these adaptive attack techniques are implemented as standalone wrappers so that they can be easily plugged into attacks to mount adaptive attacks.
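For instance, EOT (used above against the randomized defenses AT and FeCo) boils down to averaging input gradients over the defense's randomness. A minimal sketch of such a wrapper (the helper name and signature are illustrative, not SpeakerGuard's actual API):

```python
import torch

def eot_gradient(loss_fn, x, n_samples=10):
    """Illustrative EOT helper: average the input gradient of a randomized
    loss (e.g., one computed through a random defense) over n_samples draws."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        x_i = x.clone().detach().requires_grad_(True)
        loss_fn(x_i).backward()  # loss_fn applies the (random) defense internally
        grad += x_i.grad
    return grad / n_samples
```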