This repo contains the implementation of the paper "Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection", by Yi Zhu, Saurabh Powar, and Tiago H. Falk.
- Python == 3.10.2
- PyTorch == 1.13.1
- SpeechBrain == 0.5.14
- torchaudio == 0.13.1
```bash
cd YOUR-PROJECT-FOLDER
git clone Universal-representation-dynamics-of-deepfake-speech
pip install -r requirements.txt
```
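After installing, a quick check that the environment matches the versions listed above may save debugging time later (a minimal sketch; the version attributes are assumed to be exposed by each package):

```python
import torch
import torchaudio
import speechbrain

# Expect 1.13.1 / 0.13.1 / 0.5.14 per the versions listed above
print(torch.__version__, torchaudio.__version__, speechbrain.__version__)
```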
We employed data from the ASVspoof 2019 LA track and the ASVspoof 2021 DF track. Unfortunately, we are not authorised to redistribute the data or labels for either track. Related information can be found on the challenge website.
The 2019 LA track includes training, development, and evaluation sets, all zipped in the `LA.zip` file. Download link.
The 2021 DF track uses the training and development data from the 2019 LA track, which are already included in the `LA.zip` file. The evaluation data can be accessed here.
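After downloading, the archives need to be unzipped before use. A minimal sketch using the Python standard library, assuming `LA.zip` sits in the current folder and `data/` is your target directory:

```python
import zipfile

# Extract the downloaded ASVspoof archive; adjust both paths to your own setup.
with zipfile.ZipFile("LA.zip") as z:
    z.extractall("data/")
```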
We offer two ways to replicate our results:
- Run `python exps/train.py exps/hparams/XXX.yaml` on your machine. This automatically trains and evaluates the model. However, you may first need to unzip all the downloaded files and then edit the corresponding data paths in the `.yaml` file to point to your own data files (see the sketch after this list).
- Run `sbatch run.sh`. This bash script was submitted to the Compute Canada cluster for model training and evaluation, so you may need to alter a few lines to meet your own requirements. The script moves all data to the desired folder, unzips it, and evaluates the models. More detailed instructions are provided in the `batch_scripts` folder.
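As an alternative to editing the `.yaml` file by hand, SpeechBrain hparams files can be loaded with overrides. A minimal sketch, assuming a hypothetical `data_folder` key (check the actual `.yaml` for the real key names):

```python
from hyperpyyaml import load_hyperpyyaml

# Load an hparams file and override its data path at load time.
# "data_folder" is a hypothetical key; match it to the actual .yaml file.
with open("exps/hparams/XXX.yaml") as f:
    hparams = load_hyperpyyaml(f, overrides={"data_folder": "/path/to/your/data"})
```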
We will soon release our pre-trained models in this repo.
One of the key elements in our model is the modulation transformation block, which converts any 2D (feature-by-time) representation into another 2D dynamic representation. We experimented with wav2vec 2.0 and WavLM in this project, but the transformation can be applied to other representations as well.
For flexibility, we defined an independent class `modulator` in `ssl_family.py`. This class can be integrated with other DL model blocks. An example usage is provided below:
```python
import torch
from ssl_family import modulator

# Instantiate the modulation transformation block (MTB)
MTB = modulator(
    sample_rate=50,   # frame rate (Hz) of the input representation
    win_length=128,
    hop_length=32,
)

input = torch.randn((1, 1000, 768))  # (batch, time, feature_channel)
output = MTB(input)
print(output.shape)
```
Below is a visualization of the modulation dynamics of different deepfakes (same speech content, same speaker).
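A similar figure can be reproduced from the example output above. A minimal sketch, assuming the output keeps a (batch, time, feature) layout:

```python
import matplotlib.pyplot as plt

# Plot the modulation dynamics computed in the example above.
# Assumes `output` has shape (batch, time, feature).
plt.imshow(output[0].detach().numpy().T, aspect="auto", origin="lower")
plt.xlabel("Time frame")
plt.ylabel("Modulation feature channel")
plt.colorbar()
plt.show()
```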
For questions, contact us at Yi.Zhu@inrs.ca.