- ArEnAV has been accepted to ICLR-2026!! π₯π₯π₯
- ArEnAV-Full and ArEnAV-Preview is open-sourced π₯π₯π₯
The dataset is under the EULA. You need to agree and sign the EULA to access the dataset. A part of the dataset is open for viewing the samples, from all the splits. It can be accessed here: ArEnAV-Preview.
Once the EULA has been signed and emailed, access the dataset from here: Download β Hugging Face β ArEnAV-Full
git lfs install && git clone https://huggingface.co/datasets/kartik060702/ArEnAV-Full
Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research.
We show the data generation pipeline for ArEnAV dataset. In a) input videos are analysed
for audio, face, and text extraction. Using few-shot prompts with GPT-4.1-mini, CSW-based spoken
text manipulation is performed. This is followed by speech and face enactment generation. b-d) The
plots show the data splits and CSW distribution.

- Transcript rewrite β GPT-4.1-mini applies eight edit modes (meaning-only, dialect-shift, translationβ¦).
- Speech synthesis β XTTS-v2 / Fairseq-Arabic TTS / GPT-TTS + OpenVoice-v2 clone the original speaker.
- Lip-sync β Diff2Lip & LatentSync diffusion models render mouth motion; 25 random audio/visual perturbations simulate real-world noise.
Dataset distribution for i) Train, ii) Val and iii) Test split. The outer layer shows the split between various combinations of Text-to-Speech and Lip-Sync models used for audio-visual manipulation. The middle layer shows the distribution of language in the original transcript, which is either Ar (Arabic) or CSW (Code-Switched English-Arabic). The inner layer shows the distribution of different operations applied to the original transcripts, "meaning only", "dialect+meaning", and "meaning + translation"
Cross-Dataset Comparison: Details for publicly available deep - fake datasets in chronologically ascending order.
Cla: Binary classification, SL: Spatial localization, TL: Temporal localization, FS: Face swapping, RE: Face reenactment, TTS: Text-to-speech, VC: Voice conversion.
| Dataset | Year | Tasks | Modality | Method | #Clips | Multilingual | Code-Switch |
|---|---|---|---|---|---|---|---|
| Google DFD | 2019 | Cla | V | FS | 3 431 | β | β |
| DFDC | 2020 | Cla | AV | FS | 128 154 | β | β |
| DeeperForensics | 2020 | Cla | V | FS | 60 000 | β | β |
| Celeb-DF | 2020 | Cla | V | FS | 6 229 | β | β |
| KoDF | 2021 | Cla | V | FS/RE | 237 942 | β | β |
| FakeAVCeleb | 2021 | Cla | AV | RE | 25 500+ | β | β |
| ForgeryNet | 2021 | SL/TL/Cla | V | FS/RE | 221 247 | β | β |
| ASVSpoof2021-DF | 2021 | Cla | A | TTS/VC | 593 253 | β | β |
| LAV-DF | 2022 | TL/Cla | AV | RE/TTS | 136 304 | β | β |
| DF-Platter | 2023 | Cla | V | FS | 265 756 | β | β |
| AV-1M | 2023 | TL/Cla | AV | RE/TTS | 1 146 760 | β | β |
| PolyGlotFake | 2024 | Cla | AV | RE/TTS/VC | 15 238 | β | β |
| Illusion | 2025 | Cla | AV | FS/RE/TTS | 1 376 371 | β | β |
| ArEnAV (ours) | 2025 | Cla/TL | AV | RE/TTS/VC | 387 072 | β | β |
Multilingual Datasets Comparison: Data distribution in ArEnAV and comparison with other multilingual datasets.
| Subset | Unique Vids | Real | Fake | Non-Eng | CSW Vids | Arabic Vids | Arabic Variants |
|---|---|---|---|---|---|---|---|
| PolyGlotFake | 766 | 766 | 14 472 | 11 941 | 0 | 1 403 | β |
| Illusion | β | 141 440 | 1 234 931 | 4 385 | 0 | β | β |
| Train | 6 117 | 67 600 | 202 800 | 270 400 | 69 544 | 200 856 | Eg, MSA, Lev, Gulf |
| Val | 876 | 9 560 | 28 680 | 38 240 | 10 416 | 27 824 | Eg, MSA, Lev, Gulf |
| Test | 1 816 | 19 608 | 58 824 | 78 432 | 19 832 | 58 600 | Eg, MSA, Lev, Gulf |
| Total | 8 809 | 96 768 | 290 304 | 387 072 | 99 792 | 287 280 | β |
(Full table preserved for reproducibility)
| Set | Method | Mod. | APΒ @0.5 | APΒ @0.75 | APΒ @0.9 | APΒ @0.95 | ARΒ @50 | ARΒ @30 | ARΒ @20 | ARΒ @10 | ARΒ @5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | Meso4 | V | 0.02 | 0.01 | 0.00 | 0.00 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
| MesoInception4 | V | 0.56 | 0.18 | 0.04 | 0.01 | 4.11 | 4.11 | 4.11 | 4.11 | 4.08 | |
| Xception | V | 22.50 | 10.26 | 2.29 | 0.58 | 19.13 | 19.13 | 19.13 | 19.13 | 19.13 | |
| BA-TFD (ZS) | AV | 0.17 | 0.01 | 0.00 | 0.00 | 9.72 | 5.20 | 3.07 | 1.46 | 0.73 | |
| BA-TFD+ (ZS) | AV | 0.11 | 0.00 | 0.00 | 0.00 | 5.77 | 2.95 | 2.09 | 0.87 | 0.37 | |
| BA-TFD | AV | 2.42 | 0.55 | 0.01 | 0.00 | 22.30 | 10.31 | 3.41 | 2.54 | 1.67 | |
| BA-TFD+ | AV | 3.74 | 1.10 | 0.06 | 0.01 | 30.75 | 9.42 | 4.55 | 3.05 | 1.83 | |
| Set V | Meso4 | V | 0.02 | 0.01 | 0.00 | 0.00 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
| MesoInception4 | V | 0.83 | 0.27 | 0.05 | 0.01 | 5.56 | 5.56 | 5.56 | 5.56 | 5.53 | |
| Xception | V | 32.76 | 14.48 | 3.30 | 0.81 | 27.78 | 27.78 | 27.78 | 27.78 | 27.78 | |
| BA-TFD (ZS) | AV | 0.12 | 0.00 | 0.00 | 0.00 | 8.44 | 4.34 | 2.44 | 1.13 | 0.49 | |
| BA-TFD+ (ZS) | AV | 0.07 | 0.00 | 0.00 | 0.00 | 4.69 | 2.39 | 1.65 | 0.69 | 0.29 | |
| BA-TFD | AV | 3.65 | 0.25 | 0.01 | 0.00 | 25.31 | 9.03 | 3.64 | 2.34 | 1.64 | |
| BA-TFD+ | AV | 5.65 | 1.89 | 0.08 | 0.02 | 31.09 | 13.21 | 5.91 | 3.05 | 2.05 | |
| Set A | Meso4 | V | 0.02 | 0.01 | 0.00 | 0.00 | 0.08 | 0.08 | 0.08 | 0.08 | 0.08 |
| MesoInception4 | V | 0.38 | 0.09 | 0.01 | 0.00 | 3.25 | 3.25 | 3.25 | 3.25 | 3.22 | |
| Xception | V | 14.72 | 3.92 | 0.29 | 0.09 | 11.78 | 11.78 | 11.78 | 11.78 | 11.78 | |
| BA-TFD (ZS) | AV | 0.23 | 0.01 | 0.00 | 0.00 | 12.14 | 6.46 | 3.85 | 1.83 | 0.95 | |
| BA-TFD+ (ZS) | AV | 0.14 | 0.01 | 0.00 | 0.00 | 7.32 | 3.79 | 2.69 | 1.13 | 0.48 | |
| BA-TFD | AV | 3.21 | 0.60 | 0.02 | 0.00 | 24.45 | 9.26 | 4.15 | 2.61 | 1.93 | |
| BA-TFD+ | AV | 4.35 | 1.10 | 0.10 | 0.00 | 28.35 | 11.23 | 4.85 | 3.11 | 2.00 |
| Access | Pre-train | Method | Mod. | Full AUC | Full Acc | V AUC | V Acc | A AUC | A Acc |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | ASVSpoof-19 | XLSR-Mamba | A | 39.19 | 52.77 | 52.73 | 40.68 | 52.50 | 42.59 |
| Zero-shot | Internet | Video-LLaMA-7B | V | 51.48 | 26.29 | 51.47 | 34.21 | 51.43 | 34.18 |
| Zero-shot | Internet | Video-LLaMA-7B | AV | 48.79 | 59.29 | 48.71 | 55.37 | 48.86 | 55.26 |
| Zero-shot | AV-1M | BA-TFD | AV | 61.73 | 26.00 | 66.42 | 34.07 | 59.36 | 33.97 |
| Zero-shot | AV-1M | BA-TFD+ | AV | 60.96 | 25.84 | 64.49 | 34.28 | 59.44 | 33.80 |
| Video-level | ArEnAV | XLSR-Mamba | A | 73.00 | 61.00 | 57.47 | 66.16 | 86.33 | 78.00 |
| Video-level | ArEnAV | Meso4 | V | 49.30 | 75.00 | 49.15 | 66.67 | 49.30 | 66.67 |
| Video-level | ArEnAV | MesoInception4 | V | 50.34 | 46.23 | 50.28 | 47.48 | 50.35 | 47.67 |
| Video-level | ArEnAV | Xception | V | 50.05 | 75.00 | 49.90 | 66.67 | 50.32 | 66.67 |
| Frame-level | ArEnAV | Meso4 | V | 49.55 | 26.60 | 49.60 | 34.40 | 49.53 | 34.36 |
| Frame-level | ArEnAV | MesoInception4 | V | 51.14 | 41.25 | 50.77 | 51.84 | 45.28 | 44.09 |
| Frame-level | ArEnAV | Xception | V | 74.21 | 52.09 | 85.36 | 67.22 | 68.59 | 51.70 |
| Frame-level | AV-1M + ArEnAV | BA-TFD | AV | 75.91 | 44.31 | 77.64 | 58.29 | 72.21 | 45.21 |
| Frame-level | AV-1M + ArEnAV | BA-TFD+ | AV | 79.97 | 27.44 | 84.20 | 36.47 | 72.89 | 34.56 |
Results on ArEnAv, AV-1M and LAV-DF. The low performance on ArEnAV demonstrates the data complexity in CSW settings.
| Method | Dataset | AP@0.5 | AP@0.75 | AP@0.95 | AR@50 | AR@20 | AR@10 |
|---|---|---|---|---|---|---|---|
| BA-TFD | LAV-DF | 79.15 | 38.57 | 0.24 | 64.18 | 60.89 | 58.51 |
| AV-1M | 37.37 | 6.34 | 0.02 | 45.55 | 35.95 | 30.66 | |
| ArEnAV | 2.42 | 0.55 | 0.01 | 22.30 | 3.41 | 2.54 | |
| BA-TFD+ | LAV-DF | 96.30 | 84.96 | 4.44 | 80.48 | 79.40 | 78.75 |
| AV-1M | 44.42 | 13.64 | 0.03 | 48.86 | 40.37 | 34.67 | |
| ArEnAV | 3.74 | 1.10 | 0.04 | 30.75 | 4.55 | 3.05 |
User study results show that the deepfake detection and localization in multilingual CSW videos is non-trivial for human observers.
| Metric | Value |
|---|---|
| Accuracy | 60 % |
| AP @ 0.1 | 8.35 |
| AP @ 0.5 | 0.79 |
| AR @ 1 | 1.38 |
ArEnAV reveals that state-of-the-art detectors struggle badly with cross-lingual, code-switched deepfakes (β₯ 35 pp performance drop vs. monolingual benchmarks).
We hope this dataset drives research toward robust, multilingual, audio-visual deepfake detection.
We are grateful to all the amazing open-sourced TTS and Lip-sync models that made this project possible, including COQUI-TTS, Diff2Lip and LatentSync.
@misc{kuckreja2025tellhabibirealfake,
title={Tell me Habibi, is it Real or Fake?},
author={Kartik Kuckreja and Parul Gupta and Injy Hamed and Thamar Solorio and Muhammad Haris Khan and Abhinav Dhall},
year={2025},
eprint={2505.22581},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.22581},
}

