DeepfakeGenome: Toward Next-Generation Deepfake Attribution

Abstract

In recent years, AIGC technologies have become capable of generating hyper-realistic forged facial images, posing severe threats to facial security. To tackle this threat, early Deepfake research primarily centered on real-fake classification. Recently, growing efforts have been devoted to the Deepfake Attribution (DFA) of generated content.

Nevertheless, existing research on Deepfake detection and attribution suffers from saturated binary-classification performance, limited diversity in datasets and algorithms, and imperfect evaluation protocols, all of which severely impede practical application.

To address these limitations, we propose a comprehensive deepfake detection and attribution benchmark named DeepfakeGenome (DFG). It contains 100 facial forgery algorithms and 2M images in total, making it 4× to 100× larger than prior DFA benchmarks. We further design 4 protocols for practical evaluation, including a novel retrieval-based attribution paradigm. Unlike previous open-set evaluation metrics, the proposed retrieval metrics align more closely with the blacklist registration mechanisms used in real-world active defense.

Based on these designs, we investigate the performance ceiling of the deepfake attribution task. We conduct over 2k experimental evaluations and derive 10 insightful findings. We hope this work provides new insights for the DFA research field.

DeepfakeGenome Benchmark

We systematically integrate and deduplicate existing deepfake and DFA datasets (FF++, DF40, ForgeryNet, DFFD, DNA-Det, OSMA, Wild-20, DiFF, DiffusionFace), obtaining 92 deepfake algorithms. We then synthesize images with 8 recently accessible Text-to-Image algorithms: Qwen-Image, BAGEL, HunyuanImage-2.1/3.0, Infinity, Hart, Nano Banana, and GPT-4o. In total, the DeepfakeGenome benchmark includes 100 face forgery algorithms and 2M frames. It surpasses existing DFA datasets in algorithm diversity, frame volume, and the novelty of its forgery algorithms.

In terms of forgery types, the DFG benchmark contains four categories: 18 Face Swapping (FS), 18 Face Reenactment (FR), 48 Entire Face Synthesis (EFS), and 16 Face Editing (FE) algorithms.

In terms of forgery algorithm architectures, the DFG contains 6 classic Convolutional Neural Network (CNN), 3 Computer Graphics (CG), 41 Generative Adversarial Network (GAN), 35 Diffusion, 13 Closed-source, and 2 Visual Autoregressive (VAR) algorithms.
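As a quick sanity check, the counts quoted above can be tallied to confirm that each breakdown covers all 100 algorithms. The snippet below is only an illustrative sketch: the dictionary names and the short keys are our own labels, not identifiers from the DFG release.

```python
# Counts copied from the text above; variable names are illustrative only.
sources = {"integrated_from_existing_datasets": 92, "newly_synthesized_t2i": 8}

forgery_types = {"FS": 18, "FR": 18, "EFS": 48, "FE": 16}

architectures = {
    "CNN": 6,
    "CG": 3,
    "GAN": 41,
    "Diffusion": 35,
    "Closed-source": 13,
    "VAR": 2,
}

# Each view of the benchmark should partition the same 100 algorithms.
totals = {
    "sources": sum(sources.values()),
    "forgery_types": sum(forgery_types.values()),
    "architectures": sum(architectures.values()),
}

for name, total in totals.items():
    print(f"{name}: {total}")
```

All three tallies come to 100, so the source, forgery-type, and architecture breakdowns are mutually consistent.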

All 100 Algorithms in DFG

Summary of 100 facial forgery algorithms

4 Protocols

To fully exploit the abundant data in DFG, we design 4 data partitioning protocols for our experiments. The overall design criterion is that data in the FF++ domain is used for training, and data in other domains is used for performance evaluation.

Protocol-1. We follow the traditional deepfake setting: we train on the vanilla FF++ dataset (DeepFakes, Face2Face, FaceSwap, and NeuralTextures) and evaluate ACC and AUC on 59 new algorithms of DFG. All algorithms in DF40 are excluded because their performance has already been explored in the DF40 paper.

Protocol-2 and Protocol-3. The 36 algorithms generated in the FF++ domain are adopted as the training set. The testing set consists of 36 algorithms in the FF++ domain (Protocol-2) or 34 algorithms in the CDF domain (Protocol-3). This yields the Same Domain, Same Algorithm setting and the Cross Domain, Same Algorithm setting, respectively.

Protocol-4. We adopt the 36 algorithms in the FF++ domain as the training set and conduct a retrieval test on 65 algorithms outside the FF++ domain together with the 36 training algorithms. This Cross Domain, Cross Algorithm setting is the most suitable for real-world application scenarios.
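The retrieval test in Protocol-4 can be pictured as matching query images against a gallery of registered forgery samples. The text does not specify the similarity measure or the exact retrieval metric, so the sketch below assumes cosine similarity and Recall@k purely for illustration; the function names and the toy feature vectors are hypothetical and not taken from the DFG code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(gallery, queries, k=1):
    """Fraction of queries whose algorithm label appears among the top-k gallery matches.

    gallery, queries: lists of (feature_vector, algorithm_label) pairs.
    Recall@k is an assumed metric; the benchmark's own metric may differ.
    """
    hits = 0
    for q_feat, q_label in queries:
        # Rank registered gallery entries by similarity to the query feature.
        ranked = sorted(gallery, key=lambda item: cosine(q_feat, item[0]), reverse=True)
        if any(label == q_label for _, label in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy example: three registered algorithms, three queries (one mislabeled on purpose).
gallery = [([1.0, 0.0], "algo_a"), ([0.0, 1.0], "algo_b"), ([-1.0, 0.0], "algo_c")]
queries = [([0.9, 0.1], "algo_a"), ([0.1, 0.9], "algo_b"), ([-1.0, 0.2], "algo_a")]

r1 = recall_at_k(gallery, queries, k=1)
r3 = recall_at_k(gallery, queries, k=3)
print(f"Recall@1 = {r1:.3f}, Recall@3 = {r3:.3f}")
```

In practice the feature vectors would come from a model trained on the 36 FF++-domain algorithms, and the gallery would hold registered samples of the algorithms to be recognized.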

10 Insightful Findings

Finding-1. Traditional Deepfake training settings may not be suitable for evaluating emerging algorithms.
Finding-2. CLIP-style models show serious performance deviation across sub-class algorithms.
Finding-3. Latent space augmentation and representation learning strategies can significantly improve the generalization on new attacks.
Finding-4. DINOv2 performs best in attribution, while CLIP performs best in OOD retrieval.
Finding-5. RepMix and NPR achieve the best performance among the existing attribution methods.
Finding-6. CLIP generalizes while DINOv2 collapses in OOD settings.
Finding-7. Frequency and structural traces generalize better than specific spatial features.
Finding-8. Contrastive learning biases the feature space and hurts OOD generalization performance.
Finding-9. The OOD generalization ability of CLIP-ViTs relies heavily on pre-trained semantic priors.
Finding-10. Low-level and high-level features are complementary for OOD generalization.

Visualization of the 8 Synthesized Algorithms in DFG