In recent years, AI-generated content (AIGC) technologies have become capable of generating hyper-realistic forged facial images, posing severe threats to facial security. To tackle this challenge, early Deepfake research primarily centered on real-fake classification. Recently, growing efforts have been devoted to the Deepfake Attribution (DFA) of generated content.
Nevertheless, existing research on Deepfake detection and attribution faces saturated performance in binary classification, limited diversity in datasets and algorithms, and imperfect evaluation protocols, which severely impede practical application.
To address these limitations, we propose a comprehensive deepfake detection and attribution benchmark named DeepfakeGenome (DFG). It contains 100 facial forgery algorithms and 2M images in total, making it 4× to 100× larger than prior DFA benchmarks. We further design 4 protocols for practical evaluation, including a novel retrieval-based attribution paradigm. Unlike previous open-set evaluation metrics, the proposed retrieval metrics are better aligned with real-world active-defense scenarios built on blacklist registration mechanisms.
Based on these designs, we investigate the performance ceiling of the deepfake attribution task. Over 2k experimental evaluations are conducted, and 10 insightful findings are derived. We hope this work can provide new insights for the DFA research field.
We systematically integrate and deduplicate existing deepfake and DFA datasets (FF++, DF40, ForgeryNet, DFFD, DNA-Det, OSMA, Wild-20, DiFF, DiffusionFace), which yields 92 deepfake algorithms. We then synthesize data with 8 recently accessible Text-to-Image algorithms: Qwen-Image, BAGEL, HunyuanImage-2.1/3.0, Infinity, Hart, Nano Banana and GPT-4o. In total, the DeepfakeGenome benchmark covers 100 face forgery algorithms and 2M frames. It surpasses existing DFA datasets in the diversity of algorithms, the volume of frames and the novelty of forgery algorithms.
In terms of forgery types, the DFG benchmark contains four categories: 18 Face Swapping (FS), 18 Face Reenactment (FR), 48 Entire Face Synthesis (EFS), and 16 Face Editing (FE) algorithms.
In terms of forgery algorithm architectures, the DFG contains 6 classic Convolutional Neural Network (CNN), 3 Computer Graphics (CG), 41 Generative Adversarial Network (GAN), 35 Diffusion, 13 closed-source, and 2 Visual Autoregressive (VAR) algorithms.
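The two breakdowns above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the text, while the dictionary names and the `share` helper are hypothetical and not part of any released DFG tooling.

```python
# Algorithm counts as stated in the text; names below are illustrative only.
FORGERY_TYPES = {"FS": 18, "FR": 18, "EFS": 48, "FE": 16}
ARCHITECTURES = {
    "CNN": 6,
    "CG": 3,
    "GAN": 41,
    "Diffusion": 35,
    "Closed-source": 13,
    "VAR": 2,
}


def share(counts):
    """Return each category's fraction of the total number of algorithms."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Both views partition the same set of 100 algorithms.
assert sum(FORGERY_TYPES.values()) == 100
assert sum(ARCHITECTURES.values()) == 100

for name, frac in share(ARCHITECTURES).items():
    print(f"{name}: {frac:.0%}")
```

Both taxonomies sum to 100, so each algorithm is assigned exactly one forgery type and one architecture family under this reading.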
To fully exploit the abundant data in DFG, we design 4 data partitioning protocols for our experiments. The overall design criterion is that data in the FF++ domain is used for training and data in other domains is used for evaluation. In Protocol-1, we follow the traditional deepfake detection setting: we train on the vanilla FF++ dataset (DeepFakes, Face2Face, FaceSwap and NeuralTextures) and evaluate ACC and AUC on 59 new algorithms of DFG (all algorithms in DF40 are removed because their performance has already been explored in the DF40 paper). In Protocol-2 and Protocol-3, 36 algorithms generated in the FF++ domain are adopted as the training set; the test sets consist of 36 algorithms in the FF++ domain and 34 algorithms in the CDF domain, respectively. This constructs the Same Domain, Same Algorithm and the Cross Domain, Same Algorithm settings. In Protocol-4, we again adopt the 36 algorithms in the FF++ domain as the training set and conduct retrieval tests on 65 algorithms outside the FF++ domain together with the 36 training algorithms. This Cross Domain, Cross Algorithm setting is the most suitable for real-world application scenarios.
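To make the retrieval-based paradigm of Protocol-4 more concrete, the following is a minimal sketch of how a retrieval test could be scored. The text does not specify which retrieval metrics DFG uses, so Rank-k accuracy over cosine similarity is an assumption here, and the function names, the toy embeddings and the labels `algo_A`, `algo_B`, `algo_C` are all hypothetical. The gallery plays the role of a registered blacklist: each entry is an embedding of a known forgery sample together with its algorithm label.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_k_accuracy(queries, gallery, k=1):
    """Fraction of queries whose true algorithm appears among the k most
    similar gallery entries. Both inputs are lists of (embedding, label)."""
    hits = 0
    for q_emb, q_label in queries:
        ranked = sorted(
            gallery, key=lambda item: cosine(q_emb, item[0]), reverse=True
        )
        top_labels = [label for _, label in ranked[:k]]
        hits += q_label in top_labels
    return hits / len(queries)


# Toy 2-D embeddings; real features would come from an attribution model.
gallery = [([1.0, 0.0], "algo_A"), ([0.0, 1.0], "algo_B"), ([0.7, 0.7], "algo_C")]
queries = [([0.9, 0.1], "algo_A"), ([0.1, 0.9], "algo_B"), ([0.6, 0.8], "algo_A")]

print(rank_k_accuracy(queries, gallery, k=1))
```

Under this formulation a newly observed forgery algorithm can be handled by adding its samples to the gallery, without retraining a closed-set classifier, which is one plausible reading of why retrieval fits a blacklist registration mechanism.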
@article{anonymous2026deepfakegenome,
author = {Anonymous},
title = {DeepfakeGenome: Toward Next-Generation Deepfake Attribution},
year = {2026},
}