Deepfake Detection with Spatio-Temporal Consistency and Attention

Abstract

Deepfake videos are causing growing concern due to their ever-increasing realism, and automated detection of forged videos is attracting a proportional amount of interest from researchers. Current detection methods rely mainly on global frame features and under-utilize the spatio-temporal inconsistencies present in manipulated videos. Moreover, they fail to attend to the subtle, well-localized pattern variations specific to manipulation along both the spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of forged videos at both the individual-frame and frame-sequence levels. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream is further aided by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that also allows fusion of temporal attention maps with the features learned at the deeper layers. The overall model is trained as a classifier to detect forged content. We evaluate our method on two popular large datasets and achieve significant performance gains over the state of the art. Moreover, our technique also provides memory and computational advantages over competitive techniques.
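The paper's code is not reproduced here, but the two attention ideas the abstract names can be illustrated with a minimal sketch. The sketch below is a hypothetical NumPy illustration, not the authors' implementation: it computes a softmax spatial attention map over a frame's feature grid, and a "distance attention" over a frame sequence that up-weights frames whose descriptors deviate from the sequence mean (a plausible reading of attending to temporal inconsistency). Function names, shapes, and the distance heuristic are all assumptions for illustration.

```python
import numpy as np

def spatial_attention(feat):
    """Hypothetical spatial attention over one frame.

    feat: (H, W, C) feature map, e.g. from a shallow ResNet stage.
    Returns an attention-pooled (C,) descriptor and the (H, W) map.
    """
    energy = feat.mean(axis=-1)                # channel-pooled energy, (H, W)
    w = np.exp(energy - energy.max())          # softmax over spatial positions
    w = w / w.sum()
    descriptor = (feat * w[..., None]).sum(axis=(0, 1))   # weighted pooling -> (C,)
    return descriptor, w

def temporal_distance_attention(seq):
    """Hypothetical distance attention over a frame sequence.

    seq: (T, C) per-frame descriptors. Frames far (in L2 distance)
    from the sequence mean -- candidate manipulated frames -- receive
    higher attention weight.
    """
    center = seq.mean(axis=0)
    dist = np.linalg.norm(seq - center, axis=1)    # (T,)
    a = np.exp(dist - dist.max())                  # softmax over frames
    a = a / a.sum()
    fused = (seq * a[:, None]).sum(axis=0)         # attention-fused (C,) feature
    return fused, a

# Usage: 8 frames of 16x16x32 features -> one sequence-level feature.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16, 16, 32))
descriptors = np.stack([spatial_attention(f)[0] for f in frames])
video_feature, weights = temporal_distance_attention(descriptors)
```

In the actual model, such fused attention features would feed deeper layers and a binary real/fake classifier head; this sketch only shows the attention pooling itself.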

@article{chen2025_2502.08216,
  title={Deepfake Detection with Spatio-Temporal Consistency and Attention},
  author={Yunzhuo Chen and Naveed Akhtar and Nur Al Hasan Haldar and Ajmal Mian},
  journal={arXiv preprint arXiv:2502.08216},
  year={2025}
}