24
11

FaceXFormer: A Unified Transformer for Facial Analysis

Abstract

In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. These tasks include face parsing, landmark detection, head pose estimation, attribute prediction, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility. Traditional face analysis approaches rely on task-specific architectures and pre-processing techniques, limiting scalability and integration. In contrast, FaceXFormer employs a transformer-based encoder-decoder architecture, where each task is represented as a learnable token, enabling seamless multi-task processing within a unified model. To enhance efficiency, we introduce FaceX, a lightweight decoder with a novel bi-directional cross-attention mechanism, which jointly processes face and task tokens to learn robust and generalized facial representations. We train FaceXFormer on ten diverse face perception datasets and evaluate it against both specialized and multi-task models across multiple benchmarks, demonstrating state-of-the-art or competitive performance. Additionally, we analyze the impact of various components of FaceXFormer on performance, assess real-world robustness in "in-the-wild" settings, and conduct a computational performance evaluation. To the best of our knowledge, FaceXFormer is the first model capable of handling ten facial analysis tasks while maintaining real-time performance at 33.21 FPS. Code:this https URL

View on arXiv
@article{narayan2025_2403.12960,
  title={ FaceXFormer: A Unified Transformer for Facial Analysis },
  author={ Kartik Narayan and Vibashan VS and Rama Chellappa and Vishal M. Patel },
  journal={arXiv preprint arXiv:2403.12960},
  year={ 2025 }
}
Comments on this paper