DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention

Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex, gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits facial images in three respects to improve overall performance: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from cropped eye patches), and head pose estimation features. First, we design a new continuous mask-based Disentangler that accurately separates gaze-relevant from gaze-irrelevant information in facial images, realizing the dual-branch disentanglement objective by reconstructing the eye and non-eye regions separately. Furthermore, we introduce a new cascaded attention module, the Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively attends to global and local information at multiple scales, further enhancing the features from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper facial branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
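The abstract does not specify implementation details, but the three-stream design it describes can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the module names (ContinuousMaskDisentangler, GazeHead), the sigmoid-valued mask as the "continuous mask" mechanism, and all layer sizes are hypothetical stand-ins, not the authors' architecture.

```python
import torch
import torch.nn as nn


class ContinuousMaskDisentangler(nn.Module):
    """Hypothetical sketch: a shared encoder produces facial features, and a
    continuous (sigmoid-valued) mask splits them into a gaze-relevant stream
    and a gaze-irrelevant stream; in the paper, each stream would drive the
    reconstruction of the eye and non-eye regions, respectively."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the real backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 1), nn.Sigmoid()
        )

    def forward(self, face: torch.Tensor):
        feats = self.encoder(face)
        mask = self.mask_head(feats)          # continuous values in (0, 1)
        gaze_relevant = feats * mask          # branch reconstructing eye regions
        gaze_irrelevant = feats * (1 - mask)  # branch reconstructing non-eye regions
        return gaze_relevant, gaze_irrelevant


class GazeHead(nn.Module):
    """Fuses the disentangled global features with eye-patch and head-pose
    features, then regresses the gaze direction (pitch, yaw)."""

    def __init__(self, global_dim: int = 256, eye_dim: int = 128, pose_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(global_dim + eye_dim + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (pitch, yaw) in radians
        )

    def forward(self, global_feat, eye_feat, pose_feat):
        fused = torch.cat([global_feat, eye_feat, pose_feat], dim=-1)
        return self.mlp(fused)


# Usage sketch: eye and pose features would come from their own extractors.
disentangler = ContinuousMaskDisentangler()
face = torch.randn(4, 3, 224, 224)
gaze_rel, _ = disentangler(face)
global_feat = gaze_rel.mean(dim=(2, 3))  # global average pooling to a vector
gaze = GazeHead()(global_feat, torch.randn(4, 128), torch.randn(4, 16))
```

The multiplicative mask keeps the split differentiable, so both reconstruction branches can supervise the Disentangler end to end; the MS-GLAM attention stage, whose internals the abstract leaves unspecified, is omitted here.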
@article{chen2025_2504.11160,
  title   = {DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention},
  author  = {Haohan Chen and Hongjia Liu and Shiyong Lan and Wenwu Wang and Yixin Qiao and Yao Li and Guonan Deng},
  journal = {arXiv preprint arXiv:2504.11160},
  year    = {2025}
}