Combining Facial Videos and Biosignals for Stress Estimation During Driving

Paraskevi Valergaki
Vassilis C. Nicodemou
Iason Oikonomidis
Antonis Argyros
Anastasios Roussos
Main: 12 pages, 9 figures, 3 tables; Bibliography: 3 pages
Abstract

Reliable stress recognition is critical in applications such as medical monitoring and safety-critical systems, including real-world driving. While stress is commonly detected using physiological signals such as perinasal perspiration and heart rate, facial activity provides complementary cues that can be captured unobtrusively from video. We propose a multimodal stress estimation framework that combines facial videos and physiological signals, remaining effective even when biosignal acquisition is challenging. Facial behavior is represented using a dense 3D Morphable Model, yielding a 56-dimensional descriptor that captures subtle expression and head-pose dynamics over time. To study how stress modulates facial motion, we perform extensive experiments alongside established physiological markers. Paired hypothesis tests between baseline and stressor phases show that 38 of 56 facial components exhibit consistent, phase-specific stress responses comparable to physiological markers. Building on these findings, we introduce a Transformer-based temporal modeling framework and evaluate unimodal, early-fusion, and cross-modal attention strategies. Cross-modal attention fusion of 3D-derived facial features with physiological signals substantially improves performance over physiological signals alone, increasing AUROC from 52.7% to 92.0% and accuracy from 51.0% to 86.7%. Although evaluated on driving data, the proposed framework and protocol may generalize to other stress estimation settings.
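
To make the statistical protocol concrete, here is a minimal sketch of a paired baseline-vs-stressor comparison across the 56 facial components. The abstract does not name the test or the multiple-comparison correction, so the Wilcoxon signed-rank test, the Bonferroni correction, the cohort size, and the synthetic data below are all illustrative assumptions, not the authors' setup.

```python
# Paired baseline-vs-stressor test per facial component (illustrative sketch).
# The test choice (Wilcoxon signed-rank), the Bonferroni correction, and the
# cohort size are assumptions; only the 56-component descriptor comes from
# the abstract.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_subjects, n_components = 24, 56  # hypothetical cohort size

# Hypothetical per-subject means of each facial component in the two phases.
baseline = rng.normal(0.0, 1.0, size=(n_subjects, n_components))
stressor = baseline + rng.normal(0.3, 1.0, size=(n_subjects, n_components))

alpha = 0.05 / n_components  # Bonferroni-corrected threshold (assumed)
responsive = [
    c for c in range(n_components)
    if wilcoxon(baseline[:, c], stressor[:, c]).pvalue < alpha
]
print(f"{len(responsive)} of {n_components} components respond to the stressor")
```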
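The cross-modal attention fusion can likewise be sketched in a few lines. The following PyTorch module is an assumption-laden illustration of the general idea (facial tokens attending to physiological tokens before temporal encoding), not the paper's architecture; the layer sizes, number of biosignal channels (`bio_dim`), pooling, and classification head are all hypothetical.

```python
# Illustrative cross-modal attention fusion: 3DMM facial tokens query
# physiological-signal tokens, then a Transformer encoder models time.
# All dimensions and the head are assumptions, not the paper's design.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, face_dim=56, bio_dim=4, d_model=64, n_heads=4):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, d_model)  # project 3DMM descriptor
        self.bio_proj = nn.Linear(bio_dim, d_model)    # project biosignals
        # Facial tokens (queries) attend to physiological tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, 1)  # stress / no-stress logit

    def forward(self, face_seq, bio_seq):
        # face_seq: (B, T, 56) facial descriptors; bio_seq: (B, T, bio_dim)
        q = self.face_proj(face_seq)
        kv = self.bio_proj(bio_seq)
        fused, _ = self.cross_attn(q, kv, kv)  # face attends to biosignals
        z = self.encoder(fused).mean(dim=1)    # temporal mean pooling
        return self.head(z)

model = CrossModalFusion()
logit = model(torch.randn(2, 100, 56), torch.randn(2, 100, 4))
```

Cross-attention of this kind lets the facial stream consult the physiological stream per time step, which is one plausible reading of why the fused model outperforms biosignals alone.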
