Cross-Modal Dual-Causal Learning for Long-Term Action Recognition

9 July 2025

Xu Shaowu


Main: 8 pages; 7 figures; Bibliography: 2 pages; 3 tables

Abstract

Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts.
