Recur, Attend or Convolve? On Whether Temporal Modeling Matters for
Cross-Domain Robustness in Action Recognition
Most action recognition models today are highly parameterized and evaluated on datasets with predominantly spatially distinct classes. It has also been shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape in still-image recognition tasks. Taken together, this raises the suspicion that large video models partly learn spurious correlations rather than tracking relevant shapes over time to infer generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence. In this article, we empirically study whether the choice of low-level temporal modeling has consequences for texture bias and cross-domain robustness. To enable a lightweight and systematic assessment of the ability to capture temporal structure that is not revealed by single frames, we provide the Temporal Shape (TS) dataset, as well as modified domains of Diving48 that allow texture bias to be investigated for video models. We find that, across a variety of model sizes, convolutional-recurrent and attention-based models show better out-of-domain robustness on TS than 3D CNNs. Domain-shift experiments on Diving48 indicate that 3D CNNs and attention-based models exhibit more texture bias than convolutional-recurrent models. Moreover, qualitative examples suggest that convolutional-recurrent models learn more correct class attributes from the diving data than the other two model types at the same global validation performance.
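The parameter-explosion point above can be illustrated with a back-of-the-envelope comparison: a 3D convolution's parameter count grows linearly with its temporal kernel size, whereas a convolutional-recurrent cell reuses the same weights at every time step. The sketch below uses a hypothetical ConvGRU-style cell and assumed channel and kernel sizes for illustration; it is not a model from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights + bias of a single 3D convolution layer."""
    return c_out * c_in * kt * kh * kw + c_out

def convgru_params(c_in, c_hidden, kh, kw):
    """A ConvGRU-style cell: 3 gates (update, reset, candidate),
    each a 2D convolution over the concatenated input and hidden state."""
    return 3 * (c_hidden * (c_in + c_hidden) * kh * kw + c_hidden)

# Hypothetical layer: 64 -> 128 channels, 3x3 spatial kernels.
p3d_t3 = conv3d_params(64, 128, 3, 3, 3)   # temporal extent 3
p3d_t7 = conv3d_params(64, 128, 7, 3, 3)   # temporal extent 7
pgru   = convgru_params(64, 128, 3, 3)     # independent of sequence length

# Widening the temporal receptive field of the 3D conv adds parameters;
# the recurrent cell extends its temporal reach at constant cost.
print(p3d_t3, p3d_t7, pgru)
```

The 3D convolution's count scales with `kt`, so covering longer temporal context means either stacking layers or enlarging kernels, both of which add parameters; the recurrent cell's count is fixed regardless of how many frames it processes.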