TRecViT: A Recurrent Video Transformer

Comments: 8 pages main text, 10 figures, 8 tables, 3 pages bibliography, 3 pages appendix
Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model, ViViT-L, on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOP count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
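To make the factorisation concrete, below is a minimal, illustrative sketch of one such block in PyTorch: a gated linear recurrence mixes information over time, self-attention mixes over space within each frame, and an MLP mixes over channels. The gating scheme, normalisation placement, and all hyper-parameters here are assumptions for illustration only, not the authors' implementation (see the official repository for that); in particular, `SimpleGatedLRU` is a simplified stand-in for the LRU used in the paper.

```python
import torch
import torch.nn as nn


class SimpleGatedLRU(nn.Module):
    """Simplified gated linear recurrence over time (stand-in for the paper's LRU).

    h_t = g_t * h_{t-1} + (1 - g_t) * proj(x_t), with g_t = sigmoid(gate(x_t)).
    The real LRU uses a specific diagonal recurrence parameterisation; this is
    only meant to show where temporal mixing happens in the block.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -- causal scan over the time axis.
        b, t, d = x.shape
        h = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        outs = []
        for step in range(t):
            g = torch.sigmoid(self.gate(x[:, step]))
            h = g * h + (1.0 - g) * self.proj(x[:, step])
            outs.append(h)
        return torch.stack(outs, dim=1)


class TRecViTBlockSketch(nn.Module):
    """One factorised block: LRU over time, self-attention over space, MLP over channels."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.lru = SimpleGatedLRU(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_c = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim) tokens, e.g. from a patch/tubelet embedding.
        b, t, s, d = x.shape
        # Time mixing: run the recurrence independently at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.lru(self.norm_t(xt))
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        # Space mixing: self-attention among tokens within each frame.
        xs = x.reshape(b * t, s, d)
        y = self.norm_s(xs)
        xs = xs + self.attn(y, y, y, need_weights=False)[0]
        # Channel mixing: token-wise MLP.
        xs = xs + self.mlp(self.norm_c(xs))
        return xs.reshape(b, t, s, d)


if __name__ == "__main__":
    block = TRecViTBlockSketch(dim=256)
    video_tokens = torch.randn(2, 8, 196, 256)  # (batch, frames, patches, channels)
    print(block(video_tokens).shape)  # torch.Size([2, 8, 196, 256])
```

Because the only temporal mixing is the causal recurrence, the block processes frames strictly left-to-right, which is consistent with the causality claim in the abstract.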
