In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.