In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.