Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion
We revisit the use of spectral techniques to replaces the attention mechanism in Transformers through Fourier Transform based token mixing, and present a comprehensive and novel reformulation of this technique in next generation transformer models. We provide expanded literature context, detailed mathematical formulations of Fourier mixing and causal masking, and introduce a novel MultiDomain Fourier Wavelet Attention(MDFWA) that integrates frequency and time localized transforms to capture both global and local dependencies efficiently. We derive the complexity bounds, gradient formulas, and show that MDFWA achieves sub quadratic time and memory cost while improving expressive power. We validate our design on an abstractive summarization task using PubMed dataset, by enhancing the proposed approach with learned frequency bases, adaptive scale selection, and multi-modal extensions.
View on arXiv