
Robust Filter Attention: Self-Attention as a Parallel State Estimator

Main: 6 pages · 17 figures · 3 tables · Bibliography: 4 pages · Appendix: 32 pages
Abstract

We introduce Robust Filter Attention (RFA), an attention mechanism that reformulates self-attention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. This formulation integrates key advantages of existing positional encodings: it preserves RoPE-style rotational structure while achieving long-context stability through explicit modeling of dissipation and diffusion. By imposing isotropic constraints on the dynamics and noise, RFA matches the $O(N^2 d)$ time and $O(N^2 + Nd)$ memory complexity of standard attention. Empirically, we find that uncertainty-aware weighting induces specialization into distinct filtering regimes across heads, improving temporal consistency and extrapolation across varying context lengths.
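To make the core idea concrete, here is a minimal sketch of uncertainty-aware attention weighting under an isotropic latent SDE prior. This is an illustration under assumptions, not the paper's implementation: it uses a scalar Ornstein-Uhlenbeck process (decay rate `lam`, noise intensity `q_noise`, both hypothetical parameter names) whose closed-form variance over a time lag penalizes attention logits for distant, high-uncertainty positions.

```python
import numpy as np

def rfa_sketch(Q, K, V, lam=0.1, q_noise=1.0):
    """Illustrative sketch (not the paper's exact method): scaled dot-product
    attention whose logits are penalized by the analytically propagated
    variance of an isotropic Ornstein-Uhlenbeck (OU) SDE."""
    N, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    # Relative time lag between query position i and key position j.
    dt = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
    # Closed-form OU variance over lag dt: q/(2*lam) * (1 - exp(-2*lam*dt));
    # it grows with the lag and saturates at the stationary value q/(2*lam).
    var = q_noise / (2.0 * lam) * (1.0 - np.exp(-2.0 * lam * dt)) + 1e-6
    # Precision prior: subtract log-variance so uncertain (distant) keys
    # receive lower attention weight before normalization.
    logits = logits - 0.5 * np.log(var)
    # Numerically stable softmax over keys.
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V
```

Because the penalty depends only on the lag `|i - j|` and the constants `lam` and `q_noise`, it adds only an elementwise bias to the $N \times N$ logit matrix, which is consistent with the abstract's claim that the isotropic construction keeps the time and memory complexity of standard attention.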
