Robust Filter Attention: Self-Attention as a Parallel State Estimator
We introduce Robust Filter Attention (RFA), an attention mechanism that reformulates self-attention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. This formulation integrates key advantages of existing positional encodings: it preserves RoPE-style rotational structure while achieving long-context stability through explicit modeling of dissipation and diffusion. By imposing isotropic constraints on the dynamics and noise, RFA matches the time and memory complexity of standard attention. Empirically, we find that uncertainty-aware weighting induces attention heads to specialize into distinct filtering regimes, improving temporal consistency and extrapolation across varying context lengths.
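To make the idea of a time-dependent precision prior concrete, here is a minimal sketch in NumPy. It assumes an isotropic Ornstein-Uhlenbeck SDE prior (decay `lam` modeling dissipation, diffusion `sigma`, and an observation-noise floor `r`), analytically propagates the variance over the query-key time gap, and adds the resulting log-precision to standard scaled dot-product logits so stale, high-uncertainty keys are down-weighted. All names, parameter values, and the exact way precision enters the logits are assumptions for illustration; they are not specified by the abstract, and RoPE-style rotation of queries and keys is omitted for brevity.

```python
# Hypothetical sketch of uncertainty-weighted attention under an isotropic
# Ornstein-Uhlenbeck (OU) prior dx = -lam * x dt + sigma dW. Not the paper's
# exact formulation; parameterization and logit combination are assumptions.
import numpy as np

def ou_precision(dt, lam=0.5, sigma=1.0, r=0.1):
    """Precision (inverse variance) of a key propagated forward by time gap dt.
    The OU variance grows with dt and saturates at sigma^2 / (2 * lam);
    r is an assumed per-observation measurement-noise floor."""
    var = (sigma**2 / (2.0 * lam)) * (1.0 - np.exp(-2.0 * lam * dt)) + r
    return 1.0 / var

def rfa_attention(q, k, v, t):
    """q, k, v: (n, d) arrays; t: (n,) monotone timestamps.
    Causal attention whose logits are biased by the log-precision of each
    propagated key, so distant (high-uncertainty) keys receive less weight."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)                 # standard content score
    dt = t[:, None] - t[None, :]                  # time gap from key to query
    causal = dt >= 0
    prec = ou_precision(np.maximum(dt, 0.0))      # analytic precision prior
    logits = np.where(causal, logits + np.log(prec), -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Toy usage: 8 tokens, 4-dim head, unit-spaced timestamps.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = rfa_attention(q, k, v, t=np.arange(8.0))
```

Because the bias depends only on the time gap and a few scalar parameters per head, this weighting adds no asymptotic cost over standard attention, consistent with the complexity claim above.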
View on arXiv