LipShiFT: A Certifiably Robust Shift-based Vision Transformer

Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during training and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper bound estimate for the Lipschitz constants of this model using the $\ell_2$ norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state-of-the-art in certified robustness for transformer-based architectures.
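The kind of upper bound the abstract refers to is, in its most common form, the layer-wise composition bound: the $\ell_2$ Lipschitz constant of a composed network is at most the product of the per-layer operator norms, and a logit margin exceeding $\sqrt{2} \cdot L \cdot \varepsilon$ certifies the prediction within an $\ell_2$ ball of radius $\varepsilon$ (the GloRo-style margin certificate). The sketch below illustrates this generic recipe in PyTorch; it is not the authors' implementation, and the power-iteration routine, the restriction to dense layers, and the $\sqrt{2}$ margin factor are stated assumptions (convolution, attention, and shift blocks each need their own per-layer operator-norm bounds).

```python
# A minimal sketch of a product-of-layer-norms Lipschitz upper bound and a
# margin-based robustness certificate. Assumptions: l2 norm, Linear layers
# only, 1-Lipschitz activations/shift ops (they contribute a factor of 1).
import torch
import torch.nn as nn


@torch.no_grad()
def spectral_norm(weight: torch.Tensor, n_iter: int = 50) -> torch.Tensor:
    """Power-iteration estimate of the largest singular value (l2 operator norm)."""
    w = weight.reshape(weight.shape[0], -1)
    u = torch.randn(w.shape[0])
    for _ in range(n_iter):
        v = w.t() @ u
        v = v / v.norm()
        u = w @ v
        u = u / u.norm()
    return u @ w @ v


@torch.no_grad()
def lipschitz_upper_bound(model: nn.Module) -> torch.Tensor:
    """Upper bound on the network's l2 Lipschitz constant as the product of
    per-layer operator norms; generally loose, but cheap and differentiable
    variants of it are what margin-training schemes regularize."""
    bound = torch.tensor(1.0)
    for m in model.modules():
        if isinstance(m, nn.Linear):
            bound = bound * spectral_norm(m.weight)
    return bound


def is_certified(logits: torch.Tensor, lip: torch.Tensor, eps: float) -> bool:
    """GloRo-style check: if the top logit beats the runner-up by more than
    sqrt(2) * L * eps, no l2 perturbation of size eps can flip the prediction
    (each logit difference f_i - f_j is at most sqrt(2) * L Lipschitz)."""
    top2 = logits.topk(2).values
    margin = (top2[0] - top2[1]).item()
    return margin > (2 ** 0.5) * lip.item() * eps
```

As a usage sketch, `lipschitz_upper_bound(model)` would be computed once after training, and `is_certified(model(x), lip, eps)` applied per test point to report certified accuracy at radius `eps`.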
@article{menon2025_2503.14751,
  title={LipShiFT: A Certifiably Robust Shift-based Vision Transformer},
  author={Rohan Menon and Nicola Franco and Stephan Günnemann},
  journal={arXiv preprint arXiv:2503.14751},
  year={2025}
}