The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. At the same time, modern GPUs and other specialized hardware accelerators are well optimized for processing short input sequences with transformers during both training and inference. A natural question arises: can we exploit the efficiency of small transformers to handle long input sequences?
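To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of vanilla single-head self-attention (an illustrative example, not code from the paper): the n × n score matrix it materializes is what makes both time and memory grow quadratically with the sequence length n.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Vanilla single-head self-attention over a length-n sequence.

    X: (n, d) input embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    The score matrix below has shape (n, n), which is why time and
    memory grow quadratically with the sequence length n.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d) output

# Doubling n quadruples the (n, n) score matrix: going from 1024 to
# 2048 tokens takes it from ~1M to ~4M entries.
rng = np.random.default_rng(0)
n, d = 1024, 64
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (1024, 64)
```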
@article{yu2025_2506.12220,
  title   = {Two Heads Are Better than One: Simulating Large Transformers with Small Ones},
  author  = {Hantao Yu and Josh Alman},
  journal = {arXiv preprint arXiv:2506.12220},
  year    = {2025}
}