The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. At the same time, modern GPUs and other specialized hardware accelerators are well optimized for processing short input sequences with transformers during both training and inference. A natural question arises: can we exploit the efficiency of small transformers to handle long input sequences?
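To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of vanilla single-head self-attention (an illustrative example, not code from the paper): the n × n score matrix it materializes is what makes both time and memory grow quadratically with the sequence length n.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Vanilla single-head self-attention over a length-n sequence.

    X: (n, d) input embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    The score matrix below has shape (n, n), which is why time and
    memory grow quadratically with the sequence length n.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d) output

# Doubling n quadruples the (n, n) score matrix: going from 1024 to
# 2048 tokens takes it from ~1M to ~4M entries.
rng = np.random.default_rng(0)
n, d = 1024, 64
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (1024, 64)
```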
@article{yu2025_2506.12220,
  title   = {Two Heads Are Better than One: Simulating Large Transformers with Small Ones},
  author  = {Hantao Yu and Josh Alman},
  journal = {arXiv preprint arXiv:2506.12220},
  year    = {2025}
}