Two Heads Are Better than One: Simulating Large Transformers with Small Ones

13 June 2025
Hantao Yu
Josh Alman
Main: 9 pages · 1 figure · Bibliography: 4 pages · Appendix: 12 pages
Abstract

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing short input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M \ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios, including average-case inputs, sliding-window masking, and attention sinks, the optimal number $O(N/M)$ of small transformers suffices.
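
The paper's actual construction is not reproduced here. As a rough illustration of the counting in the abstract only, the following numpy sketch shows one standard way (blockwise attention with online-softmax merging, not necessarily the authors' method) that exact length-$N$ self-attention can be assembled from $(N/M)^2$ small attention calls, each touching a single length-$M$ query block and a single length-$M$ key/value block. The names small_attention and simulate_large_attention are illustrative, not from the paper.

import numpy as np

def small_attention(q, k, v):
    # Attention restricted to one (query block, key/value block) pair.
    # Returns the unnormalized block output plus the softmax statistics
    # (row max, row sum) needed to merge blocks exactly.
    s = q @ k.T / np.sqrt(q.shape[-1])          # (M, M) score block
    m = s.max(axis=-1, keepdims=True)           # per-row max, for stability
    p = np.exp(s - m)                           # unnormalized weights
    return p @ v, m, p.sum(axis=-1, keepdims=True)

def simulate_large_attention(Q, K, V, M):
    # Exact length-N attention assembled from (N/M)**2 small calls,
    # merged with the standard log-sum-exp (online softmax) update.
    N, _ = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full((N, 1), -np.inf)
    row_sum = np.zeros((N, 1))
    for i in range(0, N, M):                    # query block
        for j in range(0, N, M):                # key/value block
            o, m, l = small_attention(Q[i:i+M], K[j:j+M], V[j:j+M])
            qi = slice(i, i + M)
            new_max = np.maximum(row_max[qi], m)
            a = np.exp(row_max[qi] - new_max)   # rescale old accumulator
            b = np.exp(m - new_max)             # rescale incoming block
            out[qi] = out[qi] * a + o * b
            row_sum[qi] = row_sum[qi] * a + l * b
            row_max[qi] = new_max
    return out / row_sum

# Sanity check against a single monolithic length-N attention.
N, M, d = 16, 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=-1, keepdims=True))
reference = (P / P.sum(axis=-1, keepdims=True)) @ V
assert np.allclose(simulate_large_attention(Q, K, V, M), reference)

The outer double loop makes exactly $(N/M)^2$ small calls, matching the worst-case bound quoted in the abstract; the $O(N/M)$ regimes discussed there roughly correspond to settings where most (query block, key block) pairs can be skipped, as with sliding-window masking.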

@article{yu2025_2506.12220,
  title={Two Heads Are Better than One: Simulating Large Transformers with Small Ones},
  author={Hantao Yu and Josh Alman},
  journal={arXiv preprint arXiv:2506.12220},
  year={2025}
}