
Two heads are better than one: simulating large transformers with small ones

13 June 2025
Hantao Yu
Josh Alman
Main: 9 pages · 1 figure · Bibliography: 4 pages · Appendix: 12 pages
Abstract

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences?
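As a rough illustration of the quadratic cost the abstract refers to, here is a minimal NumPy sketch (not from the paper; the function name, weight shapes, and random inputs are illustrative) of single-head self-attention. The n × n score matrix is where the quadratic time and memory arise, so doubling the sequence length roughly quadruples that cost.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a length-n sequence X of shape (n, d).

    The scores matrix below has shape (n, n), which is the source of the
    quadratic time and memory cost mentioned in the abstract.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) -> O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # (n, d_v) output

# Illustrative sizes only: a 1024-token sequence with 64-dimensional heads.
n, d = 1024, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (1024, 64)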

@article{yu2025_2506.12220,
  title={Two Heads Are Better than One: Simulating Large Transformers with Small Ones},
  author={Hantao Yu and Josh Alman},
  journal={arXiv preprint arXiv:2506.12220},
  year={2025}
}