Scaling Embedding Layers in Language Models

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects enables SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
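
Below is a minimal sketch, not the authors' implementation, of the inference-time lookup the abstract describes: each token keeps its ordinary embedding, and the precomputed embedding of the longest cached frequent n-gram ending at that token is fetched from off-accelerator (host) memory and combined with it. The names `ngram_table`, `MAX_N`, and the additive combination are illustrative assumptions.

```python
import torch

d_model = 768
MAX_N = 3  # assumed maximum n-gram length held in the cache

# Precomputed n-gram embeddings kept in host (CPU) memory, keyed by token-id tuples.
# The toy entries below stand in for the real, much larger cache.
ngram_table: dict[tuple[int, ...], torch.Tensor] = {
    (5, 17): torch.randn(d_model),
    (5, 17, 42): torch.randn(d_model),
}
token_embedding = torch.nn.Embedding(50_000, d_model)  # ordinary input embedding layer

def embed_with_ngrams(token_ids: list[int]) -> torch.Tensor:
    """Return per-token embeddings augmented with cached n-gram embeddings."""
    base = token_embedding(torch.tensor(token_ids))  # (seq_len, d_model)
    extra = torch.zeros_like(base)
    for i in range(len(token_ids)):
        # Longest-match lookup over n-grams (n >= 2) ending at position i.
        for n in range(min(MAX_N, i + 1), 1, -1):
            key = tuple(token_ids[i - n + 1 : i + 1])
            if key in ngram_table:
                extra[i] = ngram_table[key]  # fetched from off-accelerator memory
                break
    return base + extra

print(embed_with_ngrams([3, 5, 17, 42]).shape)  # torch.Size([4, 768])
```

Because the table lookup is a memory fetch rather than extra matrix multiplication, enlarging the n-gram cache (or the model used to train it) does not change the FLOPS spent per decoded token, which is the property the scaling argument relies on.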
@article{yu2025_2502.01637,
  title   = {Scaling Embedding Layers in Language Models},
  author  = {Da Yu and Edith Cohen and Badih Ghazi and Yangsibo Huang and Pritish Kamath and Ravi Kumar and Daogao Liu and Chiyuan Zhang},
  journal = {arXiv preprint arXiv:2502.01637},
  year    = {2025}
}