
Scaling Embedding Layers in Language Models

Main: 8 pages · 20 figures · 8 tables · Bibliography: 7 pages · Appendix: 12 pages
Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
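To make the inference-time mechanism concrete, the sketch below illustrates the idea the abstract describes: token embeddings stay on the accelerator, while precomputed n-gram embeddings live in off-accelerator (host) memory and are fetched by cheap table lookups and added to the token embeddings. This is a minimal illustration assuming a simple longest-match lookup; all names, shapes, and the in-memory n-gram index are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of off-accelerator n-gram
# embedding lookup at inference time. Assumptions: a small toy vocabulary,
# a NumPy array standing in for a memory-mapped embedding store, and a
# hand-built index of "frequent" n-grams.
import numpy as np
import torch

VOCAB_SIZE, D_MODEL, MAX_N = 32_000, 1024, 3

# Accelerator-resident token embedding table (part of the base model).
token_embedding = torch.nn.Embedding(VOCAB_SIZE, D_MODEL)

# Off-accelerator store of precomputed n-gram embeddings. In practice this
# would be a large memory-mapped file in host memory; here a small random
# array keeps the example self-contained.
ngram_rows = np.random.rand(4, D_MODEL).astype(np.float32)
ngram_index = {(17, 42): 0, (17, 42, 99): 1}  # toy index: n-gram -> row id

def contextualized_embeddings(token_ids: list[int]) -> torch.Tensor:
    """Per-token input embeddings: token embedding plus the embedding of the
    longest frequent n-gram ending at that position (zero if none matches)."""
    base = token_embedding(torch.tensor(token_ids))
    extra = torch.zeros_like(base)
    for pos in range(len(token_ids)):
        for n in range(MAX_N, 1, -1):  # prefer the longest matching n-gram
            if pos - n + 1 < 0:
                continue
            key = tuple(token_ids[pos - n + 1 : pos + 1])
            row = ngram_index.get(key)
            if row is not None:
                extra[pos] = torch.from_numpy(ngram_rows[row].copy())
                break
    return base + extra

# Example: the bigram (17, 42) and trigram (17, 42, 99) trigger lookups.
embs = contextualized_embeddings([5, 17, 42, 99])
print(embs.shape)  # torch.Size([4, 1024])
```

Because the lookup is a dictionary/array access rather than a matrix multiply, scaling the number of stored n-gram embeddings grows host memory, not accelerator FLOPS or memory, which is the property the scaling strategies rely on.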
