
Scaling Embedding Layers in Language Models

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
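To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of how a precomputed n-gram embedding table might augment a standard token embedding layer: each position looks up the cached embedding of the frequent n-gram ending at that token and adds it to the token embedding. The names (NGramCache, SconeStyleEmbedding, lookup), the choice of PyTorch, and the zero-vector fallback for uncached n-grams are all illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; names and fallback behavior are assumptions.
import torch
import torch.nn as nn


class NGramCache:
    """Maps the n-gram ending at each position to a row of a precomputed table.

    In SCONE the table is produced offline by a separate embedding model and can
    be held in off-accelerator (host) memory; here it is a dict plus a tensor.
    """

    def __init__(self, ngram_to_row: dict[tuple[int, ...], int], table: torch.Tensor):
        self.ngram_to_row = ngram_to_row
        self.table = table  # (num_cached_ngrams, d_model), float32

    def lookup(self, token_ids: list[int], n: int = 3) -> torch.Tensor:
        rows = []
        for i in range(len(token_ids)):
            ngram = tuple(token_ids[max(0, i - n + 1): i + 1])
            row = self.ngram_to_row.get(ngram)
            # Uncached n-grams fall back to a zero vector (an assumption here).
            rows.append(self.table[row] if row is not None
                        else torch.zeros(self.table.shape[1]))
        return torch.stack(rows)  # (seq_len, d_model)


class SconeStyleEmbedding(nn.Module):
    """Token embedding plus a cached contextualized n-gram embedding per position."""

    def __init__(self, vocab_size: int, d_model: int, cache: NGramCache):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.cache = cache

    def forward(self, token_ids: list[int]) -> torch.Tensor:
        ids = torch.tensor(token_ids)
        # The n-gram part is a table lookup, so it adds no FLOPS at inference time.
        ngram_part = self.cache.lookup(token_ids)
        return self.tok_emb(ids) + ngram_part

In the actual method, the table is built during training by a separate f-gram embedding model and served from off-accelerator memory at inference; the sketch only shows where the cached vectors would enter the forward pass.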

@article{yu2025_2502.01637,
  title={Scaling Embedding Layers in Language Models},
  author={Da Yu and Edith Cohen and Badih Ghazi and Yangsibo Huang and Pritish Kamath and Ravi Kumar and Daogao Liu and Chiyuan Zhang},
  journal={arXiv preprint arXiv:2502.01637},
  year={2025}
}