
Scaling Embedding Layers in Language Models

Main: 8 pages · 20 figures · Bibliography: 7 pages · 8 tables · Appendix: 12 pages
Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
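A minimal sketch of the idea described above, assuming that each token's cached embedding is looked up by the trailing n-gram of token ids and combined with the ordinary token embedding by simple addition; the class and method names (NGramEmbeddingCache, SconeInputEmbedding, lookup) and the additive combination are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch: token embeddings live on the accelerator, while a large
# table of precomputed n-gram embeddings is kept in off-accelerator (CPU) memory
# and fetched per position. All names here are illustrative, not SCONE's actual API.

import torch
import torch.nn as nn


class NGramEmbeddingCache:
    """Precomputed n-gram embedding table held in CPU memory."""

    def __init__(self, ngram_to_row: dict, table: torch.Tensor):
        self.ngram_to_row = ngram_to_row  # maps an n-gram (tuple of token ids) to a row index
        self.table = table                # shape (num_ngrams, d_model), stored on CPU

    def lookup(self, context: list, n: int) -> torch.Tensor:
        """Return the cached embedding for the trailing n-gram, or zeros if unseen."""
        row = self.ngram_to_row.get(tuple(context[-n:]))
        if row is None:
            return torch.zeros(self.table.shape[1])
        return self.table[row]


class SconeInputEmbedding(nn.Module):
    """Adds a cached n-gram embedding to the ordinary token embedding (assumed sum)."""

    def __init__(self, vocab_size: int, d_model: int, cache: NGramEmbeddingCache, n: int = 2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # standard input embedding layer
        self.cache = cache
        self.n = n

    def forward(self, token_ids: list) -> torch.Tensor:
        device = self.tok_emb.weight.device
        base = self.tok_emb(torch.tensor(token_ids, device=device))  # (seq, d_model)
        # One cached embedding per position, keyed by the n-gram ending at that position.
        extra = torch.stack([
            self.cache.lookup(token_ids[: i + 1], self.n) for i in range(len(token_ids))
        ]).to(device)
        return base + extra
```

Because the n-gram table sits in host memory and is only read at lookup time, enlarging it (or training it with a bigger separate model) leaves the accelerator FLOPS per decoded token unchanged, which is the scaling property the abstract highlights.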
