Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

20 July 2024

Antonis Antoniades


Main: 10 Pages

14 Figures

Bibliography: 4 Pages

3 Tables

Appendix: 7 Pages

Abstract

Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training data. Our experiments focus on three general task types: translation, question-answering, and multiple-choice reasoning. With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.
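To make the notion of task-relevant $n$-gram pairs concrete, the following is a minimal sketch and not the paper's actual pipeline, which the abstract does not specify. It assumes that a "pair" means an $n$-gram from a task input together with an $n$-gram from the corresponding task output, and that a pair counts as present in a pretraining document when both $n$-grams appear in it. The function names `ngrams` and `count_pair_cooccurrences`, the whitespace tokenization, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import product


def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_pair_cooccurrences(task_examples, corpus_docs, n=2):
    """For each (input n-gram, output n-gram) pair taken from the task examples,
    count how many corpus documents contain both n-grams.

    Assumption: this document-level co-occurrence is only one plausible reading
    of "task-relevant n-gram pair data"; the paper may define it differently.
    """
    # Collect every candidate pair from the task's (input, output) examples.
    pairs = set()
    for task_input, task_output in task_examples:
        pairs.update(product(ngrams(task_input, n), ngrams(task_output, n)))

    # Count the documents in which both halves of a pair occur.
    counts = Counter()
    for doc in corpus_docs:
        doc_ngrams = ngrams(doc, n)
        for src, tgt in pairs:
            if src in doc_ngrams and tgt in doc_ngrams:
                counts[(src, tgt)] += 1
    return counts


# Toy translation-style example (invented data, for illustration only).
task_examples = [("good morning", "bonjour monsieur")]
corpus_docs = [
    "she said good morning and he replied bonjour monsieur",
    "good morning everyone",
    "bonjour monsieur good morning",
]
counts = count_pair_cooccurrences(task_examples, corpus_docs, n=2)
pair = (("good", "morning"), ("bonjour", "monsieur"))
print(counts[pair])  # 2: the first and third documents contain both bigrams
```

A brute-force scan like this is only practical for toy data; an analysis at pretraining-corpus scale would need an indexed search structure, which is beyond the scope of this sketch.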
