
The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Main: 8 Pages
17 Figures
Bibliography: 3 Pages
9 Tables
Appendix: 11 Pages
Abstract

Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6× higher accuracy compared to fine-tuned baselines -- reaching 53.0% on the public validation set with an 8B-parameter LM and 61.9% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5% to 57.8%). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
