On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning
Main: 13 pages, 10 figures; bibliography: 3 pages; appendix: 25 pages
Abstract
Transformer-based models demonstrate a remarkable capacity for in-context learning (ICL): they can adapt to unseen tasks from a few prompt examples, without any parameter updates. Notably, recent research has provided insight into how the Transformer architecture can perform ICL, showing that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent on a linear least-squares objective when trained on random linear regression tasks.
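To make the LSA-as-gradient-descent correspondence concrete, below is a minimal numerical sketch (not the paper's construction, but the standard one from the linear-ICL literature): a single linear self-attention layer with a hand-picked choice of key/query/value weights reproduces the prediction obtained after one gradient-descent step from zero initialization on the least-squares loss. The token layout, step size `eta`, and the weight matrices `P` and `WV` are illustrative assumptions, not details taken from this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20, 0.3

# Random linear regression task: y_i = w*^T x_i (illustrative setup).
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)  # query input

# --- One gradient-descent step from w0 = 0 on
#     L(w) = 1/(2n) * sum_i (w^T x_i - y_i)^2 ---
grad0 = -(X.T @ y) / n            # gradient of L at w = 0
w1 = -eta * grad0                 # w1 = (eta/n) * sum_i y_i x_i
pred_gd = w1 @ x_q

# --- The same prediction from one linear self-attention layer ---
# Context tokens e_i = (x_i, y_i); query token e_q = (x_q, 0).
E = np.hstack([X, y[:, None]])            # (n, d+1) context tokens
e_q = np.concatenate([x_q, [0.0]])        # label slot of the query zeroed

# Hypothetical weight construction realizing one GD step:
# keys/queries read only the x-part, so attention scores are x_i^T x_q;
# the value projection writes (eta/n) * y_i into the label slot.
P = np.zeros((d + 1, d + 1)); P[:d, :d] = np.eye(d)   # W_K = W_Q
WV = np.zeros((d + 1, d + 1)); WV[d, d] = eta / n     # value projection

scores = (E @ P.T) @ (P @ e_q)        # x_i^T x_q for each context token
update = (WV @ E.T) @ scores          # (eta/n) * sum_i y_i (x_i^T x_q)
pred_lsa = (e_q + update)[d]          # label slot of the updated query token

print(np.allclose(pred_gd, pred_lsa))  # True: LSA output == one GD step
```

Note that this sketch covers only the linear case summarized above; the paper's contribution concerns what changes once the target functions are nonlinear.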
