On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning
Main: 13 pages, 10 figures; bibliography: 3 pages; appendix: 25 pages
Abstract
Transformer-based models demonstrate a remarkable capacity for in-context learning (ICL): they can adapt to unseen tasks from a few prompt examples, without any parameter updates. Notably, recent research has provided insight into how the Transformer architecture can perform ICL, showing that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent on a linear least-squares objective when trained on random linear regression tasks.
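To make the LSA-as-gradient-descent correspondence concrete, below is a minimal numerical sketch (not the paper's construction, but the standard one from the linear-ICL literature): a single linear self-attention layer with a hand-picked choice of key/query/value weights reproduces the prediction obtained after one gradient-descent step from zero initialization on the least-squares loss. The token layout, step size `eta`, and the weight matrices `P` and `WV` are illustrative assumptions, not details taken from this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20, 0.3

# Random linear regression task: y_i = w*^T x_i (illustrative setup).
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)  # query input

# --- One gradient-descent step from w0 = 0 on
#     L(w) = 1/(2n) * sum_i (w^T x_i - y_i)^2 ---
grad0 = -(X.T @ y) / n            # gradient of L at w = 0
w1 = -eta * grad0                 # w1 = (eta/n) * sum_i y_i x_i
pred_gd = w1 @ x_q

# --- The same prediction from one linear self-attention layer ---
# Context tokens e_i = (x_i, y_i); query token e_q = (x_q, 0).
E = np.hstack([X, y[:, None]])            # (n, d+1) context tokens
e_q = np.concatenate([x_q, [0.0]])        # label slot of the query zeroed

# Hypothetical weight construction realizing one GD step:
# keys/queries read only the x-part, so attention scores are x_i^T x_q;
# the value projection writes (eta/n) * y_i into the label slot.
P = np.zeros((d + 1, d + 1)); P[:d, :d] = np.eye(d)   # W_K = W_Q
WV = np.zeros((d + 1, d + 1)); WV[d, d] = eta / n     # value projection

scores = (E @ P.T) @ (P @ e_q)        # x_i^T x_q for each context token
update = (WV @ E.T) @ scores          # (eta/n) * sum_i y_i (x_i^T x_q)
pred_lsa = (e_q + update)[d]          # label slot of the updated query token

print(np.allclose(pred_gd, pred_lsa))  # True: LSA output == one GD step
```

Note that this sketch covers only the linear case summarized above; the paper's contribution concerns what changes once the target functions are nonlinear.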
