Pretraining Without Attention

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

20 December 2022


Abstract

Transformers have been essential to pretraining success in NLP. Other architectures have been used, but they require attention layers to match benchmark accuracy. This work explores pretraining without attention. We test recently developed routing layers based on state-space models (SSMs) and model architectures based on multiplicative gating. Used together, these modeling choices have a large impact on pretraining accuracy. Empirically, the proposed Bidirectional Gated SSM (BiGS) replicates BERT pretraining results without attention and can be extended to long-form pretraining of 4096 tokens without approximation.
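
The abstract names two ingredients, SSM-based routing and multiplicative gating, without giving the layer equations. As a rough illustration only, the sketch below combines a per-channel linear state-space recurrence, run in both directions, with an elementwise gate. It is not the paper's BiGS layer: the function names (ssm_scan, bidirectional_ssm, gated_block), the diagonal parameterization, the sigmoid gate, and the summing of the two directions are all assumptions made for this example.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear state-space recurrence (illustrative assumption).

    x: (seq_len, dim) inputs; a, b, c: (dim,) per-channel parameters.
    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    """
    h = np.zeros(x.shape[1])
    y = np.zeros(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def bidirectional_ssm(x, fwd, bwd):
    """Sum a left-to-right scan and a right-to-left scan.

    fwd and bwd are (a, b, c) tuples; summing the two directions is an
    assumption of this sketch.
    """
    y_fwd = ssm_scan(x, *fwd)
    y_bwd = ssm_scan(x[::-1], *bwd)[::-1]
    return y_fwd + y_bwd

def gated_block(x, w_gate, w_value, fwd, bwd):
    """Multiplicative gating: a sigmoid gate scales the SSM-routed values."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    value = bidirectional_ssm(x @ w_value, fwd, bwd)
    return gate * value

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
seq_len, dim = 8, 4
x = rng.normal(size=(seq_len, dim))
params = (np.full(dim, 0.9), np.ones(dim), np.ones(dim))
out = gated_block(x, rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)),
                  params, params)
print(out.shape)
```

In this sketch every position receives information from the whole sequence through the two scans, with no pairwise attention scores. The scan cost grows linearly with sequence length, which is one reason recurrences of this kind are attractive for long inputs.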

