
ENTP: Encoder-only Next Token Prediction

Abstract

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the Count3 task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.
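To make the contrast with decoder-only generation concrete, below is a minimal PyTorch sketch of what encoder-only next-token prediction looks like at inference time; the component names (`embed`, `encoder`, `lm_head`) and hyperparameters are illustrative assumptions, not taken from the paper. Because attention is bidirectional, the prefix must be re-encoded from scratch at every step, so there are no keys or values to cache as there would be with causal attention.

```python
import torch
import torch.nn as nn

# Illustrative components (not from the paper): a bidirectional
# Transformer encoder with an embedding layer and a linear LM head.
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(d_model, vocab_size)

def entp_generate(prefix: torch.Tensor, steps: int) -> torch.Tensor:
    """Greedy next-token prediction with an encoder-only model.

    Attention is bidirectional, so the full prefix is re-encoded at
    every step; no key/value reuse is possible.
    """
    tokens = prefix.clone()  # shape (1, t)
    for _ in range(steps):
        h = encoder(embed(tokens))       # full bidirectional pass over the prefix
        logits = lm_head(h[:, -1, :])    # predict the next token from the last position
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

# Example: extend a 5-token prefix by 3 greedily decoded tokens.
out = entp_generate(torch.randint(0, vocab_size, (1, 5)), steps=3)
```

Under this sketch, each step costs roughly quadratic attention in the current prefix length, so generating T tokens costs on the order of T^3 attention operations in total, versus roughly T^2 for a decoder-only model with cached keys and values; this is the compute trade-off the abstract alludes to when it asks what one should do when not compute-limited.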

@article{ewer2025_2410.01600,
  title={ENTP: Encoder-only Next Token Prediction},
  author={Ethan Ewer and Daewon Chae and Thomas Zeng and Jinkyu Kim and Kangwook Lee},
  journal={arXiv preprint arXiv:2410.01600},
  year={2025}
}