arXiv:2310.10837
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
16 October 2023 · MoE
Papers citing "Approximating Two-Layer Feedforward Networks for Efficient Transformers" (5 papers)
Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Tam Minh Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk
MoE · 01 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE, VLM · 01 May 2025

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML, MILM · 21 Sep 2022

Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts
W. Kool, Chris J. Maddison, A. Mnih
24 Sep 2021

A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
06 Jun 2016