Masked Mixers for Language Generation and Retrieval

2 September 2024
Benjamin L. Badger
ArXiv (abs) · PDF · HTML · GitHub (5★)

Papers citing "Masked Mixers for Language Generation and Retrieval"

28 papers
Know Your Limits: Entropy Estimation Modeling for Compression and Generalization
Benjamin L. Badger, Matthew Neligeorge
13 Nov 2025

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, ..., Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
04 Feb 2025

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
25 Jun 2024

Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriele Synnaeve
30 Apr 2024

Improving Text Embeddings with Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
31 Dec 2023

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
01 Dec 2023

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Neural Information Processing Systems (NeurIPS), 2023
Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré
18 Oct 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
18 Jul 2023

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
International Conference on Learning Representations (ICLR), 2023
Tri Dao
17 Jul 2023

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ronen Eldan, Yuan-Fang Li
12 May 2023

Hyena Hierarchy: Towards Larger Convolutional Language Models
International Conference on Machine Learning (ICML), 2023
Michael Poli, Stefano Massaroli, Eric Q. Nguyen, Daniel Y. Fu, Tri Dao, S. Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
21 Feb 2023

Why Deep Learning Generalizes
Benjamin L. Badger
17 Nov 2022

Depth and Representation in Vision Models
Benjamin L. Badger
11 Nov 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, ..., Zhongli Xie, Zifan Ye, M. Bras, Younes Belkada, Thomas Wolf
09 Nov 2022

Small Language Models for Tabular Data
Benjamin L. Badger
05 Nov 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022

pNLP-Mixer: an Efficient all-MLP Architecture for Language
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Francesco Fusco, Damian Pascual, Peter W. J. Staar, Diego Antognini
09 Feb 2022

8-bit Optimizers via Block-wise Quantization
Tim Dettmers, M. Lewis, Sam Shleifer, Luke Zettlemoyer
06 Oct 2021

Invertible Attention
Jiajun Zha, Yiran Zhong, Jing Zhang, Leonid Sigal, Liang Zheng
16 Jun 2021

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
Luke Melas-Kyriazi
06 May 2021

MLP-Mixer: An all-MLP Architecture for Vision
Neural Information Processing Systems (NeurIPS), 2021
Ilya O. Tolstikhin, N. Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, ..., Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
04 May 2021

RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
20 Apr 2021

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Neural Information Processing Systems (NeurIPS), 2019
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
03 Dec 2019

A mathematical theory of semantic development in deep neural networks
Andrew M. Saxe, James L. McClelland, Surya Ganguli
23 Oct 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
11 Oct 2018

Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Payal Bajaj, Daniel Fernando Campos, Nick Craswell, Li Deng, Jianfeng Gao, ..., Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang
28 Nov 2016

Understanding Deep Image Representations by Inverting Them
Computer Vision and Pattern Recognition (CVPR), 2014
Aravindh Mahendran, Andrea Vedaldi
26 Nov 2014