Masked Mixers for Language Generation and Retrieval

2 September 2024
Benjamin L. Badger
ArXiv (abs) · PDF · HTML · GitHub (5★)

Papers citing "Masked Mixers for Language Generation and Retrieval"

28 papers
Know Your Limits: Entropy Estimation Modeling for Compression and Generalization
Benjamin L. Badger, Matthew Neligeorge
13 Nov 2025

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, ..., Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
04 Feb 2025

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
25 Jun 2024

Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriele Synnaeve
30 Apr 2024

Improving Text Embeddings with Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
31 Dec 2023

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
01 Dec 2023

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Neural Information Processing Systems (NeurIPS), 2023
Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré
18 Oct 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
18 Jul 2023

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
International Conference on Learning Representations (ICLR), 2023
Tri Dao
17 Jul 2023

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ronen Eldan, Yuan-Fang Li
12 May 2023

Hyena Hierarchy: Towards Larger Convolutional Language Models
International Conference on Machine Learning (ICML), 2023
Michael Poli, Stefano Massaroli, Eric Q. Nguyen, Daniel Y. Fu, Tri Dao, S. Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
21 Feb 2023

Why Deep Learning Generalizes
Benjamin L. Badger
17 Nov 2022

Depth and Representation in Vision Models
Benjamin L. Badger
11 Nov 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, ..., Zhongli Xie, Zifan Ye, M. Bras, Younes Belkada, Thomas Wolf
09 Nov 2022

Small Language Models for Tabular Data
Benjamin L. Badger
05 Nov 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022

pNLP-Mixer: an Efficient all-MLP Architecture for Language
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Francesco Fusco, Damian Pascual, Peter W. J. Staar, Diego Antognini
09 Feb 2022

8-bit Optimizers via Block-wise Quantization
Tim Dettmers, M. Lewis, Sam Shleifer, Luke Zettlemoyer
06 Oct 2021

Invertible Attention
Jiajun Zha, Yiran Zhong, Jing Zhang, Leonid Sigal, Liang Zheng
16 Jun 2021

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
Luke Melas-Kyriazi
06 May 2021

MLP-Mixer: An all-MLP Architecture for Vision
Neural Information Processing Systems (NeurIPS), 2021
Ilya O. Tolstikhin, N. Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, ..., Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
04 May 2021

RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
20 Apr 2021

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Neural Information Processing Systems (NeurIPS), 2019
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
03 Dec 2019

A mathematical theory of semantic development in deep neural networks
Andrew M. Saxe, James L. McClelland, Surya Ganguli
23 Oct 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
11 Oct 2018

Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Payal Bajaj, Daniel Fernando Campos, Nick Craswell, Li Deng, Jianfeng Gao, ..., Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang
28 Nov 2016

Understanding Deep Image Representations by Inverting Them
Computer Vision and Pattern Recognition (CVPR), 2014
Aravindh Mahendran, Andrea Vedaldi
26 Nov 2014