arXiv:2006.04862 (v2, latest)
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
8 June 2020
Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar
Papers citing "$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers"
48 of 48 papers shown
Rectifying LLM Thought from Lens of Optimization. J. Liu, Hongwei Liu, Songyang Zhang, Kai Chen. 01 Dec 2025.
On the Capacity of Self-Attention. Micah Adler. 26 Sep 2025.
Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice. Ran Li, Lingshu Zeng. 02 Aug 2025.
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers. Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwaśniewski, Yi Zhu, Haruto Fujii, ..., Maciej Besta, Kentaro Katayama, Takumi Honda, Yusuke Nagasaka, Torsten Hoefler. 03 Jul 2025.
Two Heads Are Better than One: Simulating Large Transformers with Small Ones. Hantao Yu, Josh Alman. 13 Jun 2025.
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration. Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng. Annual Meeting of the Association for Computational Linguistics (ACL), 2025. 06 Jun 2025.
Approximation Rate of the Transformer Architecture for Sequence Modeling. Hao Jiang, Qianxiao Li. Neural Information Processing Systems (NeurIPS), 2023. 03 Jan 2025.
How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs. Guhao Feng, Kai-Bo Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Hao Sun, Liwei Wang. Annual Meeting of the Association for Computational Linguistics (ACL), 2024. 17 Oct 2024.
Snuffy: Efficient Whole Slide Image Classifier. Hossein Jafarinia, Alireza Alipanah, Danial Hamdi, Saeed Razavi, Nahal Mirzaie, M. Rohban. European Conference on Computer Vision (ECCV), 2024. 15 Aug 2024.
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads. Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, Christian Kummerle, Srijan Das, A. Dutta. 27 Jun 2024.
FrameQuant: Flexible Low-Bit Quantization for Transformers. Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh. International Conference on Machine Learning (ICML), 2024. 10 Mar 2024.
Transformers are Expressive, But Are They Expressive Enough for Regression? Swaroop Nath, H. Khadilkar, Pushpak Bhattacharyya. 23 Feb 2024.
Implicit Bias and Fast Convergence Rates for Self-attention. Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis. 08 Feb 2024.
Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models. Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, A. Eshaghi. 03 Feb 2024.
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey. Yunpeng Huang, Jingwei Xu, Junyu Lai, Zixu Jiang, Taolue Chen, ..., Xiaoxing Ma, Lijuan Yang, Zhou Xin, Shupeng Li, Penghao Zhao. 21 Nov 2023.
The Expressive Power of Low-Rank Adaptation. Yuchen Zeng, Kangwook Lee. International Conference on Learning Representations (ICLR), 2023. 26 Oct 2023.
On the Optimization and Generalization of Multi-head Attention. Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. 19 Oct 2023.
Do Generative Large Language Models need billions of parameters? Sia Gholami, Marwan Omar. 12 Sep 2023.
Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? T. Kajitsuka, Issei Sato. International Conference on Learning Representations (ICLR), 2023. 26 Jul 2023.
Trained Transformers Learn Linear Models In-Context. Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. Journal of Machine Learning Research (JMLR), 2023. 16 Jun 2023.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann. Neural Information Processing Systems (NeurIPS), 2023. 25 May 2023.
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, Liwei Wang. Neural Information Processing Systems (NeurIPS), 2023. 24 May 2023.
Sampled Transformer for Point Sets. Shidi Li, Christian J. Walder, Alexander Soen, Lexing Xie, Miaomiao Liu. 28 Feb 2023.
A Brief Survey on the Approximation Theory for Sequence Modelling. Hao Jiang, Qianxiao Li, Zhong Li, Shida Wang. Journal of Machine Learning (JML), 2023. 27 Feb 2023.
One Fits All: Power General Time Series Analysis by Pretrained LM. Tian Zhou, Peisong Niu, Qingsong Wen, Liang Sun, Rong Jin. Neural Information Processing Systems (NeurIPS), 2023. 23 Feb 2023.
Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers. K. Choromanski, Shanda Li, Valerii Likhosherstov, Kumar Avinava Dubey, Shengjie Luo, Di He, Yiming Yang, Tamás Sarlós, Thomas Weingarten, Adrian Weller. International Conference on Artificial Intelligence and Statistics (AISTATS), 2023. 03 Feb 2023.
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost. Sungjun Cho, Seonwoo Min, Jinwoo Kim, Moontae Lee, Honglak Lee, Seunghoon Hong. Neural Information Processing Systems (NeurIPS), 2022. 27 Oct 2022.
Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences. Aosong Feng, Irene Li, Yuang Jiang, Rex Ying. AAAI Conference on Artificial Intelligence (AAAI), 2022. 21 Oct 2022.
Treeformer: Dense Gradient Trees for Efficient Attention Computation. Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain. International Conference on Learning Representations (ICLR), 2022. 18 Aug 2022.
Your Transformer May Not be as Powerful as You Expect. Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He. Neural Information Processing Systems (NeurIPS), 2022. 26 May 2022.
Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. Arda Sahiner, Tolga Ergen, Batu Mehmet Ozturkler, John M. Pauly, Morteza Mardani, Mert Pilanci. International Conference on Machine Learning (ICML), 2022. 17 May 2022.
Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice. Andreas Grivas, Nikolay Bogoychev, Adam Lopez. Annual Meeting of the Association for Computational Linguistics (ACL), 2022. 12 Mar 2022.
Attention Enables Zero Approximation Error. Zhiying Fang, Yidong Ouyang, Ding-Xuan Zhou, Guang Cheng. 24 Feb 2022.
Revisiting Over-smoothing in BERT from the Perspective of Graph. Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok. International Conference on Learning Representations (ICLR), 2022. 17 Feb 2022.
Can Vision Transformers Perform Convolution? Shanda Li, Xiangning Chen, Di He, Cho-Jui Hsieh. 02 Nov 2021.
Leveraging redundancy in attention with Reuse Transformers. Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar. 13 Oct 2021.
Universal Approximation Under Constraints is Possible with Transformers. Anastasis Kratsios, Behnoosh Zamanlooy, Tianlin Liu, Ivan Dokmanić. 07 Oct 2021.
Continuous Streaming Multi-Talker ASR with Dual-path Transducers. Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li. 17 Sep 2021.
MATE: Multi-view Attention for Table Transformer Efficiency. Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, William W. Cohen. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. 09 Sep 2021.
Combiner: Full Attention Transformer with Sparse Computation Cost. Hongyu Ren, H. Dai, Zihang Dai, Mengjiao Yang, J. Leskovec, Dale Schuurmans, Bo Dai. 12 Jul 2021.
Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation. Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar, Michal Lukasik, Andreas Veit. 16 Jun 2021.
Rethinking Graph Transformers with Spectral Attention. Devin Kreuzer, Dominique Beaini, William L. Hamilton, Vincent Létourneau, Prudencio Tossou. Neural Information Processing Systems (NeurIPS), 2021. 07 Jun 2021.
On the Expressive Power of Self-Attention Matrices. Valerii Likhosherstov, K. Choromanski, Adrian Weller. 07 Jun 2021.
Learning and Generalization in RNNs. A. Panigrahi, Navin Goyal. Neural Information Processing Systems (NeurIPS), 2021. 31 May 2021.
SparseBERT: Rethinking the Importance Analysis in Self-attention. Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok. International Conference on Machine Learning (ICML), 2021. 25 Feb 2021.
A Survey on Visual Transformer. Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, ..., Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020. 23 Dec 2020.
Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler. ACM Computing Surveys (ACM CSUR), 2020. 14 Sep 2020.
Big Bird: Transformers for Longer Sequences. Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Neural Information Processing Systems (NeurIPS), 2020. 28 Jul 2020.