The Depth-to-Width Interplay in Self-Attention
arXiv:2006.12467 · 22 June 2020
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua

Papers citing "The Depth-to-Width Interplay in Self-Attention"

30 / 30 papers shown
• Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
  Thomson Yen, Andrew Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong
  26 Mar 2025
• NeoBERT: A Next-Generation BERT
  Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
  AI4TS · 26 Feb 2025
• Distributional Scaling Laws for Emergent Capabilities
  Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
  LRM · 24 Feb 2025
• Approximation Rate of the Transformer Architecture for Sequence Modeling
  Hao Jiang, Qianxiao Li
  03 Jan 2025
• Introducing Hybrid Modeling with Time-series-Transformers: A Comparative Study of Series and Parallel Approach in Batch Crystallization
  Niranjan Sitapure, J. Kwon
  25 Jul 2023
• A Comprehensive Overview of Large Language Models
  Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Saeed Mian
  OffRL · 12 Jul 2023
• Max-Margin Token Selection in Attention Mechanism
  Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak
  23 Jun 2023
• CrystalGPT: Enhancing system-to-system transferability in crystallization prediction and control using time-series-transformers
  Niranjan Sitapure, J. Kwon
  31 May 2023
• BloombergGPT: A Large Language Model for Finance
  Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, P. Kambadur, David S. Rosenberg, Gideon Mann
  AIFin · 30 Mar 2023
• A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
  Hongkang Li, M. Wang, Sijia Liu, Pin-Yu Chen
  ViT, MLT · 12 Feb 2023
• Controlling Personality Style in Dialogue with Zero-Shot Prompt-Based Learning
  Angela Ramirez, Mamon Alsalihy, Kartik Aggarwal, Cecilia Li, Liren Wu, M. Walker
  08 Feb 2023
• Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions
  Ido Ben-Shaul, Tomer Galanti, S. Dekel
  11 Jan 2023
• On the Ability of Graph Neural Networks to Model Interactions Between Vertices
  Noam Razin, Tom Verbin, Nadav Cohen
  29 Nov 2022
• What Language Model to Train if You Have One Million GPU Hours?
  Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, ..., Lintang Sutawika, Jaesung Tae, Zheng-Xin Yong, Julien Launay, Iz Beltagy
  MoE, AI4CE · 27 Oct 2022
• Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems
  D. Navon, A. Bronstein
  MoE · 17 Aug 2022
• GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
  Xi Yang, Aokun Chen, Nima M. Pournejatian, Hoo-Chang Shin, Kaleb E. Smith, ..., Duane A. Mitchell, W. Hogan, E. Shenkman, Jiang Bian, Yonghui Wu
  AI4MH, LM&MA · 02 Feb 2022
• Examining Scaling and Transfer of Language Model Architectures for Machine Translation
  Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
  01 Feb 2022
• Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks
  Noam Razin, Asaf Maman, Nadav Cohen
  27 Jan 2022
• Few-shot Learning with Multilingual Language Models
  Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, ..., Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab, Ves Stoyanov, Xian Li
  BDL, ELM, LRM · 20 Dec 2021
• Leveraging redundancy in attention with Reuse Transformers
  Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar
  13 Oct 2021
• Scaling Laws for Neural Machine Translation
  Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, M. Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry
  16 Sep 2021
• Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
  Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua
  09 May 2021
• Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
  Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
  05 Mar 2021
• Implicit Regularization in Tensor Factorization
  Noam Razin, Asaf Maman, Nadav Cohen
  19 Feb 2021
• On the Regularity of Attention
  James Vuckovic, A. Baratin, Rémi Tachet des Combes
  10 Feb 2021
• On the Computational Power of Transformers and its Implications in Sequence Modeling
  S. Bhattamishra, Arkil Patel, Navin Goyal
  16 Jun 2020
• Quasi-Equivalence of Width and Depth of Neural Networks
  Fenglei Fan, Rongjie Lai, Ge Wang
  06 Feb 2020
• Scaling Laws for Neural Language Models
  Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
  23 Jan 2020
• Wider or Deeper: Revisiting the ResNet Model for Visual Recognition
  Zifeng Wu, Chunhua Shen, A. Hengel
  SSeg · 30 Nov 2016
• A Decomposable Attention Model for Natural Language Inference
  Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
  06 Jun 2016