The Depth-to-Width Interplay in Self-Attention
arXiv:2006.12467 · 22 June 2020
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua

Papers citing "The Depth-to-Width Interplay in Self-Attention"

30 / 30 papers shown
• Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
  Thomson Yen, Andrew Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong
  26 Mar 2025
• NeoBERT: A Next-Generation BERT
  Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
  AI4TS · 26 Feb 2025
• Distributional Scaling Laws for Emergent Capabilities
  Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
  LRM · 24 Feb 2025
• Approximation Rate of the Transformer Architecture for Sequence Modeling
  Hao Jiang, Qianxiao Li
  03 Jan 2025
• Introducing Hybrid Modeling with Time-series-Transformers: A Comparative Study of Series and Parallel Approach in Batch Crystallization
  Niranjan Sitapure, J. Kwon
  25 Jul 2023
• A Comprehensive Overview of Large Language Models
  Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Saeed Mian
  OffRL · 12 Jul 2023
• Max-Margin Token Selection in Attention Mechanism
  Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak
  23 Jun 2023
• CrystalGPT: Enhancing system-to-system transferability in crystallization prediction and control using time-series-transformers
  Niranjan Sitapure, J. Kwon
  31 May 2023
• BloombergGPT: A Large Language Model for Finance
  Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, P. Kambadur, David S. Rosenberg, Gideon Mann
  AIFin · 30 Mar 2023
• A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
  Hongkang Li, M. Wang, Sijia Liu, Pin-Yu Chen
  ViT, MLT · 12 Feb 2023
• Controlling Personality Style in Dialogue with Zero-Shot Prompt-Based Learning
  Angela Ramirez, Mamon Alsalihy, Kartik Aggarwal, Cecilia Li, Liren Wu, M. Walker
  08 Feb 2023
• Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions
  Ido Ben-Shaul, Tomer Galanti, S. Dekel
  11 Jan 2023
• On the Ability of Graph Neural Networks to Model Interactions Between Vertices
  Noam Razin, Tom Verbin, Nadav Cohen
  29 Nov 2022
• What Language Model to Train if You Have One Million GPU Hours?
  Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, ..., Lintang Sutawika, Jaesung Tae, Zheng-Xin Yong, Julien Launay, Iz Beltagy
  MoE, AI4CE · 27 Oct 2022
• Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems
  D. Navon, A. Bronstein
  MoE · 17 Aug 2022
• GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
  Xi Yang, Aokun Chen, Nima M. Pournejatian, Hoo-Chang Shin, Kaleb E. Smith, ..., Duane A. Mitchell, W. Hogan, E. Shenkman, Jiang Bian, Yonghui Wu
  AI4MH, LM&MA · 02 Feb 2022
• Examining Scaling and Transfer of Language Model Architectures for Machine Translation
  Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
  01 Feb 2022
• Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks
  Noam Razin, Asaf Maman, Nadav Cohen
  27 Jan 2022
• Few-shot Learning with Multilingual Language Models
  Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, ..., Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab, Ves Stoyanov, Xian Li
  BDL, ELM, LRM · 20 Dec 2021
• Leveraging redundancy in attention with Reuse Transformers
  Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar
  13 Oct 2021
• Scaling Laws for Neural Machine Translation
  Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, M. Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry
  16 Sep 2021
• Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
  Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua
  09 May 2021
• Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
  Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
  05 Mar 2021
• Implicit Regularization in Tensor Factorization
  Noam Razin, Asaf Maman, Nadav Cohen
  19 Feb 2021
• On the Regularity of Attention
  James Vuckovic, A. Baratin, Rémi Tachet des Combes
  10 Feb 2021
• On the Computational Power of Transformers and its Implications in Sequence Modeling
  S. Bhattamishra, Arkil Patel, Navin Goyal
  16 Jun 2020
• Quasi-Equivalence of Width and Depth of Neural Networks
  Fenglei Fan, Rongjie Lai, Ge Wang
  06 Feb 2020
• Scaling Laws for Neural Language Models
  Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
  23 Jan 2020
• Wider or Deeper: Revisiting the ResNet Model for Visual Recognition
  Zifeng Wu, Chunhua Shen, A. Hengel
  SSeg · 30 Nov 2016
• A Decomposable Attention Model for Natural Language Inference
  Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
  06 Jun 2016