1910.05895
Transformers without Tears: Improving the Normalization of Self-Attention
14 October 2019
Toan Q. Nguyen
Julian Salazar
Papers citing
"Transformers without Tears: Improving the Normalization of Self-Attention"
50 / 149 papers shown
Title
A Generative Re-ranking Model for List-level Multi-objective Optimization at Taobao
Yue Meng
Cheng Guo
Yi Cao
Tong Liu
Bo Zheng
16
0
0
12 May 2025
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Ashwinee Panda
Vatsal Baherwani
Zain Sarwar
Benjamin Thérien
Supriyo Chakraborty
Tom Goldstein
MoE
37
0
0
16 Apr 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Ryan Cotterell
38
106
0
10 Apr 2025
Transformers without Normalization
Jiachen Zhu
Xinlei Chen
Kaiming He
Yann LeCun
Zhuang Liu
ViT
OffRL
51
7
0
13 Mar 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo
Yutao Zeng
Ya Wang
Sijun Zhang
Jian Yang
Xiaoqing Li
Xun Zhou
Jinwen Ma
46
0
0
06 Mar 2025
SAGE-Amine: Generative Amine Design with Multi-Property Optimization for Efficient CO2 Capture
Hocheol Lim
Hyein Cho
Jeonghoon Kim
67
0
0
04 Mar 2025
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang
Haotian Hu
Zhenyu (Allen) Zhang
Gaojie Jin
X. Li
...
Tianlong Chen
Lu Liu
Qingsong Wen
Zhangyang Wang
Shiwei Liu
MQ
33
0
0
24 Feb 2025
ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition
Muhammad Waseem Akram
Stefano Dettori
V. Colla
Giorgio Buttazzo
52
0
0
17 Feb 2025
The Curse of Depth in Large Language Models
Wenfang Sun
Xinyuan Song
Pengxiang Li
Lu Yin
Yefeng Zheng
Shiwei Liu
62
4
0
09 Feb 2025
Merino: Entropy-driven Design for Generative Language Models on IoT Devices
Youpeng Zhao
Ming Lin
Huadong Tang
Qiang Wu
Jun Wang
75
0
0
28 Jan 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang
Ziquan Zhu
Gaojie Jin
Lu Liu
Zhangyang Wang
Shiwei Liu
36
1
0
12 Jan 2025
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li
Lu Yin
Shiwei Liu
70
4
0
18 Dec 2024
Reducing Reasoning Costs: The Path of Optimization for Chain of Thought via Sparse Attention Mechanism
Libo Wang
LRM
AI4CE
46
3
0
14 Nov 2024
Training Neural Networks as Recognizers of Formal Languages
Alexandra Butoi
Ghazal Khalighinejad
Anej Svete
Josef Valvoda
Ryan Cotterell
Brian DuSell
NAI
36
2
0
11 Nov 2024
SeisLM: a Foundation Model for Seismic Waveforms
Tianlin Liu
Jannes Münchmeyer
Laura Laurenti
C. Marone
Maarten V. de Hoop
Ivan Dokmanić
VLM
16
4
0
21 Oct 2024
Extracting Finite State Machines from Transformers
Rik Adriaensen
Jaron Maene
AI4CE
19
0
0
08 Oct 2024
DimOL: Dimensional Awareness as A New 'Dimension' in Operator Learning
Yichen Song
Yunbo Wang
Xiaokang Yang
AI4CE
48
0
0
08 Oct 2024
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Kosuke Nishida
Kyosuke Nishida
Kuniko Saito
18
1
0
07 Oct 2024
Apple Intelligence Foundation Language Models
Tom Gunter
Zirui Wang
Chong-Jun Wang
Ruoming Pang
Andy Narayanan
...
Xinwen Liu
Yang Zhao
Yin Xia
Zhile Ren
Zhongzheng Ren
32
32
0
29 Jul 2024
Transformer Normalisation Layers and the Independence of Semantic Subspaces
S. Menary
Samuel Kaski
Andre Freitas
41
2
0
25 Jun 2024
Accelerating evolutionary exploration through language model-based transfer learning
M. Reissmann
Yuan Fang
Andrew S. H. Ooi
R. D. Sandberg
26
2
0
07 Jun 2024
When predict can also explain: few-shot prediction to select better neural latents
Kabir Dabholkar
Omri Barak
BDL
47
0
0
23 May 2024
PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
Stephen Lawrence Bothwell
Brian DuSell
David Chiang
Brian Krostenko
33
0
0
25 Apr 2024
Neural Shrödinger Bridge Matching for Pansharpening
Zihan Cao
Xiao Wu
Liang-Jian Deng
DiffM
53
2
0
17 Apr 2024
Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation
Stephen Lawrence Bothwell
Abigail Swenor
David Chiang
20
1
0
11 Apr 2024
Low-resource neural machine translation with morphological modeling
Antoine Nzeyimana
22
3
0
03 Apr 2024
ChatGPT Alternative Solutions: Large Language Models Survey
H. Alipour
Nick Pendar
Kohinoor Roy
LM&MA
22
4
0
21 Mar 2024
JointMotion: Joint Self-supervision for Joint Motion Prediction
Royden Wagner
Ömer Sahin Tas
Marvin Klemp
Carlos Fernandez Lopez
TTA
31
1
0
08 Mar 2024
Compact Speech Translation Models via Discrete Speech Units Pretraining
Tsz Kin Lam
Alexandra Birch
Barry Haddow
45
2
0
29 Feb 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang
Congliang Chen
Tian Ding
Ziniu Li
Ruoyu Sun
Zhimin Luo
24
39
0
26 Feb 2024
Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules
Joseph C. Kim
David Bloore
Karan Kapoor
Jun Feng
Ming-Hong Hao
Mengdi Wang
27
7
0
08 Jan 2024
ClusterComm: Discrete Communication in Decentralized MARL using Internal Representation Clustering
Robert Muller
Hasan Turalic
Thomy Phan
Michael Kolle
Jonas Nusslein
Claudia Linnhoff-Popien
OffRL
28
1
0
07 Jan 2024
Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems
M. Seiler
P. Kerschke
Heike Trautmann
13
6
0
02 Jan 2024
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase
Shun Kiyono
Sosuke Kobayashi
Jun Suzuki
13
13
0
28 Dec 2023
Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation
J. Hu
Roberto Cavicchioli
Giulia Berardinelli
Alessandro Capotondi
23
2
0
26 Dec 2023
GenCast: Diffusion-based ensemble forecasting for medium-range weather
Ilan Price
Alvaro Sanchez-Gonzalez
Ferran Alet
Tom R. Andersson
Andrew El-Kadi
...
Jacklynn Stott
Shakir Mohamed
Peter W. Battaglia
Rémi R. Lam
Matthew Willson
26
105
0
25 Dec 2023
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
Kaiyue Wen
Yuchen Li
Bing Liu
Andrej Risteski
13
21
0
03 Dec 2023
Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines
Stephen Lawrence Bothwell
Justin DeBenedetto
Theresa Crnkovich
Hildegund Müller
David Chiang
ObjD
19
2
0
30 Nov 2023
Global Transformer Architecture for Indoor Room Temperature Forecasting
Alfredo V. Clemente
A. Nocente
Massimiliano Ruocco
AI4CE
6
1
0
31 Oct 2023
Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
Michael Gunther
Jackmin Ong
Isabelle Mohr
Alaeddine Abdessalem
Tanguy Abel
...
Saba Sturua
Bo Wang
Maximilian Werk
Nan Wang
Han Xiao
RALM
27
56
0
30 Oct 2023
AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents
Jake Grigsby
Linxi Fan
Yuke Zhu
OffRL
LM&Ro
27
10
0
15 Oct 2023
Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
Brian DuSell
David Chiang
20
12
0
03 Oct 2023
On Separate Normalization in Self-supervised Transformers
Xiaohui Chen
Yinkai Wang
Yuanqi Du
S. Hassoun
Liping Liu
ViT
19
1
0
22 Sep 2023
KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
Antoine Nzeyimana
SSL
11
0
0
23 Aug 2023
A Comprehensive Overview of Large Language Models
Humza Naveed
Asad Ullah Khan
Shi Qiu
Muhammad Saqib
Saeed Anwar
Muhammad Usman
Naveed Akhtar
Nick Barnes
Ajmal Saeed Mian
OffRL
46
514
0
12 Jul 2023
Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning
Elie Bursztein
Luca Invernizzi
Karel Král
D. Moghimi
J. Picod
Marina Zhang
AAML
26
5
0
12 Jun 2023
KIT's Multilingual Speech Translation System for IWSLT 2023
Danni Liu
Thai-Binh Nguyen
Sai Koneru
Enes Yavuz Ugan
Ngoc-Quan Pham
Tuan-Nam Nguyen
Tu Anh Dinh
Carlos Mullov
A. Waibel
J. Niehues
13
6
0
08 Jun 2023
MobileNMT: Enabling Translation in 15MB and 30ms
Ye Lin
Xiaohui Wang
Zhexi Zhang
Mingxuan Wang
Tong Xiao
Jingbo Zhu
MQ
17
1
0
07 Jun 2023
Using Sequences of Life-events to Predict Human Lives
Germans Savcisens
Tina Eliassi-Rad
L. K. Hansen
L. Mortensen
Lau Lilleholt
Anna Rogers
Ingo Zettler
Sune Lehmann
AI4TS
26
35
0
05 Jun 2023
Centered Self-Attention Layers
Ameen Ali
Tomer Galanti
Lior Wolf
28
6
0
02 Jun 2023