Transformers without Tears: Improving the Normalization of Self-Attention (arXiv:1910.05895)

14 October 2019
Toan Q. Nguyen, Julian Salazar

Papers citing "Transformers without Tears: Improving the Normalization of Self-Attention"

Showing 50 of 149 citing papers.
A Generative Re-ranking Model for List-level Multi-objective Optimization at Taobao
Yue Meng, Cheng Guo, Yi Cao, Tong Liu, Bo Zheng
12 May 2025

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Thérien, Supriyo Chakraborty, Tom Goldstein
MoE
16 Apr 2025

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, ..., Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell
10 Apr 2025

Transformers without Normalization
Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
ViT, OffRL
13 Mar 2025

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
06 Mar 2025

SAGE-Amine: Generative Amine Design with Multi-Property Optimization for Efficient CO2 Capture
Hocheol Lim, Hyein Cho, Jeonghoon Kim
04 Mar 2025

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang, Haotian Hu, Zhenyu (Allen) Zhang, Gaojie Jin, X. Li, ..., Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
MQ
24 Feb 2025

ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition
Muhammad Waseem Akram, Stefano Dettori, V. Colla, Giorgio Buttazzo
17 Feb 2025

The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
09 Feb 2025

Merino: Entropy-driven Design for Generative Language Models on IoT Devices
Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang
28 Jan 2025

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
12 Jan 2025

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li, Lu Yin, Shiwei Liu
18 Dec 2024

Reducing Reasoning Costs: The Path of Optimization for Chain of Thought via Sparse Attention Mechanism
Libo Wang
LRM, AI4CE
14 Nov 2024

Training Neural Networks as Recognizers of Formal Languages
Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell
NAI
11 Nov 2024

SeisLM: a Foundation Model for Seismic Waveforms
Tianlin Liu, Jannes Münchmeyer, Laura Laurenti, C. Marone, Maarten V. de Hoop, Ivan Dokmanić
VLM
21 Oct 2024

Extracting Finite State Machines from Transformers
Rik Adriaensen, Jaron Maene
AI4CE
08 Oct 2024

DimOL: Dimensional Awareness as A New 'Dimension' in Operator Learning
Yichen Song, Yunbo Wang, Xiaokang Yang
AI4CE
08 Oct 2024

Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
07 Oct 2024

Apple Intelligence Foundation Language Models
Tom Gunter, Zirui Wang, Chong-Jun Wang, Ruoming Pang, Andy Narayanan, ..., Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren
29 Jul 2024

Transformer Normalisation Layers and the Independence of Semantic Subspaces
S. Menary, Samuel Kaski, Andre Freitas
25 Jun 2024

Accelerating evolutionary exploration through language model-based transfer learning
M. Reissmann, Yuan Fang, Andrew S. H. Ooi, R. D. Sandberg
07 Jun 2024

When predict can also explain: few-shot prediction to select better neural latents
Kabir Dabholkar, Omri Barak
BDL
23 May 2024

PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
Stephen Lawrence Bothwell, Brian DuSell, David Chiang, Brian Krostenko
25 Apr 2024

Neural Shrödinger Bridge Matching for Pansharpening
Zihan Cao, Xiao Wu, Liang-Jian Deng
DiffM
17 Apr 2024

Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation
Stephen Lawrence Bothwell, Abigail Swenor, David Chiang
11 Apr 2024

Low-resource neural machine translation with morphological modeling
Antoine Nzeyimana
03 Apr 2024

ChatGPT Alternative Solutions: Large Language Models Survey
H. Alipour, Nick Pendar, Kohinoor Roy
LM&MA
21 Mar 2024

JointMotion: Joint Self-supervision for Joint Motion Prediction
Royden Wagner, Ömer Sahin Tas, Marvin Klemp, Carlos Fernandez Lopez
TTA
08 Mar 2024

Compact Speech Translation Models via Discrete Speech Units Pretraining
Tsz Kin Lam, Alexandra Birch, Barry Haddow
29 Feb 2024

Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhimin Luo
26 Feb 2024

Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules
Joseph C. Kim, David Bloore, Karan Kapoor, Jun Feng, Ming-Hong Hao, Mengdi Wang
08 Jan 2024

ClusterComm: Discrete Communication in Decentralized MARL using Internal Representation Clustering
Robert Muller, Hasan Turalic, Thomy Phan, Michael Kolle, Jonas Nusslein, Claudia Linnhoff-Popien
OffRL
07 Jan 2024

Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems
M. Seiler, P. Kerschke, Heike Trautmann
02 Jan 2024

Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
28 Dec 2023

Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation
J. Hu, Roberto Cavicchioli, Giulia Berardinelli, Alessandro Capotondi
26 Dec 2023

GenCast: Diffusion-based ensemble forecasting for medium-range weather
Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, ..., Jacklynn Stott, Shakir Mohamed, Peter W. Battaglia, Rémi R. Lam, Matthew Willson
25 Dec 2023

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
Kaiyue Wen, Yuchen Li, Bing Liu, Andrej Risteski
03 Dec 2023

Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines
Stephen Lawrence Bothwell, Justin DeBenedetto, Theresa Crnkovich, Hildegund Müller, David Chiang
ObjD
30 Nov 2023

Global Transformer Architecture for Indoor Room Temperature Forecasting
Alfredo V. Clemente, A. Nocente, Massimiliano Ruocco
AI4CE
31 Oct 2023

Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
Michael Gunther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, ..., Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, Han Xiao
RALM
30 Oct 2023

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents
Jake Grigsby, Linxi Fan, Yuke Zhu
OffRL, LM&Ro
15 Oct 2023

Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
Brian DuSell, David Chiang
03 Oct 2023

On Separate Normalization in Self-supervised Transformers
Xiaohui Chen, Yinkai Wang, Yuanqi Du, S. Hassoun, Liping Liu
ViT
22 Sep 2023

KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
Antoine Nzeyimana
SSL
23 Aug 2023

A Comprehensive Overview of Large Language Models
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Saeed Mian
OffRL
12 Jul 2023

Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning
Elie Bursztein, Luca Invernizzi, Karel Král, D. Moghimi, J. Picod, Marina Zhang
AAML
12 Jun 2023

KIT's Multilingual Speech Translation System for IWSLT 2023
Danni Liu, Thai-Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham, Tuan-Nam Nguyen, Tu Anh Dinh, Carlos Mullov, A. Waibel, J. Niehues
08 Jun 2023

MobileNMT: Enabling Translation in 15MB and 30ms
Ye Lin, Xiaohui Wang, Zhexi Zhang, Mingxuan Wang, Tong Xiao, Jingbo Zhu
MQ
07 Jun 2023

Using Sequences of Life-events to Predict Human Lives
Germans Savcisens, Tina Eliassi-Rad, L. K. Hansen, L. Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, Sune Lehmann
AI4TS
05 Jun 2023

Centered Self-Attention Layers
Ameen Ali, Tomer Galanti, Lior Wolf
02 Jun 2023