Mimetic Initialization of Self-Attention Layers (arXiv:2305.09828)
International Conference on Machine Learning (ICML), 2023
Asher Trockman, J. Zico Kolter
16 May 2023
Papers citing "Mimetic Initialization of Self-Attention Layers" (31 papers)

Cutting the Skip: Training Residual-Free Transformers
Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey
30 Sep 2025

Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification
Ayaka Tsutsumi, Guang Li, Ren Togo, Takahiro Ogawa, Satoshi Kondo, Miki Haseyama
28 Aug 2025

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, ..., Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
09 Jun 2025

Scalable Complexity Control Facilitates Reasoning Ability of LLMs
Liangkai Hang, Junjie Yao, Zhiwei Bai, Jiahao Huo, Yang Chen, ..., Feiyu Xiong, Y. Zhang, Weinan E, Hongkang Yang, Zhi-hai Xu
29 May 2025

Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning
Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney
28 May 2025

Structured Initialization for Vision Transformers
Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
26 May 2025

The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training
Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe
15 Feb 2025

Freqformer: Frequency-Domain Transformer for 3-D Reconstruction and Quantification of Human Retinal Vasculature
Lingyun Wang, Bingjie Wang, Jay Chhablani, J. Sahel, Shaohua Pi
17 Nov 2024

On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Neural Information Processing Systems (NeurIPS), 2024
Alexander C. Li, Yuandong Tian, Bin Chen, Deepak Pathak, Xinlei Chen
14 Nov 2024

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
A. S. Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, ..., Rakesh Shivanna, Sashank J. Reddi, A. Menon, Rohan Anil, Sanjiv Kumar
24 Oct 2024

Mimetic Initialization Helps State Space Models Learn to Recall
Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli
14 Oct 2024

FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Xin Geng
28 Sep 2024

Kolmogorov-Arnold Transformer
Xingyi Yang, Xinchao Wang
16 Sep 2024

Reasoning in Large Language Models: A Geometric Perspective
Romain Cosentino, Sarath Shekkizhar
02 Jul 2024

Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers
Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Ahmet Enis Cetin, Ulas Bagci
22 May 2024

Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
Neural Information Processing Systems (NeurIPS), 2024
Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Z. Xu
08 May 2024

Structured Initialization for Attention in Vision Transformers
Jianqiao Zheng, Xueqian Li, Simon Lucey
01 Apr 2024

SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention
Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko
15 Feb 2024

Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features
International Conference on Machine Learning (ICML), 2024
Simone Bombari, Marco Mondelli
05 Feb 2024

Convolutional Initialization for Data-Efficient Vision Transformers
Jianqiao Zheng, Xueqian Li, Simon Lucey
23 Jan 2024

Setting the Record Straight on Transformer Oversmoothing
G. Dovonon, M. Bronstein, Matt J. Kusner
09 Jan 2024

Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
International Conference on Machine Learning (ICML), 2024
Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
04 Dec 2023

Initializing Models with Larger Ones
International Conference on Learning Representations (ICLR), 2024
Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu
30 Nov 2023

Simplifying Transformer Blocks
International Conference on Learning Representations (ICLR), 2024
Bobby He, Thomas Hofmann
03 Nov 2023

When can transformers reason with abstract symbols?
Enric Boix-Adserà, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Josh Susskind
15 Oct 2023

LEMON: Lossless Model Expansion
International Conference on Learning Representations (ICLR), 2024
Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Yanghua Peng, Tian Ding, Hongxia Yang
12 Oct 2023

Uncovering hidden geometry in Transformers via disentangling position and context
Jiajun Song, Yiqiao Zhong
07 Oct 2023

Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset
Zixun Huang, Keling Yao, Seth Z. Zhao, Chuanyu Pan, Chenfeng Xu
24 Sep 2023

What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Neural Information Processing Systems (NeurIPS), 2023
Hengyu Fu, Tianyu Guo, Yu Bai, Song Mei
21 Jul 2023

Trained Transformers Learn Linear Models In-Context
Journal of Machine Learning Research (JMLR), 2023
Ruiqi Zhang, Spencer Frei, Peter L. Bartlett
16 Jun 2023

On the Relationship between Self-Attention and Convolutional Layers
International Conference on Learning Representations (ICLR), 2020
Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
08 Nov 2019