Improving Transformers with Probabilistic Attention Keys (arXiv:2110.08678)
16 October 2021
Tam Nguyen, Tan M. Nguyen, Dung D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher
Papers citing "Improving Transformers with Probabilistic Attention Keys" (10 of 10 papers shown):

Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior
Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski. 25 Apr 2025.

Transformer Meets Twicing: Harnessing Unattended Residual Information
Laziz U. Abdullaev, Tan M. Nguyen. 02 Mar 2025.

Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors
Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski. 21 Feb 2025.

On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, Dacheng Tao. 07 Apr 2023. [VLM]

Beyond EM Algorithm on Over-specified Two-Component Location-Scale Gaussian Mixtures
Tongzheng Ren, Fuheng Cui, Sujay Sanghavi, Nhat Ho. 23 May 2022.

An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models
Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel A. Ward. 16 May 2022.

Architecture Agnostic Federated Learning for Neural Networks
Disha Makhija, Xing Han, Nhat Ho, Joydeep Ghosh. 15 Feb 2022. [FedML]

How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies
Bao Wang, Hedi Xia, Tan M. Nguyen, Stanley J. Osher. 13 Oct 2021. [AI4CE]

Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier. 12 Mar 2020. [MoE]

A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit. 06 Jun 2016.