ResearchTrend.AI: papers citing arXiv:1905.09418
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
23 May 2019
Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

50 / 741 papers shown
  • Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates (04 Dec 2025). Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
  • Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs (03 Dec 2025). N. Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
  • Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads (27 Nov 2025). Merey Orazaly, Fariza Temirkhanova, Jurn-Gyu Park
  • Multi-speaker Attention Alignment for Multimodal Social Interaction (22 Nov 2025). Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, Yoichi Sato
  • StableMorph: High-Quality Face Morph Generation with Stable Diffusion (11 Nov 2025). Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, C. Busch
  • COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation (05 Nov 2025). Kenji Sahay, Snigdha Pandya, Rohan Nagale, Anna Lin, Shikhar Shiromani, Kevin Zhu, Dev Sunishchal
  • GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding (02 Nov 2025). Shijie Zhou, Viet Dac Lai, Hao Tan, Jihyung Kil, Wanrong Zhu, Changyou Chen, Ruiyi Zhang
  • TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination (26 Oct 2025). Omar Naim, Krish Sharma, Nicholas M. Asher
  • Head Pursuit: Probing Attention Specialization in Multimodal Transformers (24 Oct 2025). Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
  • Benefits and Limitations of Communication in Multi-Agent Reasoning (14 Oct 2025). Michael Rizvi-Martel, S. Bhattamishra, Neil Rathi, Guillaume Rabusseau, Michael Hahn
  • Cognitive Load Traces as Symbolic and Visual Accounts of Deep Model Cognition (13 Oct 2025). Dong Liu, Yanxuan Yu
  • Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models (13 Oct 2025). Samuel Lippl, Thomas McGee, Kimberly Lopez, Ziwen Pan, Pierce Zhang, Salma Ziadi, Oliver Eberle, Ida Momennejad
  • Medical Interpretability and Knowledge Maps of Large Language Models (13 Oct 2025). Razvan Marinescu, Victoria-Elisabeth Gruber, Diego Fajardo
  • Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning (10 Oct 2025). Minsik Choi, Hyegang Son, Changhoon Kim, Young Geun Kim
  • How to Teach Large Multimodal Models New Skills (09 Oct 2025). Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
  • HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data (08 Oct 2025). Maria Mahbub, Robert J. Klein, Myvizhi Esai Selvan, Rowena Yip, Claudia Henschke, ..., Eileen McAllister, Samuel M. Aguayo, Zeynep H. Gümüş, Ioana Danciu, VA Million Veteran Program
  • Enhancing Concept Localization in CLIP-based Concept Bottleneck Models (08 Oct 2025). Rémi Kazmierczak, Steve Azzolin, Eloise Berthier, Goran Frehse, Gianni Franchi
  • Downsized and Compromised?: Assessing the Faithfulness of Model Compression (07 Oct 2025). Moumita Kamal, Douglas A. Talbert
  • HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks (05 Oct 2025). Nghiem Tuong Diep, Dung D. Le, Tuan Truong, Tan Dinh, Huy Le Nguyen, Nhat Ho
  • Contrastive Retrieval Heads Improve Attention-Based Re-Ranking (02 Oct 2025). Linh Tran, Yulong Li, Radu Florian, Wei-Ju Sun
  • Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors (01 Oct 2025). Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
  • Interpreting Language Models Through Concept Descriptions: A Survey (01 Oct 2025). Nils Feldhus, Laura Kopf
  • Effective Model Pruning: Measure The Redundancy of Model Components (30 Sep 2025). Yixuan Wang, Dan Guralnik, Saiedeh Akbari, Warren E. Dixon
  • The silence of the weights: an investigation of structural pruning strategies for attention-based audio signal architectures (30 Sep 2025). Andrea Diecidue, C. Barbano, Piero Fraternali, Mathieu Fontaine, Enzo Tartaglione
  • Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training (30 Sep 2025). Yein Park, Minbyul Jeong, Jaewoo Kang
  • Layer-wise dynamic rank for compressing large language models (30 Sep 2025). Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang
  • Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization (28 Sep 2025). Chris Kolb, Laetitia Frost, J. Herbinger, David Rügamer
  • On the Capacity of Self-Attention (26 Sep 2025). Micah Adler
  • Multilingual Vision-Language Models, A Survey (26 Sep 2025). Andrei-Alexandru Manea, Jindřich Libovický
  • What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples? (26 Sep 2025). Mohammed Sabry, Anya Belz
  • AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion (25 Sep 2025). Junyoung Koh, Soo Yong Kim, Gyu Hyeong Choi, Yongwon Choi
  • Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research (19 Sep 2025). Richard Diehl Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, P. Buttery
  • GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings (13 Sep 2025). Yixuan Tang, Yi Yang
  • Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts (05 Sep 2025). Cheng Li, Jiexiong Liu, Yixuan Chen, Jie Ji
  • Enhancing Fairness in Skin Lesion Classification for Medical Diagnosis Using Prune Learning (31 Aug 2025). Kuniko Paxton, Mohammed Naveed Akram, Dhavalkumar Thakker, Y. Papadopoulos, Tanaya Maslekar
  • OASIS: Harnessing Diffusion Adversarial Network for Ocean Salinity Imputation using Sparse Drifter Trajectories (29 Aug 2025). Bo Li, Yingqi Feng, Ming Jin, Xin-Yang Zheng, Yufei Tang, ..., Qinghua Lu, Jingwei Yao, Shirui Pan, H. Zhang, Xingquan Zhu
  • Rethinking Layer-wise Model Merging through Chain of Merges (29 Aug 2025). Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
  • CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference. IEEE Transactions on Computers (IEEE Trans. Comput.), 2025 (28 Aug 2025). Guanyu Xu, Zhiwei Hao, Li Shen, Yong Luo, Fuhui Sun, Xiaoyan Wang, Han Hu, Yonggang Wen
  • Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models (14 Aug 2025). Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou
  • What are you sinking? A geometric approach on attention sink (04 Aug 2025). Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri
  • Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models (02 Aug 2025). Sushant Mehta, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
  • Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics (01 Aug 2025). Tom Or, Omri Azencot
  • Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study (28 Jul 2025). Yiran Huang, Lukas Thede, Goran Frehse, Wenjia Xu, Zeynep Akata
  • Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers. Conference on Uncertainty in Artificial Intelligence (UAI), 2025 (27 Jul 2025). Sungmin Han, Jeonghyun Lee, Sangkyun Lee
  • Attention (as Discrete-Time Markov) Chains (23 Jul 2025). Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano
  • Knowledge Fusion via Bidirectional Information Aggregation (11 Jul 2025). Songlin Zhai, Guilin Qi, Yue Wang, Yuan Meng
  • BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers (03 Jul 2025). Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwaśniewski, Yi Zhu, Haruto Fujii, ..., Maciej Besta, Kentaro Katayama, Takumi Honda, Yusuke Nagasaka, Torsten Hoefler
  • Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation (01 Jul 2025). Feng Lin, Marco Chen, Haokui Zhang, Xiaotian Yu, Guangming Lu, Rong Xiao
  • Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention (28 Jun 2025). Haitz Sáez de Ocáriz Borde
  • Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training (27 Jun 2025). Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, Keith Ross
Page 1 of 15