Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Annual Meeting of the Association for Computational Linguistics (ACL), 2019
23 May 2019
Elena Voita
David Talbot
Fedor Moiseev
Rico Sennrich
Ivan Titov

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

50 / 741 papers shown
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
Neural Information Processing Systems (NeurIPS), 2021
Cheng-I Jeff Lai
Yang Zhang
Alexander H. Liu
Shiyu Chang
Yi-Lun Liao
Yung-Sung Chuang
Kaizhi Qian
Sameer Khurana
David D. Cox
James R. Glass
10 Jun 2021
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Tyler A. Chang
Yifan Xu
Weijian Xu
Zhuowen Tu
10 Jun 2021
Patch Slimming for Efficient Vision Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Yehui Tang
Kai Han
Yunhe Wang
Chang Xu
Jianyuan Guo
Chao Xu
Dacheng Tao
05 Jun 2021
On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
Findings of the Association for Computational Linguistics, 2021
Tianchu Ji
Shraddhan Jain
M. Ferdman
Peter Milder
H. Andrew Schwartz
Niranjan Balasubramanian
02 Jun 2021
Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads?
Findings of the Association for Computational Linguistics, 2021
Min Namgung
Laurent Besacier
Vassilina Nikoulina
D. Schwab
31 May 2021
Cascaded Head-colliding Attention
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Lin Zheng
Zhiyong Wu
Lingpeng Kong
31 May 2021
Greedy-layer Pruning: Speeding up Transformer Models for Natural Language Processing
Pattern Recognition Letters (PR), 2021
David Peer
Sebastian Stabinger
Stefan Engl
A. Rodríguez-Sánchez
31 May 2021
On Compositional Generalization of Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Yafu Li
Yongjing Yin
Yulong Chen
Yue Zhang
31 May 2021
On the Interplay Between Fine-tuning and Composition in Transformers
Findings of the Association for Computational Linguistics, 2021
Lang-Chi Yu
Allyson Ettinger
31 May 2021
Cross-Lingual Abstractive Summarization with Limited Parallel Resources
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Yu Bai
Yang Gao
Heyan Huang
28 May 2021
Inspecting the concept knowledge graph encoded by modern language models
Findings of the Association for Computational Linguistics, 2021
Carlos Aspillaga
Marcelo Mendoza
Alvaro Soto
27 May 2021
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Findings of the Association for Computational Linguistics, 2021
Weijia Xu
Shuming Ma
Dongdong Zhang
Marine Carpuat
27 May 2021
LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond
Artificial Intelligence (AI), 2021
Daniel Loureiro
A. Jorge
Jose Camacho-Collados
26 May 2021
Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Chen Liang
Simiao Zuo
Minshuo Chen
Haoming Jiang
Xiaodong Liu
Pengcheng He
T. Zhao
Weizhu Chen
25 May 2021
A Non-Linear Structural Probe
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Jennifer C. White
Tiago Pimentel
Naomi Saphra
Robert Bamler
21 May 2021
Medical Image Segmentation Using Squeeze-and-Expansion Transformers
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Shaohua Li
Xiuchao Sui
Xiangde Luo
Xinxing Xu
Yong Liu
Rick Siow Mong Goh
20 May 2021
Rationalization through Concepts
Findings of the Association for Computational Linguistics, 2021
Diego Antognini
Boi Faltings
11 May 2021
FNet: Mixing Tokens with Fourier Transforms
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
James Lee-Thorp
Joshua Ainslie
Ilya Eckstein
Santiago Ontanon
09 May 2021
Long-Span Summarization via Local Attention and Content Selection
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Potsawee Manakul
Mark Gales
08 May 2021
Let's Play Mono-Poly: BERT Can Reveal Words' Polysemy Level and Partitionability into Senses
Transactions of the Association for Computational Linguistics (TACL), 2021
Aina Garí Soler
Marianna Apidianaki
29 Apr 2021
Accounting for Agreement Phenomena in Sentence Comprehension with Transformer Language Models: Effects of Similarity-based Interference on Surprisal and Attention
Workshop on Cognitive Modeling and Computational Linguistics (CMCL), 2021
S. Ryu
Richard L. Lewis
26 Apr 2021
Easy and Efficient Transformer: Scalable Inference Solution For large NLP model
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
GongZheng Li
Yadong Xi
Jingzhen Ding
Duan Wang
Bai Liu
Changjie Fan
Xiaoxi Mao
Zeng Zhao
26 Apr 2021
Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation
International Conference on Artificial Neural Networks (ICANN), 2021
Cheng Chen
Yichun Yin
Lifeng Shang
Zhi Wang
Xin Jiang
Xiao Chen
Qun Liu
24 Apr 2021
Code Structure Guided Transformer for Source Code Summarization
ACM Transactions on Software Engineering and Methodology (TOSEM), 2021
Shuzheng Gao
Cuiyun Gao
Yulan He
Jichuan Zeng
L. Nie
Xin Xia
Michael R. Lyu
19 Apr 2021
BigGreen at SemEval-2021 Task 1: Lexical Complexity Prediction with Assembly Models
International Workshop on Semantic Evaluation (SemEval), 2021
A. Islam
Weicheng Ma
Soroush Vosoughi
19 Apr 2021
Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Mozhdeh Gheini
Xiang Ren
Jonathan May
18 Apr 2021
Knowledge Neurons in Pretrained Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Damai Dai
Li Dong
Y. Hao
Zhifang Sui
Baobao Chang
Furu Wei
18 Apr 2021
Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Dongkuan Xu
Ian En-Hsu Yen
Jinxi Zhao
Zhibin Xiao
18 Apr 2021
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Fangyu Liu
Ivan Vulić
Anna Korhonen
Nigel Collier
16 Apr 2021
Effect of Post-processing on Contextualized Word Representations
International Conference on Computational Linguistics (COLING), 2021
Hassan Sajjad
Firoj Alam
Fahim Dalvi
Nadir Durrani
15 Apr 2021
Sparse Attention with Linear Units
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Biao Zhang
Ivan Titov
Rico Sennrich
14 Apr 2021
Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey
Journal of Artificial Intelligence Research (JAIR), 2021
Danielle Saunders
14 Apr 2021
DirectProbe: Studying Representations without Classifiers
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Yichu Zhou
Vivek Srikumar
13 Apr 2021
UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Zhen Wu
Lijun Wu
Qi Meng
Ziheng Lu
Shufang Xie
Tao Qin
Xinyu Dai
Tie-Yan Liu
11 Apr 2021
On Biasing Transformer Attention Towards Monotonicity
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Annette Rios Gonzales
Chantal Amrhein
Noëmi Aepli
Rico Sennrich
08 Apr 2021
How Transferable are Reasoning Patterns in VQA?
Computer Vision and Pattern Recognition (CVPR), 2021
Corentin Kervadec
Theo Jaunet
G. Antipov
M. Baccouche
Romain Vuillemot
Christian Wolf
08 Apr 2021
Attention Head Masking for Inference Time Content Selection in Abstractive Summarization
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Shuyang Cao
Lu Wang
06 Apr 2021
Efficient Attentions for Long Document Summarization
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
L. Huang
Shuyang Cao
Nikolaus Nova Parulian
Heng Ji
Lu Wang
05 Apr 2021
VisQA: X-raying Vision and Language Reasoning in Transformers
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2021
Theo Jaunet
Corentin Kervadec
Romain Vuillemot
G. Antipov
M. Baccouche
Christian Wolf
02 Apr 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Hila Chefer
Shir Gur
Lior Wolf
29 Mar 2021
Learning on heterogeneous graphs using high-order relations
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
See Hian Lee
Feng Ji
Wee Peng Tay
29 Mar 2021
Dodrio: Exploring Transformer Models with Interactive Visualization
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Zijie J. Wang
Robert Turko
Duen Horng Chau
26 Mar 2021
Understanding Robustness of Transformers for Image Classification
IEEE International Conference on Computer Vision (ICCV), 2021
Srinadh Bhojanapalli
Ayan Chakrabarti
Daniel Glasner
Daliang Li
Thomas Unterthiner
Andreas Veit
26 Mar 2021
Pruning-then-Expanding Model for Domain Adaptation of Neural Machine Translation
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Shuhao Gu
Yang Feng
Wanying Xie
25 Mar 2021
Structured Co-reference Graph Attention for Video-grounded Dialogue
AAAI Conference on Artificial Intelligence (AAAI), 2021
Junyeong Kim
Sunjae Yoon
Dahyun Kim
Chang D. Yoo
24 Mar 2021
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures
IEEE Access, 2021
Sushant Singh
A. Mahmood
23 Mar 2021
Learning Calibrated-Guidance for Object Detection in Aerial Images
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), 2021
Zongqi Wei
Dong Liang
Dong Zhang
Liyan Zhang
Qixiang Geng
Mingqiang Wei
Huiyu Zhou
21 Mar 2021
Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond
Knowledge and Information Systems (KAIS), 2021
Xuhong Li
Haoyi Xiong
Xingjian Li
Xuanyu Wu
Xiao Zhang
Ji Liu
Jiang Bian
Dejing Dou
19 Mar 2021
Approximating How Single Head Attention Learns
Charles Burton Snell
Ruiqi Zhong
Dan Klein
Jacob Steinhardt
13 Mar 2021
An empirical analysis of phrase-based and neural machine translation
Hamidreza Ghader
04 Mar 2021
Page 12 of 15