Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Annual Meeting of the Association for Computational Linguistics (ACL), 2019
23 May 2019
Elena Voita
David Talbot
F. Moiseev
Rico Sennrich
Ivan Titov

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

Showing 50 of 741 citing papers.
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Bang An
Jie Lyu
Zhenyi Wang
Chunyuan Li
Changwei Hu
Fei Tan
Ruiyi Zhang
Yifan Hu
Changyou Chen
AAML
20 Sep 2020
Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2020
Rajiv Movva
Jason Zhao
17 Sep 2020
Efficient Transformers: A Survey
ACM Computing Surveys (ACM CSUR), 2020
Yi Tay
Mostafa Dehghani
Dara Bahri
Donald Metzler
VLM
14 Sep 2020
Time-based Sequence Model for Personalization and Recommendation Systems
T. Ishkhanov
Maxim Naumov
Xianjie Chen
Yan Zhu
Yuan Zhong
A. Azzolini
Chonglin Sun
Frank Jiang
Andrey Malevich
Liang Xiong
27 Aug 2020
Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction
ACM Conference on Recommender Systems (RecSys), 2020
Darius Afchar
Romain Hennequin
FAtt, XAI
26 Aug 2020
TSAM: Temporal Link Prediction in Directed Networks based on Self-Attention Mechanism
Jinsong Li
Jianhua Peng
Shuxin Liu
Lintianran Weng
Cong Li
23 Aug 2020
On the Importance of Local Information in Transformer Based Models
Madhura Pande
Aakriti Budhraja
Preksha Nema
Pratyush Kumar
Mitesh M. Khapra
13 Aug 2020
Compression of Deep Learning Models for Text: A Survey
ACM Transactions on Knowledge Discovery from Data (TKDD), 2020
Manish Gupta
Puneet Agrawal
VLM, MedIm, AI4CE
12 Aug 2020
DeLighT: Deep and Light-weight Transformer
Sachin Mehta
Marjan Ghazvininejad
Srini Iyer
Luke Zettlemoyer
Hannaneh Hajishirzi
VLM
03 Aug 2020
Spatially Aware Multimodal Transformers for TextVQA
European Conference on Computer Vision (ECCV), 2020
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
23 Jul 2020
Data Movement Is All You Need: A Case Study on Optimizing Transformers
A. Ivanov
Nikoli Dryden
Tal Ben-Nun
Shigang Li
Torsten Hoefler
30 Jun 2020
Multi-Head Attention: Collaborate Instead of Concatenate
Jean-Baptiste Cordonnier
Andreas Loukas
Martin Jaggi
29 Jun 2020
BERTology Meets Biology: Interpreting Attention in Protein Language Models
Jesse Vig
Ali Madani
Lav Varshney
Caiming Xiong
R. Socher
Nazneen Rajani
26 Jun 2020
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
Grégoire Mialon
Dexiong Chen
Alexandre d’Aspremont
Julien Mairal
OT
22 Jun 2020
On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra
Arkil Patel
Navin Goyal
16 Jun 2020
Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
Corentin Kervadec
G. Antipov
M. Baccouche
Christian Wolf
OOD
09 Jun 2020
BERT Loses Patience: Fast and Robust Inference with Early Exit
Wangchunshu Zhou
Canwen Xu
Tao Ge
Julian McAuley
Ke Xu
Furu Wei
07 Jun 2020
Distilling Neural Networks for Greener and Faster Dependency Parsing
International Workshop/Conference on Parsing Technologies (IWPT), 2020
Mark Anderson
Carlos Gómez-Rodríguez
01 Jun 2020
CNRL at SemEval-2020 Task 5: Modelling Causal Reasoning in Language with Multi-Head Self-Attention Weights based Counterfactual Detection
International Workshop on Semantic Evaluation (SemEval), 2020
Rajaswa Patil
V. Baths
31 May 2020
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Hanrui Wang
Zhanghao Wu
Zhijian Liu
Han Cai
Ligeng Zhu
Chuang Gan
Song Han
28 May 2020
Unsupervised Quality Estimation for Neural Machine Translation
M. Fomicheva
Shuo Sun
Lisa Yankovskaya
Frédéric Blain
Francisco Guzmán
Mark Fishel
Nikolaos Aletras
Vishrav Chaudhary
Lucia Specia
UQLM
21 May 2020
Enhancing Monotonic Multihead Attention for Streaming ASR
Hirofumi Inaguma
Masato Mimura
Tatsuya Kawahara
19 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
15 May 2020
A Mixture of $h-1$ Heads is Better than $h$ Heads
Hao Peng
Roy Schwartz
Dianqi Li
Noah A. Smith
MoE
13 May 2020
The Unstoppable Rise of Computational Linguistics in Deep Learning
James Henderson
AI4CE
13 May 2020
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Christopher Brix
Parnia Bahar
Hermann Ney
04 May 2020
Similarity Analysis of Contextual Word Representation Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
John M. Wu
Yonatan Belinkov
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
James R. Glass
03 May 2020
Hard-Coded Gaussian Attention for Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Weiqiu You
Simeng Sun
Mohit Iyyer
02 May 2020
When BERT Plays the Lottery, All Tickets Are Winning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Sai Prasanna
Anna Rogers
Anna Rumshisky
MILM
01 May 2020
Does Data Augmentation Improve Generalization in NLP?
Rohan Jha
Charles Lovering
Ellie Pavlick
30 Apr 2020
How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Nicola De Cao
Michael Schlichtkrull
Wilker Aziz
Ivan Titov
30 Apr 2020
Character-Level Translation with Self-attention
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Yingqiang Gao
Nikola I. Nikolov
Yuhuang Hu
Richard H. R. Hahnloser
30 Apr 2020
Universal Dependencies according to BERT: both more specific and more general
Findings, 2020
Tomasz Limisiewicz
Rudolf Rosa
David Mareček
30 Apr 2020
What Happens To BERT Embeddings During Fine-tuning?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2020
Amil Merchant
Elahe Rahimtoroghi
Ellie Pavlick
Ian Tenney
29 Apr 2020
Scheduled DropHead: A Regularization Method for Transformer Models
Findings, 2020
Wangchunshu Zhou
Tao Ge
Ke Xu
Furu Wei
Ming Zhou
28 Apr 2020
Assessing the Bilingual Knowledge Learned by Neural Machine Translation Models
Shilin He
Xing Wang
Shuming Shi
Michael R. Lyu
Zhaopeng Tu
28 Apr 2020
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Ji Xin
Raphael Tang
Jaejun Lee
Yaoliang Yu
Jimmy J. Lin
27 Apr 2020
On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
Findings, 2020
Biao Zhang
Ivan Titov
Rico Sennrich
24 Apr 2020
The Right Tool for the Job: Matching Model and Instance Complexities
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Roy Schwartz
Gabriel Stanovsky
Swabha Swayamdipta
Jesse Dodge
Noah A. Smith
16 Apr 2020
Relation Transformer Network
Rajat Koner
Poulami Sinhamahapatra
Volker Tresp
ViT
13 Apr 2020
Telling BERT's full story: from Local Attention to Global Aggregation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2020
Damian Pascual
Gino Brunner
Roger Wattenhofer
10 Apr 2020
DynaBERT: Dynamic BERT with Adaptive Width and Depth
Neural Information Processing Systems (NeurIPS), 2020
Lu Hou
Zhiqi Huang
Lifeng Shang
Xin Jiang
Xiao Chen
Qun Liu
MQ
08 Apr 2020
On the Effect of Dropping Layers of Pre-trained Transformer Models
Computer Speech and Language (CSL), 2020
Hassan Sajjad
Fahim Dalvi
Nadir Durrani
Preslav Nakov
08 Apr 2020
Understanding Learning Dynamics for Neural Machine Translation
Conghui Zhu
Guanlin Li
Lemao Liu
Tiejun Zhao
Shuming Shi
05 Apr 2020
Information-Theoretic Probing with Minimum Description Length
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Elena Voita
Ivan Titov
27 Mar 2020
Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers
Hongfei Xu
Josef van Genabith
Qiuhui Liu
Deyi Xiong
21 Mar 2020
Pre-trained Models for Natural Language Processing: A Survey
Science China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MAVLM
18 Mar 2020
A Primer in BERTology: What we know about how BERT works
Transactions of the Association for Computational Linguistics (TACL), 2020
Anna Rogers
Olga Kovaleva
Anna Rumshisky
OffRL
27 Feb 2020
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Zhuohan Li
Eric Wallace
Sheng Shen
Kevin Lin
Kurt Keutzer
Dan Klein
Joseph E. Gonzalez
26 Feb 2020
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Findings, 2020
Alessandro Raganato
Yves Scherrer
Jörg Tiedemann
24 Feb 2020
Page 14 of 15