arXiv:1905.09418
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Annual Meeting of the Association for Computational Linguistics (ACL), 2019 · 23 May 2019
Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov
Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (50 of 741 papers shown)
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, Changyou Chen
AAML · 20 Sep 2020

Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2020
Rajiv Movva, Jason Zhao
17 Sep 2020

Efficient Transformers: A Survey
ACM Computing Surveys (ACM CSUR), 2020
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
VLM · 14 Sep 2020

Time-based Sequence Model for Personalization and Recommendation Systems
T. Ishkhanov, Maxim Naumov, Xianjie Chen, Yan Zhu, Yuan Zhong, A. Azzolini, Chonglin Sun, Frank Jiang, Andrey Malevich, Liang Xiong
27 Aug 2020

Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction
ACM Conference on Recommender Systems (RecSys), 2020
Darius Afchar, Romain Hennequin
FAtt, XAI · 26 Aug 2020

TSAM: Temporal Link Prediction in Directed Networks based on Self-Attention Mechanism
Jinsong Li, Jianhua Peng, Shuxin Liu, Lintianran Weng, Cong Li
23 Aug 2020

On the Importance of Local Information in Transformer Based Models
Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra
13 Aug 2020

Compression of Deep Learning Models for Text: A Survey
ACM Transactions on Knowledge Discovery from Data (TKDD), 2020
Manish Gupta, Puneet Agrawal
VLM, MedIm, AI4CE · 12 Aug 2020

DeLighT: Deep and Light-weight Transformer
Sachin Mehta, Marjan Ghazvininejad, Srini Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi
VLM · 03 Aug 2020

Spatially Aware Multimodal Transformers for TextVQA
European Conference on Computer Vision (ECCV), 2020
Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal
23 Jul 2020

Data Movement Is All You Need: A Case Study on Optimizing Transformers
A. Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
30 Jun 2020

Multi-Head Attention: Collaborate Instead of Concatenate
Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
29 Jun 2020

BERTology Meets Biology: Interpreting Attention in Protein Language Models
Jesse Vig, Ali Madani, Lav Varshney, Caiming Xiong, R. Socher, Nazneen Rajani
26 Jun 2020

A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
Grégoire Mialon, Dexiong Chen, Alexandre d’Aspremont, Julien Mairal
OT · 22 Jun 2020

On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra, Arkil Patel, Navin Goyal
16 Jun 2020

Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
Corentin Kervadec, G. Antipov, M. Baccouche, Christian Wolf
OOD · 09 Jun 2020

BERT Loses Patience: Fast and Robust Inference with Early Exit
Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei
07 Jun 2020

Distilling Neural Networks for Greener and Faster Dependency Parsing
International Workshop/Conference on Parsing Technologies (IWPT), 2020
Mark Anderson, Carlos Gómez-Rodríguez
01 Jun 2020

CNRL at SemEval-2020 Task 5: Modelling Causal Reasoning in Language with Multi-Head Self-Attention Weights based Counterfactual Detection
International Workshop on Semantic Evaluation (SemEval), 2020
Rajaswa Patil, V. Baths
31 May 2020

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han
28 May 2020

Unsupervised Quality Estimation for Neural Machine Translation
M. Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, Lucia Specia
UQLM · 21 May 2020

Enhancing Monotonic Multihead Attention for Streaming ASR
Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara
19 May 2020

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
VLM · 15 May 2020

A Mixture of h-1 Heads is Better than h Heads
Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith
MoE · 13 May 2020

The Unstoppable Rise of Computational Linguistics in Deep Learning
James Henderson
AI4CE · 13 May 2020

Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Christopher Brix, Parnia Bahar, Hermann Ney
04 May 2020

Similarity Analysis of Contextual Word Representation Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
John M. Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, James R. Glass
03 May 2020

Hard-Coded Gaussian Attention for Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Weiqiu You, Simeng Sun, Mohit Iyyer
02 May 2020

When BERT Plays the Lottery, All Tickets Are Winning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Sai Prasanna, Anna Rogers, Anna Rumshisky
MILM · 01 May 2020

Does Data Augmentation Improve Generalization in NLP?
Rohan Jha, Charles Lovering, Ellie Pavlick
30 Apr 2020

How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Nicola De Cao, Michael Schlichtkrull, Wilker Aziz, Ivan Titov
30 Apr 2020

Character-Level Translation with Self-attention
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Yingqiang Gao, Nikola I. Nikolov, Yuhuang Hu, Richard H. R. Hahnloser
30 Apr 2020

Universal Dependencies according to BERT: both more specific and more general
Findings, 2020
Tomasz Limisiewicz, Rudolf Rosa, David Mareček
30 Apr 2020

What Happens To BERT Embeddings During Fine-tuning?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2020
Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, Ian Tenney
29 Apr 2020

Scheduled DropHead: A Regularization Method for Transformer Models
Findings, 2020
Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou
28 Apr 2020

Assessing the Bilingual Knowledge Learned by Neural Machine Translation Models
Shilin He, Xing Wang, Shuming Shi, Michael R. Lyu, Zhaopeng Tu
28 Apr 2020

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy J. Lin
27 Apr 2020

On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
Findings, 2020
Biao Zhang, Ivan Titov, Rico Sennrich
24 Apr 2020

The Right Tool for the Job: Matching Model and Instance Complexities
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith
16 Apr 2020

Relation Transformer Network
Rajat Koner, Poulami Sinhamahapatra, Volker Tresp
ViT · 13 Apr 2020

Telling BERT's full story: from Local Attention to Global Aggregation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2020
Damian Pascual, Gino Brunner, Roger Wattenhofer
10 Apr 2020

DynaBERT: Dynamic BERT with Adaptive Width and Depth
Neural Information Processing Systems (NeurIPS), 2020
Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
MQ · 08 Apr 2020

On the Effect of Dropping Layers of Pre-trained Transformer Models
Computer Speech and Language (CSL), 2020
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov
08 Apr 2020

Understanding Learning Dynamics for Neural Machine Translation
Conghui Zhu, Guanlin Li, Lemao Liu, Tiejun Zhao, Shuming Shi
05 Apr 2020

Information-Theoretic Probing with Minimum Description Length
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Elena Voita, Ivan Titov
27 Mar 2020

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers
Hongfei Xu, Josef van Genabith, Qiuhui Liu, Deyi Xiong
21 Mar 2020

Pre-trained Models for Natural Language Processing: A Survey
Science China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang
LM&MA, VLM · 18 Mar 2020

A Primer in BERTology: What we know about how BERT works
Transactions of the Association for Computational Linguistics (TACL), 2020
Anna Rogers, Olga Kovaleva, Anna Rumshisky
OffRL · 27 Feb 2020

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
26 Feb 2020

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Findings, 2020
Alessandro Raganato, Yves Scherrer, Jörg Tiedemann
24 Feb 2020
Page 14 of 15