Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Annual Meeting of the Association for Computational Linguistics (ACL), 2019
23 May 2019
Elena Voita
David Talbot
F. Moiseev
Rico Sennrich
Ivan Titov
arXiv: 1905.09418 (abs · PDF · HTML)

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

41 of 741 citing papers shown (page 15 of 15)

Controlling Computation versus Quality for Neural Sequence Models
Ankur Bapna, N. Arivazhagan, Orhan Firat
17 Feb 2020

Low-Rank Bottleneck in Multi-head Attention Models
International Conference on Machine Learning (ICML), 2020
Srinadh Bhojanapalli, Chulhee Yun, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar
17 Feb 2020

Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
International Conference on Learning Representations (ICLR), 2020
Taeuk Kim, Jihun Choi, Daniel Edmiston, Sang-goo Lee
30 Jan 2020

Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs
Transactions of the Association for Computational Linguistics (TACL), 2020
Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, Iryna Gurevych
29 Jan 2020

SANST: A Self-Attentive Network for Next Point-of-Interest Recommendation
Qi Guo, Jianzhong Qi
22 Jan 2020

Block-wise Dynamic Sparseness
Pattern Recognition Letters (Pattern Recognit. Lett.), 2020
Amir Hadifar, Johannes Deleu, Chris Develder, Thomas Demeester
14 Jan 2020

AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
International Joint Conference on Artificial Intelligence (IJCAI), 2020
Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Yanjie Liang, Jialin Li, Jingren Zhou
13 Jan 2020

Cross-Lingual Ability of Multilingual BERT: An Empirical Study
International Conference on Learning Representations (ICLR), 2019
Karthikeyan K, Zihan Wang, Stephen D. Mayhew, Dan Roth
17 Dec 2019

WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
J. Tian, A. Kreuzer, Pai-Hung Chen, Hans-Martin Will
13 Dec 2019

TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP
Nils Rethmeier, V. Saxena, Isabelle Augenstein
02 Dec 2019

Do Attention Heads in BERT Track Syntactic Dependencies?
Phu Mon Htut, Jason Phang, Shikha Bordia, Samuel R. Bowman
27 Nov 2019

Graph Transformer for Graph-to-Sequence Learning
AAAI Conference on Artificial Intelligence (AAAI), 2019
Deng Cai, W. Lam
18 Nov 2019

What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
Timothee Mickus, Denis Paperno, Mathieu Constant, Kees van Deemter
13 Nov 2019

Understanding Multi-Head Attention in Abstractive Summarization
Joris Baan, Maartje ter Hoeve, M. V. D. Wees, Anne Schuth, Maarten de Rijke
10 Nov 2019

Blockwise Self-Attention for Long Document Understanding
Findings, 2019
J. Qiu, Hao Ma, Omer Levy, Scott Yih, Sinong Wang, Jie Tang
07 Nov 2019

Efficiency through Auto-Sizing: Notre Dame NLP's Submission to the WNGT 2019 Efficiency Task
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Kenton W. Murray, Brian DuSell, David Chiang
16 Oct 2019

Structured Pruning of a BERT-based Question Answering Model
J. Scott McCarley, Rishav Chakravarti, Avirup Sil
14 Oct 2019

exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models
Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann
11 Oct 2019

Structured Pruning of Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Ziheng Wang, Jeremy Wohlwend, Tao Lei
10 Oct 2019

Reducing Transformer Depth on Demand with Structured Dropout
International Conference on Learning Representations (ICLR), 2019
Angela Fan, Edouard Grave, Armand Joulin
25 Sep 2019

TinyBERT: Distilling BERT for Natural Language Understanding
Findings, 2019
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, F. Wang, Qun Liu
23 Sep 2019

SANVis: Visual Analytics for Understanding Self-Attention Networks
Visual .. (VISUAL), 2019
Cheonbok Park, Inyoup Na, Yongjang Jo, Sungbok Shin, J. Yoo, Bum Chul Kwon, Jian Zhao, Hyungjong Noh, Yeonsoo Lee, Jaegul Choo
13 Sep 2019

Multi-Granularity Self-Attention for Neural Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Jie Hao, Xing Wang, Shuming Shi, Jinfeng Zhang, Zhaopeng Tu
05 Sep 2019

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Elena Voita, Rico Sennrich, Ivan Titov
03 Sep 2019

Rotate King to get Queen: Word Relationships as Orthogonal Transformations in Embedding Space
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Kawin Ethayarajh
02 Sep 2019

Improving Multi-Head Attention with Capsule Networks
Natural Language Processing and Chinese Computing (NLPCC), 2019
Shuhao Gu, Yang Feng
31 Aug 2019

Adaptively Sparse Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Gonçalo M. Correia, Vlad Niculae, André F. T. Martins
30 Aug 2019

Encoders Help You Disambiguate Word Senses in Neural Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Gongbo Tang, Rico Sennrich, Joakim Nivre
30 Aug 2019

Revealing the Dark Secrets of BERT
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Olga Kovaleva, Alexey Romanov, Anna Rogers, Anna Rumshisky
21 Aug 2019

On Identifiability in Transformers
International Conference on Learning Representations (ICLR), 2019
Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer
12 Aug 2019

VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
09 Aug 2019

Is artificial data useful for biomedical Natural Language Processing algorithms?
Zixu Wang, Julia Ive, S. Velupillai, Lucia Specia
01 Jul 2019

Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Joris Baan, Maartje ter Hoeve, M. V. D. Wees, Anne Schuth, Maarten de Rijke
01 Jul 2019

Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings
Interspeech, 2019
Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
29 Jun 2019

Theoretical Limitations of Self-Attention in Neural Sequence Models
Transactions of the Association for Computational Linguistics (TACL), 2019
Michael Hahn
16 Jun 2019

A Multiscale Visualization of Attention in the Transformer Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Jesse Vig
12 Jun 2019

What Does BERT Look At? An Analysis of BERT's Attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning
11 Jun 2019

Analyzing the Structure of Attention in a Transformer Language Model
Jesse Vig, Yonatan Belinkov
07 Jun 2019

Are Sixteen Heads Really Better than One?
Neural Information Processing Systems (NeurIPS), 2019
Paul Michel, Omer Levy, Graham Neubig
25 May 2019

An Attentive Survey of Attention Models
S. Chaudhari, Varun Mithal, Gungor Polatkan, R. Ramanath
05 Apr 2019

Attention in Natural Language Processing
Andrea Galassi, Marco Lippi, Paolo Torroni
04 Feb 2019