Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

23 May 2019
Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

19 of 169 citing papers shown.
Time-based Sequence Model for Personalization and Recommendation Systems
T. Ishkhanov, Maxim Naumov, Xianjie Chen, Yan Zhu, Yuan Zhong, A. Azzolini, Chonglin Sun, Frank Jiang, Andrey Malevich, Liang Xiong
27 Aug 2020

BERTology Meets Biology: Interpreting Attention in Protein Language Models
Jesse Vig, Ali Madani, L. Varshney, Caiming Xiong, R. Socher, Nazneen Rajani
26 Jun 2020

On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra, Arkil Patel, Navin Goyal
16 Jun 2020

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy J. Lin
27 Apr 2020

The Right Tool for the Job: Matching Model and Instance Complexities
Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith
16 Apr 2020

Information-Theoretic Probing with Minimum Description Length
Elena Voita, Ivan Titov
27 Mar 2020

Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang
LM&MA · VLM · 18 Mar 2020

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Alessandro Raganato, Yves Scherrer, Jörg Tiedemann
24 Feb 2020

Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
Taeuk Kim, Jihun Choi, Daniel Edmiston, Sang-goo Lee
30 Jan 2020

Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Karthikeyan K, Zihan Wang, Stephen D. Mayhew, Dan Roth
LRM · 17 Dec 2019

Graph Transformer for Graph-to-Sequence Learning
Deng Cai, W. Lam
18 Nov 2019

What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
Timothee Mickus, Denis Paperno, Mathieu Constant, Kees van Deemter
13 Nov 2019

Reducing Transformer Depth on Demand with Structured Dropout
Angela Fan, Edouard Grave, Armand Joulin
25 Sep 2019

SANVis: Visual Analytics for Understanding Self-Attention Networks
Cheonbok Park, Inyoup Na, Yongjang Jo, Sungbok Shin, J. Yoo, Bum Chul Kwon, Jian Zhao, Hyungjong Noh, Yeonsoo Lee, Jaegul Choo
HAI · 13 Sep 2019

On Identifiability in Transformers
Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer
ViT · 12 Aug 2019

VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
VLM · 09 Aug 2019

Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings
Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
29 Jun 2019

What Does BERT Look At? An Analysis of BERT's Attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning
MILM · 11 Jun 2019

Analyzing the Structure of Attention in a Transformer Language Model
Jesse Vig, Yonatan Belinkov
07 Jun 2019