Adaptive Attention Span in Transformers

Annual Meeting of the Association for Computational Linguistics (ACL), 2019
19 May 2019
Sainbayar Sukhbaatar
Edouard Grave
Piotr Bojanowski
Armand Joulin
ArXiv (abs) | PDF | HTML

Papers citing "Adaptive Attention Span in Transformers"

50 / 201 papers shown
Sparse Meta Networks for Sequential Adaptation and its Application to Adaptive Language Modelling
Tsendsuren Munkhdalai
03 Sep 2020
HiPPO: Recurrent Memory with Optimal Polynomial Projections
Albert Gu
Tri Dao
Stefano Ermon
Atri Rudra
Christopher Ré
17 Aug 2020
Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size
Davis Yoshida
Allyson Ettinger
Kevin Gimpel
16 Aug 2020
Big Bird: Transformers for Longer Sequences
Neural Information Processing Systems (NeurIPS), 2020
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
28 Jul 2020
Spatially Aware Multimodal Transformers for TextVQA
European Conference on Computer Vision (ECCV), 2020
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
23 Jul 2020
Conformer-Kernel with Query Term Independence for Document Retrieval
Bhaskar Mitra
Sebastian Hofstätter
Hamed Zamani
Nick Craswell
20 Jul 2020
Fast Transformers with Clustered Attention
Neural Information Processing Systems (NeurIPS), 2020
Apoorv Vyas
Angelos Katharopoulos
François Fleuret
09 Jul 2020
Do Transformers Need Deep Long-Range Memory?
Jack W. Rae
Ali Razavi
07 Jul 2020
Data Movement Is All You Need: A Case Study on Optimizing Transformers
A. Ivanov
Nikoli Dryden
Tal Ben-Nun
Shigang Li
Torsten Hoefler
30 Jun 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos
Apoorv Vyas
Nikolaos Pappas
François Fleuret
29 Jun 2020
Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
Stephen Roller
Y-Lan Boureau
Jason Weston
Antoine Bordes
Emily Dinan
...
Kurt Shuster
Eric Michael Smith
Arthur Szlam
Jack Urbanek
Mary Williamson
22 Jun 2020
Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
Tsung-Han Wu
Chun-Chen Hsieh
Yen-Hao Chen
Po-Han Chi
Hung-yi Lee
09 Jun 2020
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Chulhee Yun
Yin-Wen Chang
Srinadh Bhojanapalli
A. S. Rawat
Sashank J. Reddi
Sanjiv Kumar
08 Jun 2020
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Hanrui Wang
Zhanghao Wu
Zhijian Liu
Han Cai
Ligeng Zhu
Chuang Gan
Song Han
28 May 2020
Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava
15 May 2020
A Mixture of $h-1$ Heads is Better than $h$ Heads
Hao Peng
Roy Schwartz
Dianqi Li
Noah A. Smith
13 May 2020
Multi-scale Transformer Language Models
Sandeep Subramanian
R. Collobert
Marc'Aurelio Ranzato
Y-Lan Boureau
01 May 2020
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
International Conference on Information and Knowledge Management (CIKM), 2020
Liu Yang
Mingyang Zhang
Cheng Li
Michael Bendersky
Marc Najork
26 Apr 2020
Lite Transformer with Long-Short Range Attention
International Conference on Learning Representations (ICLR), 2020
Zhanghao Wu
Zhijian Liu
Ji Lin
Chengyue Wu
Song Han
24 Apr 2020
On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
Findings, 2020
Biao Zhang
Ivan Titov
Rico Sennrich
24 Apr 2020
Vector Quantized Contrastive Predictive Coding for Template-based Music Generation
Gaëtan Hadjeres
Léopold Crestel
21 Apr 2020
Adaptive Attention Span in Computer Vision
Jerrod Parker
Shakti Kumar
Joe Roussy
18 Apr 2020
ETC: Encoding Long and Structured Inputs in Transformers
Joshua Ainslie
Santiago Ontanon
Chris Alberti
Vaclav Cvicek
Zachary Kenneth Fisher
Philip Pham
Anirudh Ravula
Sumit Sanghai
Qifan Wang
Li Yang
17 Apr 2020
Training with Quantization Noise for Extreme Model Compression
International Conference on Learning Representations (ICLR), 2020
Angela Fan
Pierre Stock
Benjamin Graham
Edouard Grave
Rémi Gribonval
Armand Joulin
15 Apr 2020
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
10 Apr 2020
Adaptive Transformers in RL
Shakti Kumar
Jerrod Parker
Panteha Naderian
08 Apr 2020
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection
Neural Information Processing Systems (NeurIPS), 2020
Xiaoya Li
Yuxian Meng
Mingxin Zhou
Qinghong Han
Leilei Gan
Jiwei Li
22 Mar 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Transactions of the Association for Computational Linguistics (TACL), 2020
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
12 Mar 2020
Meta-Embeddings Based On Self-Attention
Qichen Li
Xiaoke Jiang
Jun Xia
Jian Li
03 Mar 2020
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Findings, 2020
Alessandro Raganato
Yves Scherrer
Jörg Tiedemann
24 Feb 2020
Addressing Some Limitations of Transformers with Feedback Memory
Angela Fan
Thibaut Lavril
Edouard Grave
Armand Joulin
Sainbayar Sukhbaatar
21 Feb 2020
Reformer: The Efficient Transformer
International Conference on Learning Representations (ICLR), 2020
Nikita Kitaev
Lukasz Kaiser
Anselm Levskaya
13 Jan 2020
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
Guangxiang Zhao
Junyang Lin
Zhiyuan Zhang
Xuancheng Ren
Qi Su
Xu Sun
25 Dec 2019
Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
Computer Vision and Pattern Recognition (CVPR), 2019
Giannis Daras
Augustus Odena
Han Zhang
A. Dimakis
27 Nov 2019
Single Headed Attention RNN: Stop Thinking With Your Head
Stephen Merity
26 Nov 2019
Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
IEEE Access, 2019
Seonwoo Min
Seunghyun Park
Siwon Kim
Hyun-Soo Choi
Byunghan Lee
Sungroh Yoon
25 Nov 2019
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning
Guangxiang Zhao
Xu Sun
Jingjing Xu
Zhiyuan Zhang
Liangchen Luo
17 Nov 2019
Compressive Transformers for Long-Range Sequence Modelling
International Conference on Learning Representations (ICLR), 2019
Jack W. Rae
Anna Potapenko
Siddhant M. Jayakumar
Timothy Lillicrap
13 Nov 2019
BP-Transformer: Modelling Long-Range Context via Binary Partitioning
Zihao Ye
Qipeng Guo
Quan Gan
Xipeng Qiu
Zheng Zhang
11 Nov 2019
Two-Headed Monster And Crossed Co-Attention Networks
Yaoyiran Li
Jing Jiang
10 Nov 2019
Location Attention for Extrapolation to Longer Sequences
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Yann Dubois
Gautier Dagan
Dieuwke Hupkes
Elia Bruni
10 Nov 2019
Improving Transformer Models by Reordering their Sublayers
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Ofir Press
Noah A. Smith
Omer Levy
10 Nov 2019
Blockwise Self-Attention for Long Document Understanding
Findings, 2019
J. Qiu
Hao Ma
Omer Levy
Scott Yih
Sinong Wang
Jie Tang
07 Nov 2019
Using Local Knowledge Graph Construction to Scale Seq2Seq Models to Multi-Document Inputs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Angela Fan
Claire Gardent
Chloé Braud
Antoine Bordes
18 Oct 2019
When and Why is Document-level Context Useful in Neural Machine Translation?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Yunsu Kim
Thanh-Hai Tran
Hermann Ney
01 Oct 2019
Reducing Transformer Depth on Demand with Structured Dropout
International Conference on Learning Representations (ICLR), 2019
Angela Fan
Edouard Grave
Armand Joulin
25 Sep 2019
Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Jie Hao
Xing Wang
Shuming Shi
Jinfeng Zhang
Zhaopeng Tu
04 Sep 2019
Self-Attention with Structural Position Representations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Xing Wang
Zhaopeng Tu
Longyue Wang
Shuming Shi
01 Sep 2019
Adaptively Sparse Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Gonçalo M. Correia
Vlad Niculae
André F. T. Martins
30 Aug 2019
Augmenting Self-attention with Persistent Memory
Sainbayar Sukhbaatar
Edouard Grave
Guillaume Lample
Armand Joulin
02 Jul 2019
Page 4 of 5