Do Transformers Need Deep Long-Range Memory

7 July 2020
Jack W. Rae, Ali Razavi
RALM

Papers citing "Do Transformers Need Deep Long-Range Memory"

Showing all 28 citing papers.

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
Luanbo Wan, Weizhi Ma
LLMAG, KELM
16 Jun 2025

What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, Jing Zhang
MDE
28 May 2025

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Erik Schultheis, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh
LRM
08 Apr 2025

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu
ODL
24 Jun 2024

Are queries and keys always relevant? A case study on Transformer wave functions
Riccardo Rende, Luciano Loris Viteritti
29 May 2024

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Shaoduo Gan
07 Apr 2024

Masked Audio Generation using a Single Non-Autoregressive Transformer
International Conference on Learning Representations (ICLR), 2024
Alon Ziv, Itai Gat, Gaël Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
09 Jan 2024

Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu
14 Dec 2023

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents
Jake Grigsby, Linxi Fan, Yuke Zhu
OffRL, LM&Ro
15 Oct 2023

Long-range Language Modeling with Self-retrieval
Transactions of the Association for Computational Linguistics (TACL), 2023
Ohad Rubin, Jonathan Berant
RALM, KELM
23 Jun 2023

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models
International Joint Conference on Artificial Intelligence (IJCAI), 2023
Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong Bao, Ming Cheng, Long Xiao
08 May 2023

Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge
LRM
05 May 2023

What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement
Neural Information Processing Systems (NeurIPS), 2023
Yotam Alexander, Nimrod De La Vega, Noam Razin, Nadav Cohen
20 Mar 2023

Dissociating language and thought in large language models
Kyle Mahowald, Anna A. Ivanova, I. Blank, Nancy Kanwisher, J. Tenenbaum, Evelina Fedorenko
ELM, ReLM
16 Jan 2023

iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer
Jooyeol Yun, Sanghyeon Lee, Minho Park, Jaegul Choo
ViT
14 Jul 2022

Embedding Recycling for Language Models
Findings, 2022
Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey
KELM
11 Jul 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Neural Information Processing Systems (NeurIPS), 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
VLM
27 May 2022

The NLP Task Effectiveness of Long-Range Transformers
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Guanghui Qin, Yukun Feng, Benjamin Van Durme
16 Feb 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chao-Yuan Wu, Yanghao Li, K. Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
ViT
20 Jan 2022

Well Googled is Half Done: Multimodal Forecasting of New Fashion Product Sales with Image-based Google Trends
Geri Skenderi, Christian Joppi, Matteo Denitto, Marco Cristani
AI4TS
20 Sep 2021

Do Long-Range Language Models Actually Use Long-Range Context?
Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer
RALM
19 Sep 2021

Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN
Rahma Chaabouni, Roberto Dessì, Eugene Kharitonov
03 Jul 2021

EchoFilter: End-to-End Neural Network for Acoustic Echo Cancellation
Lu Ma, Song Yang, Y. Gong, Xintian Wang, Zhongqin Wu
31 May 2021

Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay, Mostafa Dehghani, Samira Abnar, Songlin Yang, Dara Bahri, Philip Pham, J. Rao, Liu Yang, Sebastian Ruder, Donald Metzler
08 Nov 2020

Sparsifying Transformer Models with Trainable Representation Pooling
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Michal Pietruszka, Łukasz Borchmann, Łukasz Garncarek
10 Sep 2020

Neural Language Generation: Formulation, Methods, and Evaluation
Cristina Garbacea, Qiaozhu Mei
31 Jul 2020

Efficient Content-Based Sparse Attention with Routing Transformers
Transactions of the Association for Computational Linguistics (TACL), 2020
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier
MoE
12 Mar 2020

Frustratingly Short Attention Spans in Neural Language Modeling
International Conference on Learning Representations (ICLR), 2017
Michal Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel
15 Feb 2017