
Investigating Recurrent Transformers with Dynamic Halt

1 February 2024
Jishnu Ray Chowdhury
Cornelia Caragea
arXiv: 2402.00976 (abs / PDF / HTML)

Papers citing "Investigating Recurrent Transformers with Dynamic Halt"

50 / 87 papers shown
Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization
Mohammad Mahdi Samiei Paqaleh
Arash Marioriyad
Arman Tahmasebi-Zadeh
Mohamadreza Fereydooni
Mahdi Ghaznavai
Mahdieh Soleymani Baghshah
120
0
0
06 Oct 2025
A Transformer with Stack Attention
Jiaoda Li
Jennifer C. White
Mrinmaya Sachan
Robert Bamler
236
4
0
07 May 2024
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
412
17
0
14 Apr 2024
The Illusion of State in State-Space Models
William Merrill
Jackson Petty
Ashish Sabharwal
413
119
0
12 Apr 2024
HGRN2: Gated Linear RNNs with State Expansion
Zhen Qin
Aaron Courville
Weixuan Sun
Xuyang Shen
Dong Li
Weigao Sun
Yiran Zhong
LRM
368
85
0
11 Apr 2024
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber
Barak Lenz
Hofit Bata
Gal Cohen
Jhonathan Osin
...
Nir Ratner
N. Rozen
Erez Shwartz
Mor Zusman
Y. Shoham
424
329
0
28 Mar 2024
Gated Linear Attention Transformers with Hardware-Efficient Training
Aaron Courville
Bailin Wang
Songlin Yang
Yikang Shen
Yoon Kim
443
300
0
11 Dec 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu
Tri Dao
Mamba
559
5,168
0
01 Dec 2023
On the Long Range Abilities of Transformers
Itamar Zimerman
Lior Wolf
250
11
0
28 Nov 2023
Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Zhen Qin
Aaron Courville
Yiran Zhong
196
117
0
08 Nov 2023
Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability
Jishnu Ray Chowdhury
Cornelia Caragea
227
5
0
08 Nov 2023
What Algorithms can Transformers Learn? A Study in Length Generalization
International Conference on Learning Representations (ICLR), 2023
Hattie Zhou
Arwen Bradley
Etai Littwin
Noam Razin
Omid Saremi
Josh Susskind
Samy Bengio
Preetum Nakkiran
289
160
0
24 Oct 2023
The Expressive Power of Transformers with Chain of Thought
William Merrill
Ashish Sabharwal
LRM, AI4CE, ReLM
531
41
0
11 Oct 2023
Sparse Universal Transformer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Shawn Tan
Songlin Yang
Zhenfang Chen
Aaron Courville
Chuang Gan
MoE
260
24
0
11 Oct 2023
Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
International Conference on Learning Representations (ICLR), 2023
Brian DuSell
David Chiang
390
15
0
03 Oct 2023
Efficient Beam Tree Recursion
Neural Information Processing Systems (NeurIPS), 2023
Jishnu Ray Chowdhury
Cornelia Caragea
352
3
0
20 Jul 2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
International Conference on Learning Representations (ICLR), 2023
Tri Dao
LRM
429
2,050
0
17 Jul 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
779
508
0
17 Jul 2023
Sparse Modular Activation for Efficient Sequence Modeling
Neural Information Processing Systems (NeurIPS), 2023
Liliang Ren
Yang Liu
Shuohang Wang
Yichong Xu
Chenguang Zhu
Chengxiang Zhai
275
17
0
19 Jun 2023
Block-State Transformers
Neural Information Processing Systems (NeurIPS), 2023
Mahan Fathi
Jonathan Pilault
Orhan Firat
C. Pal
Pierre-Luc Bacon
Ross Goroshin
240
25
0
15 Jun 2023
Exposing Attention Glitches with Flip-Flop Language Modeling
Neural Information Processing Systems (NeurIPS), 2023
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
LRM
206
70
0
01 Jun 2023
Beam Tree Recursive Cells
International Conference on Machine Learning (ICML), 2023
Jishnu Ray Chowdhury
Cornelia Caragea
385
6
0
31 May 2023
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Neural Information Processing Systems (NeurIPS), 2023
Guhao Feng
Bohang Zhang
Yuntian Gu
Haotian Ye
Di He
Liwei Wang
LRM
649
354
0
24 May 2023
RWKV: Reinventing RNNs for the Transformer Era
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Bo Peng
Eric Alcaide
Quentin G. Anthony
Alon Albalak
Samuel Arcadinho
...
Qihang Zhao
P. Zhou
Qinghua Zhou
Jian Zhu
Rui-Jie Zhu
578
845
0
22 May 2023
Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ta-Chung Chi
Ting-Han Fan
Alexander I. Rudnicky
Peter J. Ramadge
LRM
153
15
0
05 May 2023
CoLT5: Faster Long-Range Transformers with Conditional Computation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Joshua Ainslie
Tao Lei
Michiel de Jong
Santiago Ontañón
Siddhartha Brahma
...
Mandy Guo
James Lee-Thorp
Yi Tay
Yun-hsuan Sung
Sumit Sanghai
LLMAG
209
89
0
17 Mar 2023
Resurrecting Recurrent Neural Networks for Long Sequences
International Conference on Machine Learning (ICML), 2023
Antonio Orvieto
Samuel L. Smith
Albert Gu
Anushan Fernando
Çağlar Gülçehre
Razvan Pascanu
Soham De
497
418
0
11 Mar 2023
Modular Deep Learning
Jonas Pfeiffer
Sebastian Ruder
Ivan Vulić
Edoardo Ponti
MoMe, OOD
437
103
0
22 Feb 2023
Adaptive Computation with Elastic Input Sequence
International Conference on Machine Learning (ICML), 2023
Fuzhao Xue
Valerii Likhosherstov
Anurag Arnab
N. Houlsby
Mostafa Dehghani
Yang You
241
27
0
30 Jan 2023
A Length-Extrapolatable Transformer
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Yutao Sun
Li Dong
Barun Patra
Shuming Ma
Shaohan Huang
Alon Benhaim
Vishrav Chaudhary
Xia Song
Furu Wei
316
154
0
20 Dec 2022
Towards Reasoning in Large Language Models: A Survey
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Jie Huang
Kevin Chen-Chuan Chang
LM&MA, ELM, LRM
980
805
0
20 Dec 2022
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
S. Bhattamishra
Arkil Patel
Varun Kanade
Phil Blunsom
456
62
0
22 Nov 2022
Transformers Learn Shortcuts to Automata
International Conference on Learning Representations (ICLR), 2022
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
OffRL, LRM
499
222
0
19 Oct 2022
Neural Attentive Circuits
Neural Information Processing Systems (NeurIPS), 2022
Nasim Rahaman
M. Weiß
Francesco Locatello
C. Pal
Yoshua Bengio
Bernhard Schölkopf
Erran L. Li
Nicolas Ballas
279
8
0
14 Oct 2022
Mega: Moving Average Equipped Gated Attention
International Conference on Learning Representations (ICLR), 2022
Xuezhe Ma
Chunting Zhou
Xiang Kong
Junxian He
Liangke Gui
Graham Neubig
Jonathan May
Luke Zettlemoyer
324
216
0
21 Sep 2022
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yi Tay
Mostafa Dehghani
Samira Abnar
Hyung Won Chung
W. Fedus
J. Rao
Sharan Narang
Vinh Q. Tran
Dani Yogatama
Donald Metzler
AI4CE
235
121
0
21 Jul 2022
Confident Adaptive Language Modeling
Neural Information Processing Systems (NeurIPS), 2022
Tal Schuster
Adam Fisch
Jai Gupta
Mostafa Dehghani
Dara Bahri
Vinh Q. Tran
Yi Tay
Donald Metzler
750
221
0
14 Jul 2022
Recurrent Memory Transformer
Neural Information Processing Systems (NeurIPS), 2022
Aydar Bulatov
Yuri Kuratov
Andrey Kravchenko
CLL
322
149
0
14 Jul 2022
Neural Networks and the Chomsky Hierarchy
International Conference on Learning Representations (ICLR), 2022
Grégoire Delétang
Anian Ruoss
Jordi Grau-Moya
Tim Genewein
L. Wenliang
...
Chris Cundy
Marcus Hutter
Shane Legg
Joel Veness
Pedro A. Ortega
UQCV
496
196
0
05 Jul 2022
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Transactions of the Association for Computational Linguistics (TACL), 2022
William Merrill
Ashish Sabharwal
477
154
0
02 Jul 2022
Long Range Language Modeling via Gated State Spaces
International Conference on Learning Representations (ICLR), 2022
Harsh Mehta
Ankit Gupta
Ashok Cutkosky
Behnam Neyshabur
Mamba
522
331
0
27 Jun 2022
On the Parameterization and Initialization of Diagonal State Space Models
Neural Information Processing Systems (NeurIPS), 2022
Albert Gu
Ankit Gupta
Karan Goel
Christopher Ré
413
471
0
23 Jun 2022
Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning
Neural Information Processing Systems (NeurIPS), 2022
Aniket Didolkar
Kshitij Gupta
Anirudh Goyal
Nitesh B. Gundavarapu
Alex Lamb
Nan Rosemary Ke
Yoshua Bengio
AI4CE
450
21
0
30 May 2022
Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity
Transactions of the Association for Computational Linguistics (TACL), 2022
Sophie Hao
Dana Angluin
Robert Frank
214
98
0
13 Apr 2022
Block-Recurrent Transformers
Neural Information Processing Systems (NeurIPS), 2022
DeLesley S. Hutchins
Imanol Schlag
Yuhuai Wu
Ethan Dyer
Behnam Neyshabur
448
131
0
11 Mar 2022
Transformer Quality in Linear Time
International Conference on Machine Learning (ICML), 2022
Weizhe Hua
Zihang Dai
Hanxiao Liu
Quoc V. Le
467
297
0
21 Feb 2022
Flowformer: Linearizing Transformers with Conservation Flows
International Conference on Machine Learning (ICML), 2022
Haixu Wu
Jialong Wu
Jiehui Xu
Jianmin Wang
Mingsheng Long
275
118
0
13 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro, LRM, AI4CE, ReLM
2.3K
14,449
0
28 Jan 2022
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLM, LRM
544
920
0
30 Nov 2021
Efficiently Modeling Long Sequences with Structured State Spaces
International Conference on Learning Representations (ICLR), 2021
Albert Gu
Karan Goel
Christopher Ré
983
2,835
0
31 Oct 2021