Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability
Jishnu Ray Chowdhury, Cornelia Caragea

What Algorithms Can Transformers Learn? A Study in Length Generalization. International Conference on Learning Representations (ICLR), 2023.
Sparse Universal Transformer. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns. International Conference on Learning Representations (ICLR), 2023.
Efficient Beam Tree Recursion. Neural Information Processing Systems (NeurIPS), 2023. Jishnu Ray Chowdhury, Cornelia Caragea.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. International Conference on Learning Representations (ICLR), 2023.
Sparse Modular Activation for Efficient Sequence Modeling. Neural Information Processing Systems (NeurIPS), 2023.
Block-State Transformers. Neural Information Processing Systems (NeurIPS), 2023.
Exposing Attention Glitches with Flip-Flop Language Modeling. Neural Information Processing Systems (NeurIPS), 2023.
Beam Tree Recursive Cells. International Conference on Machine Learning (ICML), 2023. Jishnu Ray Chowdhury, Cornelia Caragea.
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. Neural Information Processing Systems (NeurIPS), 2023.
RWKV: Reinventing RNNs for the Transformer Era. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
CoLT5: Faster Long-Range Transformers with Conditional Computation. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, ..., Mandy Guo, James Lee-Thorp, Yi Tay, Yun-hsuan Sung, Sumit Sanghai.
Resurrecting Recurrent Neural Networks for Long Sequences. International Conference on Machine Learning (ICML), 2023.
Adaptive Computation with Elastic Input Sequence. International Conference on Machine Learning (ICML), 2023.
A Length-Extrapolatable Transformer. Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
Towards Reasoning in Large Language Models: A Survey. Annual Meeting of the Association for Computational Linguistics (ACL), 2022. Jie Huang, Kevin Chen-Chuan Chang.
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions. Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
Transformers Learn Shortcuts to Automata. International Conference on Learning Representations (ICLR), 2022.
Neural Attentive Circuits. Neural Information Processing Systems (NeurIPS), 2022.
Mega: Moving Average Equipped Gated Attention. International Conference on Learning Representations (ICLR), 2022.
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
Confident Adaptive Language Modeling. Neural Information Processing Systems (NeurIPS), 2022.
Recurrent Memory Transformer. Neural Information Processing Systems (NeurIPS), 2022.
Neural Networks and the Chomsky Hierarchy. International Conference on Learning Representations (ICLR), 2022.
The Parallelism Tradeoff: Limitations of Log-Precision Transformers. Transactions of the Association for Computational Linguistics (TACL), 2022.
Long Range Language Modeling via Gated State Spaces. International Conference on Learning Representations (ICLR), 2022.
On the Parameterization and Initialization of Diagonal State Space Models. Neural Information Processing Systems (NeurIPS), 2022.
Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning. Neural Information Processing Systems (NeurIPS), 2022.
Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity. Transactions of the Association for Computational Linguistics (TACL), 2022.
Block-Recurrent Transformers. Neural Information Processing Systems (NeurIPS), 2022.
Transformer Quality in Linear Time. International Conference on Machine Learning (ICML), 2022.
Flowformer: Linearizing Transformers with Conservation Flows. International Conference on Machine Learning (ICML), 2022.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Neural Information Processing Systems (NeurIPS), 2022.
Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations (ICLR), 2021.