
Investigating Recurrent Transformers with Dynamic Halt

1 February 2024
Jishnu Ray Chowdhury
Cornelia Caragea
arXiv: 2402.00976 (abs / PDF / HTML)

Papers citing "Investigating Recurrent Transformers with Dynamic Halt"

50 / 87 papers shown
Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization
Mohammad Mahdi Samiei Paqaleh
Arash Marioriyad
Arman Tahmasebi-Zadeh
Mohamadreza Fereydooni
Mahdi Ghaznavai
Mahdieh Soleymani Baghshah
120
0
0
06 Oct 2025
A Transformer with Stack Attention
Jiaoda Li
Jennifer C. White
Mrinmaya Sachan
Robert Bamler
236
4
0
07 May 2024
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
412
17
0
14 Apr 2024
The Illusion of State in State-Space Models
William Merrill
Jackson Petty
Ashish Sabharwal
413
119
0
12 Apr 2024
HGRN2: Gated Linear RNNs with State Expansion
Zhen Qin
Aaron Courville
Weixuan Sun
Xuyang Shen
Dong Li
Weigao Sun
Yiran Zhong
LRM
368
85
0
11 Apr 2024
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber
Barak Lenz
Hofit Bata
Gal Cohen
Jhonathan Osin
...
Nir Ratner
N. Rozen
Erez Shwartz
Mor Zusman
Y. Shoham
424
329
0
28 Mar 2024
Gated Linear Attention Transformers with Hardware-Efficient Training
Aaron Courville
Bailin Wang
Songlin Yang
Yikang Shen
Yoon Kim
443
300
0
11 Dec 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu
Tri Dao
Mamba
559
5,168
0
01 Dec 2023
On the Long Range Abilities of Transformers
Itamar Zimerman
Lior Wolf
250
11
0
28 Nov 2023
Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Zhen Qin
Aaron Courville
Yiran Zhong
196
117
0
08 Nov 2023
Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability
Jishnu Ray Chowdhury
Cornelia Caragea
227
5
0
08 Nov 2023
What Algorithms can Transformers Learn? A Study in Length Generalization
International Conference on Learning Representations (ICLR), 2023
Hattie Zhou
Arwen Bradley
Etai Littwin
Noam Razin
Omid Saremi
Josh Susskind
Samy Bengio
Preetum Nakkiran
289
160
0
24 Oct 2023
The Expressive Power of Transformers with Chain of Thought
William Merrill
Ashish Sabharwal
LRM, AI4CE, ReLM
531
41
0
11 Oct 2023
Sparse Universal Transformer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Shawn Tan
Songlin Yang
Zhenfang Chen
Aaron Courville
Chuang Gan
MoE
260
24
0
11 Oct 2023
Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
International Conference on Learning Representations (ICLR), 2023
Brian DuSell
David Chiang
390
15
0
03 Oct 2023
Efficient Beam Tree Recursion
Neural Information Processing Systems (NeurIPS), 2023
Jishnu Ray Chowdhury
Cornelia Caragea
352
3
0
20 Jul 2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
International Conference on Learning Representations (ICLR), 2023
Tri Dao
LRM
429
2,050
0
17 Jul 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
779
508
0
17 Jul 2023
Sparse Modular Activation for Efficient Sequence Modeling
Neural Information Processing Systems (NeurIPS), 2023
Liliang Ren
Yang Liu
Shuohang Wang
Yichong Xu
Chenguang Zhu
Chengxiang Zhai
275
17
0
19 Jun 2023
Block-State Transformers
Neural Information Processing Systems (NeurIPS), 2023
Mahan Fathi
Jonathan Pilault
Orhan Firat
C. Pal
Pierre-Luc Bacon
Ross Goroshin
240
25
0
15 Jun 2023
Exposing Attention Glitches with Flip-Flop Language Modeling
Neural Information Processing Systems (NeurIPS), 2023
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
LRM
206
70
0
01 Jun 2023
Beam Tree Recursive Cells
International Conference on Machine Learning (ICML), 2023
Jishnu Ray Chowdhury
Cornelia Caragea
385
6
0
31 May 2023
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Neural Information Processing Systems (NeurIPS), 2023
Guhao Feng
Bohang Zhang
Yuntian Gu
Haotian Ye
Di He
Liwei Wang
LRM
649
354
0
24 May 2023
RWKV: Reinventing RNNs for the Transformer Era
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Bo Peng
Eric Alcaide
Quentin G. Anthony
Alon Albalak
Samuel Arcadinho
...
Qihang Zhao
P. Zhou
Qinghua Zhou
Jian Zhu
Rui-Jie Zhu
578
845
0
22 May 2023
Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ta-Chung Chi
Ting-Han Fan
Alexander I. Rudnicky
Peter J. Ramadge
LRM
153
15
0
05 May 2023
CoLT5: Faster Long-Range Transformers with Conditional Computation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Joshua Ainslie
Tao Lei
Michiel de Jong
Santiago Ontañón
Siddhartha Brahma
...
Mandy Guo
James Lee-Thorp
Yi Tay
Yun-hsuan Sung
Sumit Sanghai
LLMAG
209
89
0
17 Mar 2023
Resurrecting Recurrent Neural Networks for Long Sequences
International Conference on Machine Learning (ICML), 2023
Antonio Orvieto
Samuel L. Smith
Albert Gu
Anushan Fernando
Çağlar Gülçehre
Razvan Pascanu
Soham De
497
418
0
11 Mar 2023
Modular Deep Learning
Jonas Pfeiffer
Sebastian Ruder
Ivan Vulić
Edoardo Ponti
MoMe, OOD
437
103
0
22 Feb 2023
Adaptive Computation with Elastic Input Sequence
International Conference on Machine Learning (ICML), 2023
Fuzhao Xue
Valerii Likhosherstov
Anurag Arnab
N. Houlsby
Mostafa Dehghani
Yang You
241
27
0
30 Jan 2023
A Length-Extrapolatable Transformer
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Yutao Sun
Li Dong
Barun Patra
Shuming Ma
Shaohan Huang
Alon Benhaim
Vishrav Chaudhary
Xia Song
Furu Wei
316
154
0
20 Dec 2022
Towards Reasoning in Large Language Models: A Survey
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Jie Huang
Kevin Chen-Chuan Chang
LM&MA, ELM, LRM
980
805
0
20 Dec 2022
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
S. Bhattamishra
Arkil Patel
Varun Kanade
Phil Blunsom
456
62
0
22 Nov 2022
Transformers Learn Shortcuts to Automata
International Conference on Learning Representations (ICLR), 2022
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
OffRL, LRM
499
222
0
19 Oct 2022
Neural Attentive Circuits
Neural Information Processing Systems (NeurIPS), 2022
Nasim Rahaman
M. Weiß
Francesco Locatello
C. Pal
Yoshua Bengio
Bernhard Schölkopf
Erran L. Li
Nicolas Ballas
279
8
0
14 Oct 2022
Mega: Moving Average Equipped Gated Attention
International Conference on Learning Representations (ICLR), 2022
Xuezhe Ma
Chunting Zhou
Xiang Kong
Junxian He
Liangke Gui
Graham Neubig
Jonathan May
Luke Zettlemoyer
324
216
0
21 Sep 2022
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yi Tay
Mostafa Dehghani
Samira Abnar
Hyung Won Chung
W. Fedus
J. Rao
Sharan Narang
Vinh Q. Tran
Dani Yogatama
Donald Metzler
AI4CE
235
121
0
21 Jul 2022
Confident Adaptive Language Modeling
Neural Information Processing Systems (NeurIPS), 2022
Tal Schuster
Adam Fisch
Jai Gupta
Mostafa Dehghani
Dara Bahri
Vinh Q. Tran
Yi Tay
Donald Metzler
750
221
0
14 Jul 2022
Recurrent Memory Transformer
Neural Information Processing Systems (NeurIPS), 2022
Aydar Bulatov
Yuri Kuratov
Andrey Kravchenko
CLL
322
149
0
14 Jul 2022
Neural Networks and the Chomsky Hierarchy
International Conference on Learning Representations (ICLR), 2022
Grégoire Delétang
Anian Ruoss
Jordi Grau-Moya
Tim Genewein
L. Wenliang
...
Chris Cundy
Marcus Hutter
Shane Legg
Joel Veness
Pedro A. Ortega
UQCV
496
196
0
05 Jul 2022
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Transactions of the Association for Computational Linguistics (TACL), 2022
William Merrill
Ashish Sabharwal
477
154
0
02 Jul 2022
Long Range Language Modeling via Gated State Spaces
International Conference on Learning Representations (ICLR), 2022
Harsh Mehta
Ankit Gupta
Ashok Cutkosky
Behnam Neyshabur
Mamba
522
331
0
27 Jun 2022
On the Parameterization and Initialization of Diagonal State Space Models
Neural Information Processing Systems (NeurIPS), 2022
Albert Gu
Ankit Gupta
Karan Goel
Christopher Ré
413
471
0
23 Jun 2022
Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning
Neural Information Processing Systems (NeurIPS), 2022
Aniket Didolkar
Kshitij Gupta
Anirudh Goyal
Nitesh B. Gundavarapu
Alex Lamb
Nan Rosemary Ke
Yoshua Bengio
AI4CE
450
21
0
30 May 2022
Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity
Transactions of the Association for Computational Linguistics (TACL), 2022
Sophie Hao
Dana Angluin
Robert Frank
214
98
0
13 Apr 2022
Block-Recurrent Transformers
Neural Information Processing Systems (NeurIPS), 2022
DeLesley S. Hutchins
Imanol Schlag
Yuhuai Wu
Ethan Dyer
Behnam Neyshabur
448
131
0
11 Mar 2022
Transformer Quality in Linear Time
International Conference on Machine Learning (ICML), 2022
Weizhe Hua
Zihang Dai
Hanxiao Liu
Quoc V. Le
467
297
0
21 Feb 2022
Flowformer: Linearizing Transformers with Conservation Flows
International Conference on Machine Learning (ICML), 2022
Haixu Wu
Jialong Wu
Jiehui Xu
Jianmin Wang
Mingsheng Long
275
118
0
13 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro, LRM, AI4CE, ReLM
2.3K
14,449
0
28 Jan 2022
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLM, LRM
544
920
0
30 Nov 2021
Efficiently Modeling Long Sequences with Structured State Spaces
International Conference on Learning Representations (ICLR), 2021
Albert Gu
Karan Goel
Christopher Ré
983
2,835
0
31 Oct 2021