ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization
arXiv:2505.17852 · 23 May 2025
Francois Chaubard, Mykel J. Kochenderfer
MQ, AI4CE

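For context, "zero-order" (gradient-free) optimization estimates descent directions from function evaluations alone, typically by perturbing the parameters in random directions. The sketch below is a generic two-point, SPSA-style estimator on a toy quadratic — an illustration of the general technique, not the specific method of the paper above; all names in it are made up for this example.

```python
import numpy as np

def zo_gradient(f, x, eps=1e-3, rng=None):
    """Two-point zero-order gradient estimate:
    g ~ (f(x + eps*u) - f(x - eps*u)) / (2*eps) * u,
    where u is a random Gaussian direction. No autodiff needed."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)              # random perturbation direction
    fd = (f(x + eps * u) - f(x - eps * u)) / (2 * eps)
    return fd * u                                 # unbiased estimate (in expectation) of grad f

def zo_minimize(f, x0, lr=0.1, steps=500, seed=0):
    """Plain SGD using only zero-order gradient estimates."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(steps):
        x -= lr * zo_gradient(f, x, rng=rng)
    return x

# Minimize f(x) = ||x - 3||^2 without ever computing its gradient.
f = lambda x: float(np.sum((x - 3.0) ** 2))
x_star = zo_minimize(f, np.zeros(4))
```

Each step needs only two forward evaluations of `f`, which is what makes such methods attractive when backpropagation through time is too memory-hungry.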
Papers citing "Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization"

17 / 17 papers shown
Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models
Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
13 Oct 2024

Demystify Mamba in Vision: A Linear Attention Perspective
Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Mamba
26 May 2024

Gradient-Free Training of Recurrent Neural Networks using Random Perturbations
Jesus Garcia Fernandez, Sander Keemink, Marcel van Gerven
AAML
14 May 2024

Guided-SPSA: Simultaneous Perturbation Stochastic Approximation assisted by the Parameter Shift Rule
Maniraman Periyasamy, Axel Plinge, Christopher Mutschler, Daniel D. Scherer, Wolfgang Mauerer
24 Apr 2024

Linear attention is (maybe) all you need (to understand transformer optimization)
Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, S. Sra
02 Oct 2023

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
ALM, PILM
27 Feb 2023

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
VLM
27 May 2022

OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, ..., Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
VLM, OSLM, AI4CE
02 May 2022

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Paul Vicol, Luke Metz, Jascha Narain Sohl-Dickstein
27 Dec 2021

Efficient Transformers: A Survey
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
VLM
14 Sep 2020

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
29 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL
28 May 2020

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
3DV
12 Jun 2017

Unbiased Online Recurrent Optimization
Corentin Tallec, Yann Ollivier
16 Feb 2017

Decoupled Neural Interfaces using Synthetic Gradients
Max Jaderberg, Wojciech M. Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu
18 Aug 2016

Neural Turing Machines
Alex Graves, Greg Wayne, Ivo Danihelka
20 Oct 2014

Optimal rates for zero-order convex optimization: the power of two function evaluations
John C. Duchi, Michael I. Jordan, Martin J. Wainwright, Andre Wibisono
07 Dec 2013