Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma
arXiv:2305.14342 · VLM · 23 May 2023
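
For context on what the papers below are citing: Sophia preconditions an exponential moving average of gradients with a cheap, periodically refreshed estimate of the diagonal of the Hessian, then clips the resulting per-coordinate step. The snippet below is a minimal PyTorch-style sketch of that idea using a Hutchinson-style diagonal estimator; the function names, single-tensor interface, and hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def hutchinson_diag(loss, params):
    """Hutchinson estimate of the Hessian diagonal: E[u * (H u)] with u ~ N(0, I)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randn_like(p) for p in params]
    g_dot_u = sum((g * u).sum() for g, u in zip(grads, us))  # <grad, u>
    hvps = torch.autograd.grad(g_dot_u, params)              # Hessian-vector products H u
    return [u * hvp for u, hvp in zip(us, hvps)]

@torch.no_grad()
def sophia_step(p, grad, m, h, lr=1e-4, beta1=0.965, rho=0.04,
                weight_decay=0.1, eps=1e-15):
    """One Sophia-style update for a single tensor (illustrative sketch).

    m: EMA of gradients; h: EMA of the Hessian-diagonal estimate, assumed
    to be refreshed from hutchinson_diag elsewhere (e.g. every 10 steps).
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)        # momentum EMA
    p.mul_(1 - lr * weight_decay)                    # decoupled weight decay
    # Preconditioned step, clipped per coordinate so |step| <= lr.
    ratio = (m.abs() / (rho * h).clamp_(min=eps)).clamp_(max=1.0)
    p.add_(m.sign() * ratio, alpha=-lr)
```

In the paper's notation the update is $\theta_{t+1} = \theta_t - \eta_t \,\mathrm{clip}(m_t / \max(\gamma h_t, \epsilon), 1)$; the elementwise clipping caps the step size wherever the curvature estimate is small or stale.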

Papers citing "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training"

50 / 103 papers shown

MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
Ionut-Vlad Modoranu, M. Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtárik, Dan Alistarh
MQ · 24 May 2024

Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling
Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, ..., Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang
23 May 2024

Exact Gauss-Newton Optimization for Training Deep Neural Networks
Mikalai Korbit, Adeyemi Damilare Adeoye, Alberto Bemporad, Mario Zanon
ODL · 23 May 2024

How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024

Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Yeqi Gao, Yuzhou Gu, Zhao-quan Song
09 May 2024

Towards Stability of Parameter-free Optimization
Yijiang Pang, Shuyang Yu, Hoang Bao, Jiayu Zhou
07 May 2024

Dynamic Anisotropic Smoothing for Noisy Derivative-Free Optimization
S. Reifenstein, T. Leleu, Yoshihisa Yamamoto
02 May 2024

Expanding the Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray Classification
Skylar Chan, Pranav Kulkarni, P. Yi, Vishwa S. Parekh
30 Apr 2024

SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, B. Kailkhura, Sijia Liu
MU · 28 Apr 2024

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks
Matteo Tucat, Anirbit Mukherjee, Procheta Sen, Mingfei Sun, Omar Rivasplata
MLT · 12 Apr 2024

Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
Shuo Xie, Zhiyuan Li
OffRL · 05 Apr 2024

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang
26 Mar 2024

Understanding Emergent Abilities of Language Models from the Loss Perspective
Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang
UQCV, LRM · 23 Mar 2024

Towards Faster Training of Diffusion Models: An Inspiration of A Consistency Phenomenon
Tianshuo Xu, Peng Mi, Ruilin Wang, Yingcong Chen
DiffM · 14 Mar 2024

CardioGenAI: A Machine Learning-Based Framework for Re-Engineering Drugs for Reduced hERG Liability
Gregory W. Kyro, Matthew T. Martin, Eric D. Watt, Victor S. Batista
12 Mar 2024

Towards Optimal Learning of Language Models
Yuxian Gu, Li Dong, Y. Hao, Qingxiu Dong, Minlie Huang, Furu Wei
27 Feb 2024

Variational Learning is Effective for Large Deep Networks
Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, ..., Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, Thomas Möllenhoff
27 Feb 2024

Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhimin Luo
26 Feb 2024

Helen: Optimizing CTR Prediction Models with Frequency-wise Hessian Eigenvalue Regularization
Zirui Zhu, Yong Liu, Zangwei Zheng, Huifeng Guo, Yang You
23 Feb 2024

Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer
Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang
23 Feb 2024

Stochastic Hessian Fittings with Lie Groups
Xi-Lin Li
19 Feb 2024

Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, ..., Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
AILaw, MU · 13 Feb 2024

Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters?
Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion Androutsopoulos
10 Feb 2024

Efficient Stagewise Pretraining via Progressive Subnetworks
Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
08 Feb 2024

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective
Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
ODL · 05 Feb 2024

Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks
Yongchang Hao, Yanshuai Cao, Lili Mou
ODL · 05 Feb 2024

Neglected Hessian component explains mysteries in Sharpness regularization
Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
FAtt · 19 Jan 2024

Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning
Avelina Asada Hadji-Kyriacou, Ognjen Arandjelović
14 Dec 2023

A Test-Time Learning Approach to Reparameterize the Geophysical Inverse Problem with a Convolutional Neural Network
Anran Xu, L. Heagy
07 Dec 2023

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way
Kai Lv, Shuo Zhang, Tianle Gu, Shuhao Xing, Jiawei Hong, ..., Tengxiao Liu, Yu Sun, Penousal Machado, Hang Yan, Xipeng Qiu
01 Dec 2023

Locally Optimal Descent for Dynamic Stepsize Scheduling
Gilad Yehudai, Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain
23 Nov 2023

Autoregressive Language Models For Estimating the Entropy of Epic EHR Audit Logs
Benjamin C. Warner, Thomas Kannampallil, Seunghwan Kim
10 Nov 2023

A Coefficient Makes SVRG Effective
Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
09 Nov 2023

Signal Processing Meets SGD: From Momentum to Filter
Zhipeng Yao, Guisong Chang, Jiaqi Zhang, Qi Zhang, Dazhou Li, Yu Zhang
ODL · 06 Nov 2023

Application of deep and reinforcement learning to boundary control problems
Zenin Easa Panthakkalakath, J. Kardoš, Olaf Schenk
AI4CE · 21 Oct 2023

DPZero: Private Fine-Tuning of Language Models without Backpropagation
Liang Zhang, Bingcong Li, K. K. Thekumparampil, Sewoong Oh, Niao He
14 Oct 2023

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
10 Oct 2023

Accelerated Neural Network Training with Rooted Logistic Objectives
Zhu Wang, Praveen Raj Veluswami, Harshit Mishra, Sathya Ravi
05 Oct 2023

A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Yeqi Gao, Zhao-quan Song, Weixin Wang, Junze Yin
14 Sep 2023

ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation
Gregory W. Kyro, Anton Morgunov, Rafael I. Brent, Victor S. Batista
11 Sep 2023

nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style Models with Limited Resources
Piotr Nawrot
AI4CE · 05 Sep 2023

How to Protect Copyright Data in Optimization of Large Language Models?
T. Chu, Zhao-quan Song, Chiwun Yang
23 Aug 2023

Convergence of Two-Layer Regression with Nonlinear Units
Yichuan Deng, Zhao-quan Song, Shenghao Xie
16 Aug 2023

Eva: A General Vectorized Approximation Framework for Second-order Optimization
Lin Zhang, S. Shi, Bo-wen Li
04 Aug 2023

Mini-Giants: "Small" Language Models and Open Source Win-Win
Zhengping Zhou, Lezhi Li, Xinxi Chen, Andy Li
SyDa, ALM, MoE · 17 Jul 2023

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner
12 Jul 2023

Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness
E. Zelikman, Qian Huang, Percy Liang, Nick Haber, Noah D. Goodman
16 Jun 2023

Early Weight Averaging meets High Learning Rates for LLM Pre-training
Sunny Sanyal, A. Neerkaje, Jean Kaddour, Abhishek Kumar, Sujay Sanghavi
MoMe · 05 Jun 2023

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
27 Apr 2023

FOSI: Hybrid First and Second Order Optimization
Hadar Sivan, Moshe Gabel, Assaf Schuster
ODL · 16 Feb 2023