
Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

Abstract

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms are ill-suited to the demands of large-scale applications. In this paper, we introduce Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)²U, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. (FG)²U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization methods. Additionally, (FG)²U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems for significant computational efficiency. In practice, (FG)²U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, (FG)²U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for (FG)²U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
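To make the core idea concrete, below is a minimal sketch of forward gradient unrolling with a forward (directional) gradient, written in JAX. It is not the authors' implementation: the weighted ridge-regression inner problem, the step sizes, and the unroll length T are illustrative assumptions. The sketch unrolls T inner gradient steps, takes a single Jacobian-vector product through the unrolled map along a random Gaussian tangent v, and returns (directional derivative) * v, which is an unbiased estimator of the meta gradient without storing the reverse-mode computation graph.

    import jax
    import jax.numpy as jnp

    def inner_step(w, lam, x, y):
        # One gradient-descent step on a toy weighted ridge-regression inner loss
        # (hypothetical inner problem chosen only for illustration).
        def inner_loss(w):
            return jnp.mean(lam * (x @ w - y) ** 2) + 0.01 * jnp.sum(w ** 2)
        return w - 0.1 * jax.grad(inner_loss)(w)

    def unrolled_outer_loss(lam, w0, x_tr, y_tr, x_val, y_val, T=20):
        # Unroll T inner steps, then evaluate the outer (validation) loss.
        w = w0
        for _ in range(T):
            w = inner_step(w, lam, x_tr, y_tr)
        return jnp.mean((x_val @ w - y_val) ** 2)

    def fg2u_meta_grad(key, lam, w0, x_tr, y_tr, x_val, y_val):
        # Forward-gradient estimate of the meta gradient:
        # sample a random tangent v, compute one forward-mode JVP through the
        # unrolled inner loop, and return (d/dv outer_loss) * v.
        # With v ~ N(0, I), this estimator is unbiased for the true meta gradient.
        v = jax.random.normal(key, lam.shape)
        _, dir_deriv = jax.jvp(
            lambda l: unrolled_outer_loss(l, w0, x_tr, y_tr, x_val, y_val),
            (lam,), (v,))
        return dir_deriv * v

In a distributed setting, independent tangents (and hence independent estimates) can be computed in parallel on separate workers and averaged to reduce variance, which is the kind of parallelism the abstract alludes to; the averaging step itself is not shown here.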
