
Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

Main: 10 pages
Figures: 1
Bibliography: 4 pages
Appendix: 39 pages
Abstract

The information exponent ([BAGJ21]) and its extensions -- which, for Gaussian single-index models, are equivalent to the lowest degree in the Hermite expansion of the link function (after a potential label transform) -- have played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of the form $f_*(\mathbf{x}) = \sum_{k=1}^{P} \phi(\mathbf{v}_k^* \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions $\{\mathbf{v}_k^*\}_{k=1}^P$ are orthonormal, and the information exponent of $\phi$ is $L$. Based on the theory of information exponent, when $L = 2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms, and when $L > 2$, recovering the directions using online SGD requires $\tilde{O}(P d^{L-1})$ samples. In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant subspace using the second-order terms and then the exact directions using the higher-order terms, and the overall sample complexity of online SGD is $\tilde{O}(d P^{L-1})$.
