
Characterizing the implicit bias via a primal-dual analysis

Abstract

This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast to prior results, which are often tailored to exponentially-tailed losses. For the exponential loss specifically, with $n$ training examples and $t$ gradient descent steps, our dual analysis further allows us to prove an $O(\ln(n)/\ln(t))$ convergence rate to the $\ell_2$ maximum margin direction when a constant step size is used. This rate is tight in both $n$ and $t$, which has not been shown in prior work. On the other hand, with a properly chosen but aggressive step size schedule, we prove an $O(1/t)$ convergence rate for $\ell_2$ margin maximization, while prior work has only proved an $\tilde{O}(1/\sqrt{t})$ rate, or an $O(1/t)$ rate of convergence to a suboptimal margin. Our key observations are that gradient descent on the primal variable naturally induces a mirror descent update on the dual variable, and that the dual objective in this setting is smooth enough to yield a faster rate.
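The setting described in the abstract is simple enough to reproduce numerically. Below is a minimal Python sketch, not taken from the paper: the toy dataset, step size, and iteration count are illustrative assumptions. It runs plain gradient descent on the empirical exponential loss over linearly separable data and reports the normalized margin, which slowly approaches the $\ell_2$ maximum margin under a constant step size; the normalized per-example losses form a distribution over examples, which is the kind of dual variable the primal-dual view in the abstract alludes to (this identification is my reading, not a statement from the paper).

import numpy as np

# A toy linearly separable dataset; labels are folded into the examples
# (z_i = y_i * x_i), so separability means some u satisfies <u, z_i> > 0 for all i.
Z = np.array([[2.0, 1.0], [1.0, 2.0], [1.5, 0.5]])
n, d = Z.shape

def exp_loss_grad(w):
    """Gradient of the empirical exponential loss (1/n) * sum_i exp(-<w, z_i>)."""
    losses = np.exp(-Z @ w)          # per-example losses
    return -(Z.T @ losses) / n, losses

w = np.zeros(d)
eta = 0.1                            # constant step size (illustrative)
T = 100_000
for _ in range(T):
    grad, losses = exp_loss_grad(w)
    w -= eta * grad                  # plain gradient descent on the primal variable

# Normalized per-example losses: a distribution over examples, the kind of
# object that serves as the dual variable in the primal-dual analysis
# (an interpretation assumed here, not quoted from the paper).
q = losses / losses.sum()

# Normalized margin of the iterate; under a constant step size it drifts toward
# the l2 maximum margin only at a logarithmic rate.
margin = np.min(Z @ w) / np.linalg.norm(w)
print(f"normalized margin after {T} steps: {margin:.4f}")
print(f"dual weights on the examples: {q}")

Running the sketch with more iterations shows the normalized margin creeping upward, consistent with the slow $O(\ln(n)/\ln(t))$ behavior that the abstract attributes to constant step sizes.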
