Towards stability and optimality in stochastic gradient descent
Iterative procedures for parameter estimation based on stochastic gradient descent allow estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed averaged implicit stochastic gradient descent (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the Cramér-Rao bound under strong convexity, i.e., it is asymptotically an optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AI-SGD achieves competitive performance with state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal operators, and is simpler to implement than procedures that do employ proximal operators but require careful tuning of several hyperparameters.
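To make the procedure concrete, below is a minimal sketch of averaged implicit SGD for a squared-error loss, where the implicit (proximal) update has a closed form. The learning-rate schedule, function names, and synthetic data are illustrative assumptions, not the paper's exact algorithmic or experimental setup.

```python
# Minimal sketch of averaged implicit SGD (AI-SGD) on least-squares regression.
# Assumptions: decaying learning rate gamma_i = lr0 / (1 + i)^alpha and a
# closed-form implicit update, which holds for the squared-error loss.
import numpy as np

def ai_sgd_least_squares(X, y, lr0=1.0, alpha=0.75):
    """Run AI-SGD on the loss 0.5 * (y_i - x_i @ theta)^2, one pass over the data."""
    n, p = X.shape
    theta = np.zeros(p)       # implicit SGD iterate
    theta_bar = np.zeros(p)   # running average of iterates (the returned estimator)
    for i in range(n):
        x_i, y_i = X[i], y[i]
        gamma = lr0 / (1 + i) ** alpha  # assumed learning-rate schedule
        # Implicit update: theta_new = theta - gamma * grad_loss(theta_new).
        # For squared error the fixed-point equation solves in closed form:
        resid = y_i - x_i @ theta
        scale = gamma / (1.0 + gamma * (x_i @ x_i))
        theta = theta + scale * resid * x_i
        # Average the iterates (Polyak-Ruppert averaging) for statistical efficiency.
        theta_bar = (i * theta_bar + theta) / (i + 1)
    return theta_bar

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=10_000)
print(ai_sgd_least_squares(X, y))  # should be close to theta_star
```

Note that the step size appears only inside the implicit update's denominator-damped step, which is what gives the procedure its robustness to the learning-rate choice relative to standard (explicit) SGD.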