Error Feedback Shines when Features are Rare

Abstract

We provide the first proof that gradient descent (GD) with greedy sparsification (TopK) and error feedback (EF) can obtain better communication complexity than vanilla GD when solving the distributed optimization problem $\min_{x\in \mathbb{R}^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where $n$ is the number of clients, $d$ is the number of features, and $f_1,\dots,f_n$ are smooth nonconvex functions. Despite intensive research since 2014 when EF was first proposed by Seide et al., this problem remained open until now. We show that EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\lVert \nabla f(\hat{x}) \rVert^2 \leq \varepsilon$ in expectation, GD with the Top1 sparsifier and EF requires ${\cal O} \left(\left( L + r \sqrt{ \frac{c}{n} \min \left( \frac{c}{n} \max_i L_i^2, \frac{1}{n}\sum_{i=1}^n L_i^2 \right) }\right) \frac{1}{\varepsilon} \right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, $c$ is the maximal number of clients owning any feature ($1\leq c \leq n$), and $r$ is the maximal number of features owned by any client ($1\leq r \leq d$). Clearly, the communication complexity improves as $c$ decreases (i.e., as features become more rare), and can be much better than the ${\cal O}(r L \frac{1}{\varepsilon})$ communication complexity of GD in the same regime.
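To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes an EF21-style error feedback update (the abstract does not say which EF variant is analyzed) and a toy problem with convex quadratic $f_i$ on disjoint feature blocks, chosen only because its constants are easy to check by hand. All names (`top_k`, `sparsity_constants`, `complexity_factor`, `ef21_top1`) and the step size are hypothetical choices made for illustration.

```python
import numpy as np

def top_k(v, k=1):
    """Greedy sparsifier: keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsity_constants(masks):
    """masks[i, j] is True when client i's data involves feature j."""
    c = int(masks.sum(axis=0).max())  # max number of clients owning any feature
    r = int(masks.sum(axis=1).max())  # max number of features owned by any client
    return c, r

def complexity_factor(L, L_i, c, r):
    """Factor multiplying 1/eps in the bound quoted above (constants dropped)."""
    n = len(L_i)
    inner = min(c / n * np.max(L_i ** 2), np.mean(L_i ** 2))
    return L + r * np.sqrt(c / n * inner)

def ef21_top1(grad_fns, x0, step, iters):
    """GD with Top1 and an EF21-style update (assumed variant):
    each client sends only Top1(grad_i(x) - g_i) and adds it to its state g_i."""
    n = len(grad_fns)
    x = x0.copy()
    g = [np.zeros_like(x0) for _ in range(n)]  # per-client gradient estimates
    for _ in range(iters):
        x = x - step * np.mean(g, axis=0)      # server step with averaged estimates
        for i, grad in enumerate(grad_fns):
            g[i] = g[i] + top_k(grad(x) - g[i], k=1)
    return x

# Toy instance (an assumption): n = 4 clients, d = 8 features,
# client i owns features 2i and 2i+1, so c = 1 and r = 2.
n, d = 4, 8
masks = np.zeros((n, d), dtype=bool)
for i in range(n):
    masks[i, 2 * i:2 * i + 2] = True

a = np.ones(n)                 # L_i of f_i(x) = (a_i / 2) * ||mask_i * (x - b)||^2
b = np.arange(d, dtype=float)  # minimizer of f
grad_fns = [lambda x, m=masks[i], ai=a[i]: ai * m * (x - b) for i in range(n)]

c, r = sparsity_constants(masks)
L = a.max() / n                # smoothness of f; valid here because supports are disjoint
factor = complexity_factor(L, a, c, r)

x_hat = ef21_top1(grad_fns, np.zeros(d), step=0.04, iters=3000)
full_grad = np.mean([gf(x_hat) for gf in grad_fns], axis=0)
grad_norm_sq = float(full_grad @ full_grad)
```

In this toy instance every feature is owned by exactly one client, which is the rarest possible regime ($c = 1$). The conservative step size was picked for the sketch rather than taken from the paper.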
