Error Feedback Shines when Features are Rare

We provide the first proof that gradient descent (GD) with greedy sparsification (TopK) and error feedback (EF) can obtain better communication complexity than vanilla GD when solving the distributed optimization problem $\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}$, where $n$ = # of clients, $d$ = # of features, and $f_1, \dots, f_n$ are smooth nonconvex functions. Despite intensive research since 2014, when EF was first proposed by Seide et al., this problem remained open until now. We show that EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\|\nabla f(\hat{x})\|^2 \le \varepsilon$ in expectation, GD with the Top1 sparsifier and EF requires a number of bits to be communicated by each worker to the server only that scales as $\mathcal{O}(1/\varepsilon)$, with a constant governed by $L$, the smoothness constant of $f$, by the smoothness constants $L_i$ of the $f_i$, by $c$, the maximal number of clients owning any feature ($c \le n$), and by $r$, the maximal number of features owned by any client ($r \le d$). Clearly, the communication complexity improves as $c$ decreases (i.e., as features become more rare), and can be much better than the $\mathcal{O}(dL/\varepsilon)$ communication complexity of vanilla GD in the same regime.
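For readers unfamiliar with the mechanism, the sketch below illustrates one standard form of error feedback, the EF21-style variant, combined with a greedy TopK sparsifier in distributed gradient descent: each client only communicates a $K$-sparse correction to the gradient estimate it previously transmitted. This is a minimal NumPy sketch for intuition only; the function names (`top_k`, `ef21_gd`), the initialization, and the toy "rare features" setup are illustrative assumptions and are not the paper's code or its analyzed setting.

```python
import numpy as np

def top_k(v, k=1):
    # Greedy (TopK) sparsifier: keep the k largest-magnitude entries, zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_gd(grad_fns, x0, step_size, num_iters, k=1):
    # EF21-style error feedback with TopK compression (illustrative sketch).
    # grad_fns[i](x) returns the gradient of client i's local function f_i at x.
    # Each client keeps a state g[i] approximating its current gradient and per round
    # communicates only the k-sparse correction c_i = TopK(grad_i(x) - g[i]).
    n = len(grad_fns)
    x = x0.astype(float).copy()
    g = [grad_fns[i](x) for i in range(n)]   # client states (sent uncompressed once)
    g_avg = sum(g) / n                        # server-side aggregate of the states
    for _ in range(num_iters):
        x = x - step_size * g_avg             # server step using the aggregate
        for i in range(n):
            c_i = top_k(grad_fns[i](x) - g[i], k)  # only k coordinates leave client i
            g[i] = g[i] + c_i                      # client updates its local state
            g_avg = g_avg + c_i / n                # server applies the same correction
    return x

# Hypothetical toy with rare features: each client's quadratic touches only 2 of d coordinates.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 10, 5
    masks = [rng.choice(d, size=2, replace=False) for _ in range(n)]

    def make_grad(mask):
        def grad(x):
            g = np.zeros(d)
            g[mask] = x[mask] - 1.0   # gradient of 0.5 * ||x[mask] - 1||^2
            return g
        return grad

    x_hat = ef21_gd([make_grad(m) for m in masks], np.zeros(d), step_size=0.5, num_iters=200)
    print(x_hat)
```

Intuitively, when each feature is owned by only a few clients, the local gradients are already sparse, so the greedy sparsifier discards little information per round; this is the regime in which the abstract states that EF helps.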