
Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment

Abstract

In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j: \Delta_j>0} \frac{\ln(T)}{\min\{\Delta_j, \epsilon\}}\right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $\Delta_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $\epsilon$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $\Omega\left(\frac{\ln(K)}{\min\{\Delta_{\min}, \epsilon\}}\right)$ instance-dependent regret lower bound and an $\Omega\left(\sqrt{T\ln(K)} + \frac{\ln(K)}{\epsilon}\right)$ minimax lower bound, where $K$ is the total number of actions and $\Delta_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $\epsilon$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\log(T)$ factor.
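The paper's algorithms are not reproduced here, but the $\frac{\ln(T)}{\min\{\Delta_j, \epsilon\}}$ shape of the bound reflects a common design in private bandits: perturb each arm's statistics with Laplace noise and widen the confidence bonus to absorb it. Below is a minimal, hypothetical Python sketch of that Laplace-noise UCB idea. The function and parameter names are illustrative, and the fresh per-round noise shown here is not rigorously privacy-accounted (faithful $\epsilon$-DP algorithms use careful composition or tree-based aggregation), so this is a sketch of the general technique, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_ucb(reward_fns, T, eps):
    """Toy Laplace-noise UCB sketch for rewards in [0, 1].

    NOT a faithful eps-DP implementation: fresh per-round noise on the
    running sums is shown only to illustrate the mechanism; a rigorous
    guarantee needs tree-based aggregation or explicit composition.
    """
    K = len(reward_fns)
    counts = np.zeros(K)
    sums = np.zeros(K)
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # pull every arm once to initialize statistics
        else:
            # Laplace noise masks each arm's running reward sum ...
            noisy_mean = (sums + rng.laplace(0.0, 1.0 / eps, K)) / counts
            # ... and the confidence bonus is widened by ln(t)/(eps * n_j)
            # to absorb the privacy noise on top of sampling error.
            bonus = np.sqrt(2 * np.log(t) / counts) + np.log(t) / (eps * counts)
            arm = int(np.argmax(noisy_mean + bonus))
        r = reward_fns[arm]()
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# Usage: three Bernoulli arms with means 0.9, 0.5, 0.4.
arms = [lambda p=p: float(rng.random() < p) for p in (0.9, 0.5, 0.4)]
print(private_ucb(arms, T=10_000, eps=1.0))
```

Note how the extra $\frac{\ln(t)}{\epsilon\, n_j}$ bonus term dominates the usual $\sqrt{\ln(t)/n_j}$ width when $\epsilon$ is small, mirroring the $\min\{\Delta_j, \epsilon\}$ denominator in the instance-dependent regret bound.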
