Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment

In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j: \Delta_j > 0} \frac{\ln(T)}{\min\{\Delta_j, \epsilon\}}\right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $\Delta_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $\epsilon$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $\Omega\left(\frac{\ln(K)}{\min\{\Delta_{\min}, \epsilon\}}\right)$ instance-dependent regret lower bound and an $\Omega\left(\sqrt{T\ln(K)} + \frac{\ln(K)}{\epsilon}\right)$ minimax lower bound, where $K$ is the total number of actions and $\Delta_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $\epsilon$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\ln(T)$ factor.
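To make the bandit result concrete, below is a minimal Python sketch of a generic Laplace-mechanism UCB for $\epsilon$-differentially private stochastic bandits. This is an illustration of the general technique, not the paper's algorithm: the names (`dp_ucb`, `laplace`), the noise scale, and the confidence-bound constants are all assumptions, and a faithful implementation would use lazy (doubling-epoch) or tree-based releases so that the total privacy loss actually stays at $\epsilon$.

```python
import math
import random


def laplace(rng, scale):
    # Laplace(0, scale): difference of two iid Exp(1/scale) variables.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_ucb(arms, T, eps, seed=0):
    """Sketch of a Laplace-noise UCB for rewards in [0, 1].

    `arms` is a list of zero-argument callables returning a reward.
    Illustrative only: a complete privacy analysis needs lazy or
    tree-based releases; the extra ln(t)/(n*eps) confidence width
    below just shows where the min{Delta_j, eps} trade-off in the
    regret bound comes from.
    """
    rng = random.Random(seed)
    K = len(arms)
    counts = [0] * K
    sums = [0.0] * K
    history = []
    for t in range(1, T + 1):
        if t <= K:
            j = t - 1  # initialization: pull each arm once
        else:
            def index(i):
                # Privatize the reward sum with Laplace(1/eps) noise,
                # then widen the usual UCB confidence interval to
                # absorb that noise.
                noisy_mean = (sums[i] + laplace(rng, 1.0 / eps)) / counts[i]
                conf = math.sqrt(2.0 * math.log(t) / counts[i])  # standard UCB width
                priv = math.log(t) / (counts[i] * eps)           # widening for the noise
                return noisy_mean + conf + priv
            j = max(range(K), key=index)
        r = arms[j]()
        counts[j] += 1
        sums[j] += r
        history.append((j, r))
    return history
```

The shape of the index reflects the abstract's bound: an arm stops being pulled once its confidence width drops below its gap, and since the privacy term shrinks like $\ln(t)/(n\epsilon)$ while the sampling term shrinks like $\sqrt{\ln(t)/n}$, whichever of $\Delta_j$ or $\epsilon$ is smaller dictates how many pulls that takes, giving the $\min\{\Delta_j, \epsilon\}$ denominator.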