Combinatorial Semi-Bandit in the Non-Stationary Environment

Conference on Uncertainty in Artificial Intelligence (UAI), 2020

Abstract

In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists \cite{wang2017improving}, our algorithm achieves $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal{S}$ is the number of switchings and $\mathcal{V}$ is the sum of the total ``distribution changes''. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter $\mathcal{S}$ or $\mathcal{V}$ in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters $\mathcal{S}$ or $\mathcal{V}$, but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.
