Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the \emph{dynamic regret} as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let be the time horizon and be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves and dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length ahead of time, which is generally unknown.
View on arXiv