Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs

International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
Jing Wang
Peng Zhao
Zhi-Hua Zhou
Main: 38 Pages
2 Figures
Bibliography: 2 Pages
1 Table
Abstract

Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a \emph{refined analysis framework}, which simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\tilde{O}(k_\mu^{5/4} c_\mu^{-3/4} d^{3/4} P_T^{1/4}T^{3/4})$ regret, improving the $\tilde{O}(k_\mu^{2} c_\mu^{-1}d^{9/10} P_T^{1/5}T^{4/5})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the reward model's nonlinearity, $P_T$ measures the non-stationarity, $d$ and $T$ denote the dimension and time horizon. Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDP and Multinomial Logit (MNL) Mixture MDP. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.
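
To make the idea of a weighted strategy concrete, the following is a minimal sketch of a generic exponentially discounted ridge-regression estimator tracking a slowly rotating parameter. It is not the algorithm proposed in the paper: the function names, the discount factor `gamma`, the regularizer `lam`, the random feature vectors, and the drift pattern are all illustrative assumptions, and the confidence bonus and arm selection of a full bandit algorithm are omitted.

```python
import numpy as np


def weighted_ridge_update(V, b, x, r, gamma, lam):
    """One step of exponentially discounted ridge regression (illustrative).

    Past statistics are multiplied by gamma before the new sample is added.
    The (1 - gamma) * lam * I term keeps the regularizer fixed at lam * I
    when V starts at lam * I. Setting gamma = 1 recovers ordinary ridge.
    """
    d = x.shape[0]
    V = gamma * V + np.outer(x, x) + (1.0 - gamma) * lam * np.eye(d)
    b = gamma * b + r * x
    theta_hat = np.linalg.solve(V, b)
    return V, b, theta_hat


def final_errors(T=2000, d=3, gamma=0.98, lam=1.0, noise=0.1, seed=0):
    """Compare discounted and undiscounted estimates under gradual drift.

    Hypothetical setup: the true parameter rotates by pi radians in the
    first two coordinates over T rounds; features are standard Gaussian.
    Returns (error of discounted estimate, error of undiscounted estimate)
    measured against the true parameter at the final round.
    """
    rng = np.random.default_rng(seed)
    V_w, b_w = lam * np.eye(d), np.zeros(d)
    V_u, b_u = lam * np.eye(d), np.zeros(d)
    for t in range(1, T + 1):
        angle = np.pi * t / T
        theta = np.zeros(d)
        theta[0], theta[1] = np.cos(angle), np.sin(angle)
        x = rng.standard_normal(d)
        r = x @ theta + noise * rng.standard_normal()
        V_w, b_w, th_w = weighted_ridge_update(V_w, b_w, x, r, gamma, lam)
        V_u, b_u, th_u = weighted_ridge_update(V_u, b_u, x, r, 1.0, lam)
    return np.linalg.norm(th_w - theta), np.linalg.norm(th_u - theta)


if __name__ == "__main__":
    err_weighted, err_unweighted = final_errors()
    print("discounted error:  ", err_weighted)
    print("undiscounted error:", err_unweighted)
```

In this toy setting the discounted estimator mostly reflects recent observations, so it should track the drifting parameter more closely than the undiscounted one, which averages over the whole history. The update costs the same per round as ordinary ridge regression, which is the kind of efficiency the abstract attributes to a simple weight-based design.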
