From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation
The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order and gap-independent regret of order . RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using in confidence bounds, (2) adaptive control via , and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns - distributional changes, periodic shifts, and temporary fluctuations - in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.
View on arXiv