Bandit and Delayed Feedback in Online Structured Prediction

Online structured prediction is the task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that, in the full-information setting, finite bounds can be achieved on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic because it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback: bandit feedback and delayed feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound that depends on the time horizon and the size of the output set. However, the output set can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound independent of the size of the output set. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms with that of existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.
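To make the idea of inverse weighting under bandit feedback concrete, here is a minimal sketch of the generic importance-weighted estimator of a loss vector. It is not the paper's algorithm: the function names, the three-label example, and the use of a loss vector rather than a surrogate-loss gradient are all assumptions made for illustration. The learner samples one output from a distribution, observes only that output's loss, and divides by the sampling probability so that the estimate is unbiased.

```python
import random


def inverse_weighted_estimate(probs, chosen, observed_loss):
    """Estimate the full loss vector from the single observed entry.

    Hypothetical helper: only the chosen index is non-zero, and it is
    scaled by 1 / probs[chosen] so that the estimate is unbiased.
    """
    estimate = [0.0] * len(probs)
    estimate[chosen] = observed_loss / probs[chosen]
    return estimate


def expected_estimate(probs, losses):
    """Exact expectation of the estimator under the sampling distribution."""
    k = len(probs)
    expectation = [0.0] * k
    for chosen in range(k):
        est = inverse_weighted_estimate(probs, chosen, losses[chosen])
        for i in range(k):
            expectation[i] += probs[chosen] * est[i]
    return expectation


def sample_and_estimate(probs, losses, rng):
    """One round of bandit feedback: sample an output, see only its loss."""
    chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return chosen, inverse_weighted_estimate(probs, chosen, losses[chosen])


# Illustrative numbers (assumed): three candidate outputs, 0-1 losses.
probs = [0.5, 0.3, 0.2]
losses = [1.0, 0.0, 1.0]

# Averaging over the learner's randomness recovers the true loss vector.
expectation = expected_estimate(probs, losses)

# A single sampled round yields a sparse, rescaled estimate.
chosen, estimate = sample_and_estimate(probs, losses, random.Random(0))
```

The non-zero entry of a single estimate is scaled by the reciprocal of the sampling probability, so outputs that are rarely sampled produce large estimates. This is a common source of variance when the number of candidate outputs is large.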
@article{shibukawa2025_2502.18709,
  title={Bandit and Delayed Feedback in Online Structured Prediction},
  author={Yuki Shibukawa and Taira Tsuchiya and Shinsaku Sakaue and Kenji Yamanishi},
  journal={arXiv preprint arXiv:2502.18709},
  year={2025}
}