Stochastic Matching Bandits with Rare Optimization Updates
We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, agents on one side are assigned to arms on the other side; each arm stochastically selects one agent from its assigned pool according to unknown preferences and yields a corresponding reward. The objective is to minimize regret over the horizon by maximizing the cumulative revenue from successful matches. A naive approach must solve an NP-hard combinatorial optimization problem at every round, which is computationally prohibitive. To address this challenge, we propose batched algorithms that strategically limit the number of matching-assignment updates over the entire horizon. By invoking the expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving low regret.
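To make the choice model concrete, here is a minimal sketch of MNL sampling as the abstract describes it: an arm draws one agent from its assigned pool with probability proportional to that agent's (unknown) preference weight, with the constant 1 in the denominator acting as a no-match outside option. The function name, the dictionary of utilities, and the specific weights are illustrative assumptions, not part of the paper.

```python
import random

def mnl_choice(pool, utilities, rng=random):
    """Sample one agent from `pool` under the MNL choice model.

    Agent i is selected with probability v_i / (1 + sum_k v_k),
    where v_k are the (in the bandit setting, unknown) preference
    weights of the agents in the pool; with the remaining
    probability 1 / (1 + sum_k v_k) the arm matches no one.
    """
    weights = [utilities[a] for a in pool]
    denom = 1.0 + sum(weights)  # the "+1" is the no-choice option
    r = rng.random() * denom
    acc = 0.0
    for agent, w in zip(pool, weights):
        acc += w
        if r < acc:
            return agent
    return None  # outside option: no match this round
```

In the bandit problem, the learner does not know the weights `v_i`; it must estimate them from the observed choices while deciding which pools to assign, which is what makes repeated re-optimization of the assignment costly.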