
Adversarial Combinatorial Bandits with General Non-linear Reward Functions

Abstract

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde\Theta_{d}(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d < K$, and $\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound $O(\sqrt{\mathrm{poly}(N,K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the $\binom{N}{K}$ assortments as independent.
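To give a rough sense of how the two regret rates stated above scale, the following minimal sketch evaluates $\sqrt{N^d T}$ and $\sqrt{N^K T}$ and counts the $\binom{N}{K}$ assortments. It is an illustration only: the function names and the example values of $N$, $K$, $d$, and $T$ are assumptions, and the constants depending on $d$ or $K$ and the logarithmic factors hidden by $\widetilde\Theta_d$ and $\Theta_K$ are ignored.

```python
import math


def polynomial_regret_scale(n_arms: int, degree: int, horizon: int) -> float:
    """Scale sqrt(N^d * T) for a degree-d polynomial reward (constants and logs dropped)."""
    return math.sqrt(n_arms ** degree * horizon)


def general_regret_scale(n_arms: int, subset_size: int, horizon: int) -> float:
    """Scale sqrt(N^K * T) for a reward that is not a low-degree polynomial."""
    return math.sqrt(n_arms ** subset_size * horizon)


def num_assortments(n_arms: int, subset_size: int) -> int:
    """Number of size-K subsets of N arms, i.e. binom(N, K)."""
    return math.comb(n_arms, subset_size)


if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    N, K, d, T = 20, 5, 2, 10**6
    print("assortments:", num_assortments(N, K))
    print("degree-d scale:", polynomial_regret_scale(N, d, T))
    print("general scale:", general_regret_scale(N, K, T))
```

Because both scales share the factor $\sqrt{T}$, their ratio is $\sqrt{N^{K-d}}$, independent of the horizon.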
