
Multinomial Logit Bandit with Linear Utility Functions

Abstract

The multinomial logit bandit is a sequential subset selection problem that arises in many applications. In each round, the player selects a $K$-cardinality subset from $N$ candidate items and receives a reward governed by a {\it multinomial logit} (MNL) choice model, which accounts for both item utility and the substitution property among items. The player's objective is to dynamically learn the parameters of the MNL model and maximize the cumulative reward over a finite horizon $T$. This problem faces the exploration-exploitation dilemma, and its combinatorial nature makes it non-trivial. In recent years, several algorithms have been developed by exploiting specific characteristics of the MNL model, but all of them estimate the parameters of the MNL model separately and incur a regret no better than $\tilde{O}\big(\sqrt{NT}\big)$, which is undesirable when the candidate set size $N$ is large. In this paper, we consider the {\it linear utility} MNL choice model, whose item utilities are represented as linear functions of $d$-dimensional item features, and propose an algorithm, titled {\bf LUMB}, to exploit the underlying structure. We prove that the proposed algorithm achieves $\tilde{O}\big(dK\sqrt{T}\big)$ regret, which is free of the candidate set size. Experiments show the superiority of the proposed algorithm.
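
To make the linear utility MNL choice model concrete, here is a minimal Python sketch of how choice probabilities could be computed for an offered subset. It is an illustration under stated assumptions, not the LUMB algorithm itself: the abstract does not spell out the exact form of the model, so the sketch assumes the common MNL parameterization with a no-purchase option of weight 1, utilities of the form $\theta^\top x_i$, and a fixed per-item reward. The names `linear_utilities`, `mnl_choice_probabilities`, `expected_reward`, `theta`, and `rewards` are hypothetical.

```python
import math


def linear_utilities(features, theta):
    """Utility of each item as the inner product of its feature vector with theta.

    Assumption: utilities are exactly linear, u_i = theta^T x_i.
    """
    return [sum(x_k * t_k for x_k, t_k in zip(x, theta)) for x in features]


def mnl_choice_probabilities(utilities, subset):
    """Probability that each offered item is chosen under an MNL model.

    Assumption: a no-purchase option with weight 1 is always available,
    so P(i | S) = exp(u_i) / (1 + sum_{j in S} exp(u_j)).
    The key None holds the no-purchase probability.
    """
    weights = {i: math.exp(utilities[i]) for i in subset}
    denom = 1.0 + sum(weights.values())
    probs = {i: w / denom for i, w in weights.items()}
    probs[None] = 1.0 / denom
    return probs


def expected_reward(utilities, rewards, subset):
    """Expected one-round reward of offering `subset`.

    Assumption: choosing item i yields a fixed, known reward rewards[i],
    and the no-purchase outcome yields zero.
    """
    probs = mnl_choice_probabilities(utilities, subset)
    return sum(rewards[i] * probs[i] for i in subset)


# Toy instance: N = 3 items with d = 2 features, offering K = 2 of them.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.0, 0.0]  # hypothetical parameter; all utilities become 0
utilities = linear_utilities(features, theta)
probs = mnl_choice_probabilities(utilities, subset=(0, 1))
```

The sketch shows where the substitution property comes from: adding an item to the offered subset enlarges the shared denominator and so lowers the choice probability of every other offered item. It also shows why the linear structure helps: only the $d$ entries of `theta` are unknown, rather than one utility per item.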
