Low-rank Matrix Bandits with Heavy-tailed Rewards

In the stochastic low-rank matrix bandit problem, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown $d_1$ by $d_2$ low-rank parameter matrix with rank $r \ll \min\{d_1, d_2\}$. While all prior studies assume the payoffs are perturbed by sub-Gaussian noise, in this work we loosen this strict assumption and consider the new problem of \underline{low}-rank matrix bandits with \underline{h}eavy-\underline{t}ailed \underline{r}ewards (LowHTR), where the rewards only have a finite $(1+\delta)$-th moment for some $\delta \in (0,1]$. By utilizing truncation of the observed payoffs together with dynamic exploration, we propose a novel algorithm called LOTUS attaining a regret bound of order $\tilde{O}(T^{\frac{1}{1+\delta}})$ in the horizon $T$ without knowing $T$ in advance, which matches the state-of-the-art $\tilde{O}(\sqrt{T})$ rate under sub-Gaussian noise~\citep{lu2021low,kang2022efficient} when $\delta = 1$. Moreover, we establish a lower bound of order $\Omega(T^{\frac{1}{1+\delta}})$ for LowHTR, which indicates that LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it no longer requires knowledge of the rank $r$ while still enjoying a sublinear regret guarantee, and it remains efficient in the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.
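The core device named in the abstract, truncating observed payoffs so that only a finite $(1+\delta)$-th moment is required, can be illustrated with a minimal sketch. The snippet below is not the LOTUS algorithm itself (which combines truncation with low-rank matrix estimation and dynamic exploration); it only shows the generic truncated-payoff idea for heavy-tailed rewards. The function name `truncate_rewards`, the moment-bound constant `nu`, and the exact threshold schedule are illustrative assumptions, not the paper's constants.

```python
import numpy as np


def truncate_rewards(rewards, t, delta=0.5, nu=1.0):
    """Clip heavy-tailed payoffs at a time-growing threshold b_t.

    Standard truncation device for heavy-tailed bandits: observations
    whose magnitude exceeds b_t are zeroed out, so the empirical mean
    stays concentrated even when only the (1+delta)-th moment is finite.
    `nu` (an assumed bound on that moment) and the schedule for b_t are
    illustrative choices.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Threshold grows with the round index t, trading truncation bias
    # against the variance contributed by rare extreme observations.
    b_t = (nu * t / np.log(2.0 * t)) ** (1.0 / (1.0 + delta))
    return np.where(np.abs(rewards) <= b_t, rewards, 0.0)


# Toy usage: symmetric Pareto-style noise around a true mean reward of 0.3.
rng = np.random.default_rng(0)
noise = rng.pareto(1.8, size=2000) - rng.pareto(1.8, size=2000)
raw = 0.3 + noise
clipped = truncate_rewards(raw, t=2000, delta=0.5)
# Truncation damps the rare extreme observations in the raw payoffs.
print(f"raw mean: {raw.mean():.3f}, truncated mean: {clipped.mean():.3f}")
```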