Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

Neural Information Processing Systems (NeurIPS), 2024
Main: 10 pages · Bibliography: 3 pages · 2 tables · Appendix: 16 pages
Abstract

We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its benefits, introducing non-linear function approximation raises significant challenges in both computational and statistical efficiency. The best-known method of Hwang and Oh [2023] achieves an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature space dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as the linear cases, the method requires storing all historical data and suffers from an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small, leading to a significant gap in the regret compared to the linear cases. In this work, we first address the computational concerns by proposing an online algorithm that achieves the same regret with only $\mathcal{O}(1)$ computation cost. We then design two algorithms that leverage local information to enhance statistical efficiency. They not only maintain an $\mathcal{O}(1)$ computation cost per episode but also achieve improved regrets of $\widetilde{\mathcal{O}}(\kappa^{-1/2}dH^2\sqrt{K})$ and $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, respectively. Finally, we establish a lower bound, justifying the optimality of our results in $d$ and $K$. To the best of our knowledge, this is the first work to achieve almost the same computational and statistical efficiency as linear function approximation while employing non-linear function approximation for reinforcement learning.
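To make the setting concrete, the MNL parametrization referred to above models each transition distribution as a softmax over feature vectors, which is automatically a valid probability distribution. The sketch below is a minimal illustration, assuming the standard form $P(s' \mid s, a) \propto \exp(\phi(s, a, s')^\top \theta)$; the feature values, dimension, and parameter here are hypothetical, not taken from the paper.

```python
import numpy as np

def mnl_transition_probs(features: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """MNL transition model: P(s'|s,a) is proportional to exp(phi(s,a,s')^T theta).

    `features` has shape (num_next_states, d), one row per candidate next
    state s'; `theta` is the d-dimensional parameter to be estimated.
    """
    logits = features @ theta
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalizes to a valid distribution

# Hypothetical example: d = 3 features, 4 candidate next states.
rng = np.random.default_rng(0)
phi = rng.standard_normal((4, 3))
theta = np.array([0.5, -0.2, 1.0])
p = mnl_transition_probs(phi, theta)
```

Because of the softmax normalization, `p` sums to one and every entry is positive, in contrast to linear function approximation, where the modeled "probabilities" need not form a valid distribution.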
