
Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Abstract

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previous work established an upper bound for RMMDPs for $M=2$. In this work, we resolve several open questions that remained for the RMMDP model. For an arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments based approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, supporting that super-polynomial sample complexity in $M$ is necessary.
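To make the interaction protocol concrete, here is a minimal sketch of episode generation in an RMMDP as described above: a latent reward model is drawn once per episode and is never revealed, transitions are shared across contexts, and rewards have a $Z$-point support. This illustrates only the generative model, not the $\texttt{EM}^2$ algorithm; all variable names and the specific dimensions are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: S states, A actions, horizon H, M latent reward models, Z support points.
S, A, H, M, Z = 5, 3, 10, 2, 4
rng = np.random.default_rng(0)

# Shared transition kernel: P[s, a] is a distribution over next states (same for all contexts).
P = rng.dirichlet(np.ones(S), size=(S, A))
# Mixing weights over the M latent reward models.
w = rng.dirichlet(np.ones(M))
# Reward support points and, for each latent model, per-(s, a) distributions over them.
support = np.linspace(0.0, 1.0, Z)
R = rng.dirichlet(np.ones(Z), size=(M, S, A))

def run_episode(policy):
    """Sample one episode: the latent context m is drawn once and never observed by the agent."""
    m = rng.choice(M, p=w)          # latent reward model for the whole episode
    s, total = 0, 0.0
    for h in range(H):
        a = policy(s, h)
        total += rng.choice(support, p=R[m, s, a])  # reward drawn from the latent model
        s = rng.choice(S, p=P[s, a])                # transitions do not depend on m
    return total

# Example: Monte Carlo estimate of the expected H-step return of a uniformly random policy.
returns = [run_episode(lambda s, h: rng.integers(A)) for _ in range(1000)]
print(np.mean(returns))
```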
