Reward-Mixing MDPs with a Few Latent Contexts are Learnable

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previous work established an upper bound for RMMDPs with $M=2$. In this work, we resolve several open questions that remained for the general RMMDP setting. For an arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \mathrm{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments based approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^2$ for a general instance of RMMDP, supporting that super-polynomial sample complexity in $M$ is necessary.
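To make the interaction protocol concrete, here is a minimal sketch (not the paper's algorithm) of the RMMDP generative process described above, assuming a tabular MDP with $S$ states, $A$ actions, horizon $H$, $M$ latent reward models, and Bernoulli rewards; all variable names and the toy parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, M = 5, 3, 10, 2                       # states, actions, horizon, latent contexts

P = rng.dirichlet(np.ones(S), size=(S, A))      # shared transition kernel: P[s, a] is a dist over next states
mix = rng.dirichlet(np.ones(M))                 # mixing weights over the M latent reward models
R = rng.random((M, S, A))                       # R[m, s, a] = mean reward under latent context m

def run_episode(policy):
    """Roll out one episode: nature draws a latent context once, then the agent
    interacts for H steps; the context itself is never observed by the agent."""
    m = rng.choice(M, p=mix)                    # latent reward model, fixed for the whole episode
    s, total = rng.integers(S), 0.0
    for _ in range(H):
        a = policy(s)
        total += rng.binomial(1, R[m, s, a])    # reward drawn from the hidden context's model
        s = rng.choice(S, p=P[s, a])            # transitions do not depend on the latent context
    return total

uniform_policy = lambda s: rng.integers(A)
returns = [run_episode(uniform_policy) for _ in range(1000)]
print("average H-step return:", np.mean(returns))
```

The key feature this sketch highlights is that only the reward model, not the transition kernel, varies with the latent context, and the context is resampled independently at each episode, which is what makes moment-based estimation over many episodes natural for this setting.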