
Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs

Main: 23 pages
Bibliography: 1 page
Appendix: 32 pages
Abstract

We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structure in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $\mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^{3}K})$; both improve on prior-independent Thompson sampling once $K \gtrsim \tilde{O}(H^{2})$ and $K \gtrsim \tilde{O}(N^{2}H^{2})$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL/$\text{MTSRL}^{+}$ track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
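For intuition, the sketch below illustrates the kind of computation the abstract describes: per-task parameter estimates are aggregated into a Gaussian meta-prior (OLS aggregation plus a widening term on the covariance), and the learned prior then drives a Bayesian-linear-regression posterior update and a Thompson draw of the stage-h Q-parameters. This is a minimal illustrative sketch, not the paper's exact procedure; the function names, the widen knob, and the way regression targets are formed (e.g. RLSVI-style randomized backups) are assumptions introduced here.

import numpy as np

def meta_prior_from_tasks(theta_hats, widen=0.0):
    # Aggregate per-task estimates (shape (K, d)) for one stage h into a
    # Gaussian meta-prior N(theta*, Sigma*). Requires K >= 2 tasks.
    theta_hats = np.asarray(theta_hats)
    mean = theta_hats.mean(axis=0)                   # OLS-style aggregation of task estimates
    cov = np.cov(theta_hats, rowvar=False)           # empirical covariance across tasks
    cov = cov + widen * np.eye(theta_hats.shape[1])  # prior widening (hypothetical knob, MTSRL+-style)
    return mean, cov

def thompson_sample_theta(prior_mean, prior_cov, Phi, targets, noise_var=1.0, rng=None):
    # Bayesian linear regression under the learned prior, then one Thompson draw.
    # Phi: (n, d) features of visited (s, a) pairs; targets: (n,) regression targets.
    rng = np.random.default_rng() if rng is None else rng
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = prior_prec + (Phi.T @ Phi) / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ prior_mean + (Phi.T @ targets) / noise_var)
    return rng.multivariate_normal(post_mean, post_cov)

With a sampled theta_tilde in hand, the agent would act greedily with respect to the randomized value estimate Phi_h(s, a) @ theta_tilde at stage h; MTSRL corresponds to using the learned mean with a known covariance, while the widening term above mimics the extra slack $\text{MTSRL}^{+}$ adds when the covariance itself is estimated.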
