SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning

Abstract

Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift (DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel Shifts-aware Reward (SAR) through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate SAR for policy optimization. Experiments show that SAR effectively mitigates DS, and SAMBO-RL achieves superior or comparable performance across various benchmarks, underscoring its effectiveness and validating our theoretical analysis.
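The abstract states that SAMBO-RL trains classifiers to approximate the shifts-aware reward, but it does not spell out the reward formula. The sketch below is only a minimal illustration of the general classifier-based construction: a discriminator is trained to separate real dataset transitions from model-generated rollouts, and its logit (an estimate of the log density ratio between real and model transitions) is used to adjust the vanilla reward. The module names, the additive penalty form, and the coefficient beta are assumptions made for illustration, not the paper's exact SAR.

import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    # Binary classifier over (s, a, s') tuples: real-data vs. model-generated.
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        # Returns a logit; sigmoid(logit) is the probability the transition is real.
        return self.net(torch.cat([s, a, s_next], dim=-1))

def classifier_loss(clf, real_batch, model_batch):
    # Standard binary cross-entropy: real transitions -> 1, model rollouts -> 0.
    bce = nn.BCEWithLogitsLoss()
    real_logits = clf(*real_batch)
    fake_logits = clf(*model_batch)
    return (bce(real_logits, torch.ones_like(real_logits))
            + bce(fake_logits, torch.zeros_like(fake_logits)))

def shifts_aware_reward(r, clf, s, a, s_next, beta=1.0):
    # Adjust the vanilla reward with the classifier's log-density-ratio estimate.
    # The logit approximates log p_real / p_model, so adding beta * logit
    # penalizes transitions that look model-generated (logit < 0) and leaves
    # realistic ones roughly unchanged (illustrative form, not the paper's SAR).
    with torch.no_grad():
        logit = clf(s, a, s_next).squeeze(-1)
    return r + beta * logit

In this construction the adjusted reward would replace the vanilla model reward when training the policy (e.g., with SAC or another off-policy learner) on a mixture of real and synthetic transitions; the paper's actual objective may differ.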

@article{luo2025_2408.12830,
  title={SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning},
  author={Wang Luo and Haoran Li and Zicheng Zhang and Congying Han and Jiayu Lv and Tiande Guo},
  journal={arXiv preprint arXiv:2408.12830},
  year={2025}
}