
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Abstract

Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks. Code: this https URL.
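
The three stages described in the abstract (skill extraction, pseudo-labeling, high-level off-policy RL) can be made concrete with a minimal sketch. Everything below is illustrative and hypothetical: the encoder, the novelty bonus, and all names (encode_skill, optimistic_reward, relabel, H, LATENT_DIM) are assumptions for exposition, not the paper's implementation; see the linked code for the actual method.

import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: extract low-level skills from unlabeled trajectories ---
# Stand-in for a trajectory VAE: encode an H-step action segment into a
# skill latent z. A real implementation trains an encoder q(z | segment)
# jointly with a decoder that reconstructs actions from (state, z).
H, LATENT_DIM = 10, 4

def encode_skill(segment_actions):
    # Hypothetical encoder: here, just a fixed projection of the segment.
    return segment_actions.mean(axis=0)[:LATENT_DIM]

# --- Stage 2: pseudo-label prior data with optimistic rewards + skills ---
def optimistic_reward(state, rnd_bonus):
    # Optimistic pseudo-reward; a novelty bonus such as RND is one
    # plausible choice (an assumption here, not specified by the abstract).
    return rnd_bonus(state)

def relabel(trajectory, rnd_bonus):
    # Turn an unlabeled trajectory into high-level (s, z, r, s') tuples,
    # where z is the skill label for each H-step chunk.
    transitions = []
    for t in range(0, len(trajectory["obs"]) - H, H):
        s = trajectory["obs"][t]
        z = encode_skill(trajectory["actions"][t:t + H])
        r = optimistic_reward(s, rnd_bonus)
        s_next = trajectory["obs"][t + H]
        transitions.append((s, z, r, s_next))
    return transitions

# --- Stage 3: online RL over skills, seeded with relabeled data ---
# A high-level policy pi(z | s) would be trained off-policy (e.g. with SAC)
# on a replay buffer initialized with these transitions plus new
# online experience; the pretrained decoder executes each chosen skill.
offline_traj = {"obs": rng.normal(size=(100, 6)),
                "actions": rng.normal(size=(100, 4))}
buffer = relabel(offline_traj, rnd_bonus=lambda s: float(np.abs(s).sum()))
print(f"{len(buffer)} high-level transitions seeded into the replay buffer")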

@article{wilcoxson2025_2410.18076,
  title={Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration},
  author={Max Wilcoxson and Qiyang Li and Kevin Frans and Sergey Levine},
  journal={arXiv preprint arXiv:2410.18076},
  year={2025}
}