Generalizable Imitation Learning Through Pre-Trained Representations

In this paper, we leverage self-supervised Vision Transformer models and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce DVK, an imitation learning algorithm that uses rich pre-trained Vision Transformer patch-level embeddings to achieve better generalization when learning from demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in imitation learning, we release all of our code for the method and evaluation, as well as the dataset.
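The abstract describes clustering patch-level Vision Transformer embeddings into semantically meaningful groups and deriving stable keypoints from those groups. The sketch below illustrates one plausible realization of that idea; the helper name `extract_keypoints`, the 14x14 patch grid, the use of k-means, and the random stand-in features are illustrative assumptions rather than the authors' exact pipeline, in which the features would come from a pre-trained self-supervised ViT and the resulting keypoints would feed the imitation policy in place of raw pixels.

```python
# Illustrative sketch only: cluster per-patch ViT embeddings into semantic
# groups and reduce each group to a 2D keypoint (its spatial centroid).
# K-means, the 14x14 patch grid, and the random "embeddings" stand in for the
# paper's actual pre-trained ViT features and clustering step.
import numpy as np
from sklearn.cluster import KMeans


def extract_keypoints(patch_embeddings: np.ndarray, grid_size: int, n_keypoints: int) -> np.ndarray:
    """Cluster patch embeddings and return one (row, col) keypoint per cluster.

    patch_embeddings: (grid_size * grid_size, dim) array of per-patch features.
    Returns an (n_keypoints, 2) array of keypoint coordinates on the patch grid.
    """
    labels = KMeans(n_clusters=n_keypoints, n_init=10, random_state=0).fit_predict(patch_embeddings)

    # Patch-grid coordinates for every embedding, in row-major order.
    rows, cols = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)

    # Keypoint = mean grid location of the patches assigned to each cluster.
    return np.stack([coords[labels == k].mean(axis=0) for k in range(n_keypoints)])


if __name__ == "__main__":
    grid, dim = 14, 384                                   # e.g. a ViT-S/16 on a 224x224 image
    fake_embeddings = np.random.randn(grid * grid, dim)   # stand-in for ViT patch features
    keypoints = extract_keypoints(fake_embeddings, grid_size=grid, n_keypoints=8)
    print(keypoints)                                      # 8 (row, col) keypoints on the patch grid
```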
@article{chang2025_2311.09350,
  title   = {Generalizable Imitation Learning Through Pre-Trained Representations},
  author  = {Wei-Di Chang and Francois Hogan and Scott Fujimoto and David Meger and Gregory Dudek},
  journal = {arXiv preprint arXiv:2311.09350},
  year    = {2025}
}