
Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games

Abstract

Video games have served as useful benchmarks for the decision-making community, but moving beyond Atari games towards modern games has been prohibitively expensive for the vast majority of researchers. Prior work in modern video games typically relied on game-specific integration to obtain game features and enable online training, or on existing large datasets. An alternative approach is to train agents using imitation learning to play video games purely from images. However, this setting poses a fundamental question: which visual encoders obtain representations that retain information critical for decision making? To answer this question, we conduct a systematic study of imitation learning with publicly available pre-trained visual encoders compared to the typical task-specific end-to-end training approach in Minecraft, Counter-Strike: Global Offensive, and Minecraft Dungeons. Our results show that end-to-end training can be effective with comparatively low-resolution images and only minutes of demonstrations, but, depending on the game, significant improvements can be gained by utilising pre-trained encoders such as DINOv2. In addition to enabling effective decision making, we show that pre-trained encoders can make decision-making research in video games more accessible by significantly reducing the cost of training.
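The following is a minimal sketch, not the paper's exact pipeline, of the kind of setup the abstract describes: behavioural cloning from images with a frozen, publicly available DINOv2 encoder and a small trainable action head. The action-space size, image resolution, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Load a pre-trained DINOv2 ViT-S/14 encoder from torch.hub and freeze it.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval().requires_grad_(False)

NUM_ACTIONS = 12  # hypothetical discrete action space for the game

# Lightweight policy head mapping frozen features to action logits.
policy_head = nn.Sequential(
    nn.Linear(384, 256),  # 384 = ViT-S/14 CLS embedding size
    nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(policy_head.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_step(images: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavioural-cloning update on a batch of demonstration frames.

    images: (B, 3, 224, 224) tensor, ImageNet-normalised.
    expert_actions: (B,) tensor of discrete action indices from the demos.
    """
    with torch.no_grad():
        features = encoder(images)      # (B, 384) frozen representations
    logits = policy_head(features)      # (B, NUM_ACTIONS)
    loss = loss_fn(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the small head is optimised, this kind of setup trains far more cheaply than end-to-end encoders, which is the accessibility benefit the abstract highlights; swapping the frozen encoder for a trainable CNN recovers the end-to-end baseline being compared against.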

@article{schäfer2025_2312.02312,
  title={Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games},
  author={Lukas Schäfer and Logan Jones and Anssi Kanervisto and Yuhan Cao and Tabish Rashid and Raluca Georgescu and Dave Bignell and Siddhartha Sen and Andrea Treviño Gavito and Sam Devlin},
  journal={arXiv preprint arXiv:2312.02312},
  year={2025}
}