Constructing AI models that respond to text instructions is challenging,
especially for sequential decision-making tasks. This work introduces a
methodology, inspired by unCLIP, for instruction-tuning generative models of
behavior without relying on a large dataset of instruction-labeled
trajectories. Using this methodology, we create an instruction-tuned Video
Pretraining (VPT) model called STEVE-1, which can follow short-horizon
open-ended text and visual instructions in Minecraft. STEVE-1 is trained in two
steps: adapting the pretrained VPT model to follow commands in MineCLIP's
latent space, then training a prior to predict latent codes from text. This
allows us to finetune VPT through self-supervised behavioral cloning and
hindsight relabeling, reducing the need for costly human text annotations, and
all for only $60 of compute. By leveraging pretrained models like VPT and MineCLIP and by employing best practices from text-conditioned image generation, STEVE-1 sets a new bar for open-ended instruction-following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines and robustly completing 12 of 13 tasks in our early-game evaluation suite. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
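Of the factors above, classifier-free guidance is the most mechanical: at inference time, the policy's conditional and unconditional predictions are mixed so that the output is pushed toward the instruction embedding. The sketch below is a minimal, hypothetical illustration of that mixing rule applied to action logits; the function name and array shapes are assumptions, not STEVE-1's actual implementation.

```python
import numpy as np

def classifier_free_guidance(cond_logits, uncond_logits, scale):
    """Mix conditional and unconditional action logits (hypothetical sketch).

    scale = 0 recovers the unconditional policy, scale = 1 the purely
    conditional one; larger values extrapolate further toward the
    instruction-conditioned prediction.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Example: logits over a toy 3-action space.
cond = np.array([1.0, 2.0, 3.0])     # conditioned on an instruction embedding
uncond = np.array([0.5, 0.5, 0.5])   # instruction embedding dropped/masked
guided = classifier_free_guidance(cond, uncond, scale=2.0)
```

The same interpolate-then-extrapolate rule appears in text-conditioned image generation, which is where the abstract's "best practices" reference points.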