STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

1 June 2023
Shalev Lifshitz
Keiran Paster
Harris Chan
Jimmy Ba
Sheila A. McIlraith
Abstract

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces a methodology, inspired by unCLIP, for instruction-tuning generative models of behavior without relying on a large dataset of instruction-labeled trajectories. Using this methodology, we create an instruction-tuned Video Pretraining (VPT) model called STEVE-1, which can follow short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, reducing the need for costly human text annotations, and all for only $60 of compute. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 sets a new bar for open-ended instruction-following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines and robustly completing 12 of 13 tasks in our early-game evaluation suite. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
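The abstract describes an unCLIP-style two-stage pipeline: a learned prior maps a MineCLIP text embedding to a visual goal latent, and the finetuned VPT policy is conditioned on that latent, with classifier-free guidance applied at inference time. The sketch below is a minimal illustration of that flow, not the authors' released code; the function names (embed_text, prior_sample, policy_logits), the action-space size, and the guidance scale are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins for the real components; names are illustrative only.

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder for MineCLIP's text encoder (returns a latent vector)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

def prior_sample(text_latent: np.ndarray) -> np.ndarray:
    """Placeholder for the learned prior that translates a text embedding
    into a MineCLIP *visual* latent (the unCLIP-style step)."""
    return text_latent  # identity stand-in

def policy_logits(obs, goal_latent) -> np.ndarray:
    """Placeholder for the finetuned VPT policy conditioned on a goal latent.
    Passing goal_latent=None represents the unconditional (prompt-dropped) branch."""
    rng = np.random.default_rng(0 if goal_latent is None else 1)
    return rng.standard_normal(8641)  # illustrative joint action-space size

def guided_action(obs, prompt: str, guidance_scale: float = 6.0) -> int:
    """Classifier-free guidance: push conditional logits away from the
    unconditional ones by a fixed scale, then pick the argmax action."""
    goal = prior_sample(embed_text(prompt))
    cond = policy_logits(obs, goal)
    uncond = policy_logits(obs, None)
    guided = uncond + guidance_scale * (cond - uncond)
    return int(np.argmax(guided))

action = guided_action(obs=None, prompt="chop a tree")
```

At training time, the same goal-conditioned policy can be supervised with hindsight-relabeled targets (embeddings of frames the agent actually reached later in each trajectory), which is what lets STEVE-1 avoid costly human text annotations.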
