A $60 Instructable Generative AI Model for Minecraft That Follows Both Text and Visual Instructions

Powerful AI models can now be operated through natural-language commands, which makes them widely accessible and adaptable. Examples include Stable Diffusion, which turns natural language into images, and ChatGPT, which replies to natural-language messages and carries out a variety of tasks. Training such models can cost anywhere from tens of thousands to millions of dollars, but an equally exciting development is that strong open-source foundation models, such as LLaMA, can be fine-tuned with surprisingly little compute and data to become instruction-following.

In this work, researchers from the University of Toronto and the Vector Institute for Artificial Intelligence investigate whether the same strategy is viable in sequential decision-making domains. Unlike in the text and image domains, diverse data for sequential decision-making is very costly and usually lacks an easy-to-use “instruction” label comparable to a caption for an image. Building on recent progress in instruction-tuned LLMs such as Alpaca, they propose fine-tuning pretrained generative behavior models with instruction data. Two foundation models for the popular open-ended video game Minecraft have been released in the last year: MineCLIP, a model that aligns text with video clips, and VPT, a behavior model.

This has created a fascinating opportunity to study instruction-following fine-tuning in Minecraft’s sequential decision-making domain. Because VPT was trained on 70k hours of Minecraft gameplay, the agent has extensive knowledge of the Minecraft world. Just as the enormous potential of LLMs was unlocked by aligning them to follow instructions, the VPT model may gain broad, controllable behavior if it is fine-tuned to follow instructions. Specifically, the researchers show how to fine-tune VPT to follow short-horizon text instructions using just $60 of compute and around 2,000 instruction-labeled trajectory segments.

Their methodology is inspired by unCLIP, the approach used to build the well-known text-to-image model DALL·E 2. They decompose the problem of building an instruction-following Minecraft agent into two parts: a VPT model fine-tuned to achieve visual goals embedded in the MineCLIP latent space, and a prior model that translates text instructions into MineCLIP visual embeddings. To avoid costly text-instruction labels, they fine-tune VPT through behavioral cloning on self-supervised data generated by hindsight relabeling, using MineCLIP visual embeddings as the goals.
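To make hindsight relabeling concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function name, the segment-length parameters, and the toy one-dimensional "embeddings" are all assumptions chosen for illustration. The idea it shows is that no human writes a label. Instead, whatever the agent ended up reaching at the end of a segment is treated, in hindsight, as the goal for every step that led there.

```python
import random


def hindsight_relabel(embeddings, actions, min_len=2, max_len=5, seed=0):
    """Return (timestep, goal_embedding, action) triples for one trajectory.

    The trajectory is cut into consecutive segments of random length, and the
    embedding observed at the last step of each segment becomes the goal for
    every step in that segment. All names and defaults here are hypothetical.
    """
    if len(embeddings) != len(actions):
        raise ValueError("need one embedding per action")
    rng = random.Random(seed)
    labeled = []
    start = 0
    while start < len(actions):
        # Pick where this segment ends; the final segment may be cut short.
        seg_len = rng.randint(min_len, max_len)
        end = min(start + seg_len, len(actions)) - 1
        goal = embeddings[end]  # what the agent actually reached
        for t in range(start, end + 1):
            labeled.append((t, goal, actions[t]))
        start = end + 1
    return labeled


# Toy trajectory: 1-D lists stand in for MineCLIP visual embeddings.
toy_embeddings = [[float(i)] for i in range(10)]
toy_actions = [f"action_{i}" for i in range(10)]
dataset = hindsight_relabel(toy_embeddings, toy_actions)
```

Each triple in `dataset` is one behavioral-cloning example: given the observation at that timestep and the goal embedding, predict the recorded action.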

They combine unCLIP with classifier-free guidance to build their agent, dubbed STEVE-1, which considerably outperforms the baseline set by Baker et al. for open-ended instruction following in Minecraft using low-level controls (mouse and keyboard) and raw pixel inputs.
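Classifier-free guidance is commonly written as a blend of two predictions from the same model: one conditioned on the goal and one with the goal dropped. The sketch below applies that common form to action logits. The logit values, the three-action setup, and the function names are assumptions for illustration, and the article does not state the exact variant or guidance scale that STEVE-1 uses.

```python
import math


def guided_logits(cond, uncond, scale):
    """Push logits away from the unconditional prediction.

    scale = 0 returns the conditional logits unchanged; larger values
    exaggerate whatever the goal conditioning changed.
    """
    return [(1 + scale) * c - scale * u for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over three actions.
cond_logits = [2.0, 1.0, 0.0]    # policy given the goal embedding
uncond_logits = [1.0, 1.0, 1.0]  # policy with the goal dropped
plain = softmax(cond_logits)
guided = softmax(guided_logits(cond_logits, uncond_logits, scale=3.0))
```

In this toy case the goal raises the first action's logit relative to the unconditional policy, so guidance concentrates more probability on that action than the conditional policy alone does.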

The following are their primary contributions: 

• They develop STEVE-1, a Minecraft agent that follows open-ended text and visual instructions with high accuracy. They analyze the agent in depth, demonstrating that it can carry out a variety of short-horizon tasks in Minecraft. They also show that simple prompt chaining can significantly boost performance on longer-horizon tasks such as building and crafting.

• They explain how to build STEVE-1 with just $60 of compute, demonstrating that unCLIP and classifier-free guidance are crucial for strong performance in sequential decision-making.

• They release the STEVE-1 model weights, evaluation scripts, and training scripts to encourage future research on instructable, open-ended sequential decision-making agents.
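The prompt chaining mentioned in the first contribution can be pictured as running the same two-stage agent under a sequence of instructions. The sketch below assumes the instruction is switched after a fixed number of steps, which the article does not specify. The prior, policy, and environment are toy stand-ins with hypothetical names, not the STEVE-1 components or API.

```python
def run_prompt_chain(prior, policy, env_step, first_obs, prompts, steps_per_prompt):
    """Run one prompt after another, switching after a fixed step budget."""
    obs = first_obs
    log = []
    for prompt in prompts:
        goal = prior(prompt)  # text instruction -> goal embedding
        for _ in range(steps_per_prompt):
            action = policy(obs, goal)  # goal-conditioned action choice
            obs = env_step(obs, action)
            log.append((prompt, action))
    return log


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_prior(prompt):
    return float(len(prompt))  # pretend "embedding" is a single number


def toy_policy(obs, goal):
    return "forward" if obs < goal else "wait"


def toy_env_step(obs, action):
    return obs + 1.0 if action == "forward" else obs


log = run_prompt_chain(
    toy_prior, toy_policy, toy_env_step,
    first_obs=0.0,
    prompts=["chop a tree", "craft planks"],
    steps_per_prompt=3,
)
```

The point of the structure is that a longer task is never given to the agent as one instruction. Each short-horizon prompt gets its own goal embedding, and the environment state carries over from one prompt to the next.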

The project website has video demos of the agent playing the game.