An Innovative Prompting Framework for LLMs Called SPRING Is Designed to Facilitate Chain-of-Thought Planning and Reasoning in Context.

SPRING is an LLM-based policy that outperforms reinforcement learning (RL) algorithms in an interactive environment requiring multi-task planning and reasoning.

A group of researchers from Carnegie Mellon University, NVIDIA, Ariel University, and Microsoft has investigated the use of Large Language Models (LLMs) for understanding and reasoning with human knowledge in the context of games. They propose a two-stage approach called SPRING: an LLM first studies the academic paper that describes the game, then uses a Question-Answer (QA) framework to reason over the knowledge it obtained.

More details about SPRING

In the first stage, the authors gave an LLM the LaTeX source of the original paper on the game environment by Hafner (2021) and used it to extract pertinent information, including the game mechanics and the desirable behaviors described in the paper. The second stage concentrated on playing the game using in-context chain-of-thought reasoning. As a reasoning module, they built a directed acyclic graph (DAG) in which the questions are the nodes and the dependencies between questions are the edges. For instance, within the DAG, the question “What are the top 5 actions?” is linked to the question “For each action, are the requirements met?”, so that the latter question depends on the answer to the former.

The LLM's answer for each node (question) is computed by traversing the DAG in topological order, so every question is asked only after the questions it depends on have been answered. The final node in the DAG is the question about the best action to take, and the LLM’s answer to it is translated directly into an action in the environment.
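The traversal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: only the first two questions are quoted from the description above, while the third question's wording, the action names, the prompt layout, and the `stub_llm` function (a stand-in for a real LLM call) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each question maps to the questions whose answers it depends on.
# The first two questions are quoted above; the third is illustrative wording.
FINAL_QUESTION = "What is the best action to take?"
QUESTION_DEPS = {
    "What are the top 5 actions?": [],
    "For each action, are the requirements met?": ["What are the top 5 actions?"],
    FINAL_QUESTION: ["For each action, are the requirements met?"],
}

# Hypothetical action names, used only for this sketch.
ACTIONS = ["noop", "collect", "place_table"]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers."""
    last_question = prompt.rsplit("Q: ", 1)[-1]
    return "place_table" if "best action" in last_question else "yes"


def answer_questions(deps, context, llm):
    """Answer every question in topological order of the dependency DAG."""
    answers = {}
    for question in TopologicalSorter(deps).static_order():
        # Parents are guaranteed to be answered before their dependents.
        prior = "\n".join(f"Q: {p}\nA: {answers[p]}" for p in deps[question])
        prompt = f"{context}\n{prior}\nQ: {question}\nA:"
        answers[question] = llm(prompt)
    return answers


def to_action(answer: str, actions, default: str = "noop") -> str:
    """Map the final free-text answer to the first action name it mentions."""
    for action in actions:
        if action in answer:
            return action
    return default


answers = answer_questions(QUESTION_DEPS, "You see a tree.", stub_llm)
chosen = to_action(answers[FINAL_QUESTION], ACTIONS)
```

Because each prompt contains only the answers of a question's parents, the graph structure controls what context the model sees at every step; swapping `stub_llm` for a real model call would leave the traversal unchanged.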

Experiments and Results

The Crafter Environment, introduced by Hafner (2021), is an open-world survival game with 22 achievements organized in a tech tree of depth 7. The game is represented as a grid world with top-down observations and a discrete action space consisting of 17 options. The observations also provide information about the player’s current inventory state, including health points, food, water, rest levels, and inventory items.

The authors compared SPRING and popular RL methods on the Crafter benchmark. Subsequently, experiments and analysis were carried out on different components of their architecture to examine the impact of each part on the in-context “reasoning” abilities of the LLM.


SPRING was run with GPT-4, conditioned on the environment paper by Hafner (2021), and compared against various RL baselines. It surpasses previous state-of-the-art (SOTA) methods by a significant margin, achieving an 88% relative improvement in game score and a 5% improvement in reward over the best-performing RL method by Hafner et al. (2023).
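A relative improvement is measured against the baseline's own value, not in percentage points. The short sketch below shows the arithmetic; the two numbers passed in are placeholders chosen only to produce an 88% relative gain and are not figures reported in the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return (new - baseline) / baseline, e.g. 0.88 for an 88% relative gain."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Placeholder values, NOT results from the paper: a baseline of 10.0 and
# a new value of 18.8 correspond to an 88% relative improvement.
gain = relative_improvement(18.8, 10.0)
gain_text = f"{gain:.0%}"
```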

Notably, SPRING leverages prior knowledge from reading the paper and requires zero training steps, while RL methods typically necessitate millions of training steps.


The above figure plots unlock rates for different tasks, comparing SPRING to popular RL baselines. SPRING, empowered by prior knowledge, outperforms RL methods by more than ten times on achievements such as “Make Stone Pickaxe,” “Make Stone Sword,” and “Collect Iron,” which are deeper in the tech tree (up to depth 5) and challenging to reach through random exploration.

Moreover, SPRING performs perfectly on achievements like “Eat Cow” and “Collect Drink.” In contrast, model-based RL frameworks like Dreamer-V3 have unlock rates more than five times lower for “Eat Cow” because moving cows are hard to reach through random exploration. Importantly, SPRING does not take the action “Place Stone” since the paper by Hafner (2021) does not describe it as beneficial for the agent, even though it could easily be achieved through random exploration.


One limitation of using an LLM for interacting with the environment is the need for object recognition and grounding. However, this limitation doesn’t exist in environments that provide accurate object information, such as contemporary games and virtual reality worlds. While pre-trained visual backbones struggle with games, they perform reasonably well in real-world-like environments. Recent advancements in visual-language models indicate potential for reliable solutions in visual-language understanding in the future.


In summary, the SPRING framework showcases the potential of Large Language Models (LLMs) for game understanding and reasoning. By leveraging prior knowledge from academic papers and employing in-context chain-of-thought reasoning, SPRING outperforms previous state-of-the-art methods on the Crafter benchmark, achieving substantial improvements in game score and reward. The results highlight the power of LLMs in complex game tasks and suggest that future advancements in visual-language models could address existing limitations, paving the way for reliable and generalizable solutions.