Artificial Intelligence is rapidly gaining popularity, and for good reason. With the introduction of Large Language Models (LLMs) like GPT, BERT, and LLaMA, almost every industry, including healthcare, finance, e-commerce, and media, is using these models for tasks such as Natural Language Understanding (NLU), Natural Language Generation (NLG), question answering, programming, and information retrieval. The famous ChatGPT, which has been in the headlines ever since its release, is built on the transformer technology behind GPT-3.5 and GPT-4.
These artificial intelligence (AI) systems that mimic human behavior rely mainly on creating agents with human-like problem-solving abilities. The three main methods for building agents that can handle complex interactive reasoning tasks are: Deep Reinforcement Learning (RL), which trains agents through trial and error; Behavior Cloning (BC) via Sequence-to-Sequence (seq2seq) learning; and prompting LLMs, which uses generative agents built on prompted LLMs to produce reasonable plans and actions for complex problems.
RL-based and seq2seq-based BC techniques suffer from several drawbacks: difficulty with task decomposition, an inability to store long-term memory, poor generalization to new tasks, and weak exception handling. Because LLM inference is repeated at every time step, these previous approaches are also computationally expensive.
Recently, a framework called SWIFTSAGE has been proposed to address these challenges and enable agents to imitate how humans solve complex, open-world tasks. SWIFTSAGE aims to integrate the strengths of behavior cloning and prompting LLMs to enhance task completion performance in complex interactive tasks. The framework draws inspiration from dual process theory, which suggests that human cognition involves two distinct systems: System 1 and System 2. System 1 involves rapid, intuitive, and automatic thinking, while System 2 entails methodical, analytical, and deliberate thought processes.
The SWIFTSAGE framework consists of two modules – the SWIFT module and the SAGE module. Similar to System 1, the SWIFT module represents quick and intuitive thinking. It is implemented as a compact encoder-decoder language model that has been fine-tuned on the action trajectories of an oracle agent. The SWIFT module encodes short-term memory components like previous actions, observations, visited locations, and the current environment state, followed by decoding the next individual action, thus aiming to simulate the rapid and instinctive decision-making process shown by humans.
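To make the SWIFT module's role concrete, here is a minimal sketch of how its short-term memory might be serialized into a single input for a compact encoder-decoder model that then decodes the next action. The field names and the serialization format below are illustrative assumptions, not the paper's exact encoding:

```python
from dataclasses import dataclass, field

@dataclass
class ShortTermMemory:
    """Illustrative short-term memory for a SWIFT-style agent (assumed structure)."""
    recent_actions: list = field(default_factory=list)
    recent_observations: list = field(default_factory=list)
    visited_locations: list = field(default_factory=list)
    current_state: str = ""

def encode_swift_input(mem: ShortTermMemory, window: int = 3) -> str:
    """Flatten the last `window` (action, observation) pairs, visited
    locations, and current state into one string. A fine-tuned compact
    seq2seq model (e.g., T5-style) would take this as input and decode
    the single next action."""
    steps = [
        f"action: {act} | observation: {obs}"
        for act, obs in zip(mem.recent_actions[-window:],
                            mem.recent_observations[-window:])
    ]
    return (
        f"state: {mem.current_state} ; "
        f"visited: {', '.join(mem.visited_locations)} ; "
        + " ; ".join(steps)
    )
```

For example, `encode_swift_input(ShortTermMemory(["open door"], ["the door is open"], ["hallway"], "kitchen"))` produces one flat string the small model can consume cheaply at every step, which is what makes the SWIFT path fast relative to prompting a large model.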
The SAGE module, on the other hand, imitates thought processes similar to System 2 and utilizes LLMs such as GPT-4 for subgoal planning and grounding. In the planning stage, LLMs are prompted to locate necessary items, plan, track subgoals, and detect and rectify potential mistakes, while in the grounding stage, LLMs are employed to transform the output subgoals derived from the planning stage into a sequence of executable actions.
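The two-stage prompting in the SAGE module could look roughly like the sketch below. The prompt templates are hypothetical paraphrases of the stages described above, and `llm` is any callable standing in for a GPT-4 API call:

```python
def build_planning_prompt(task: str, history: list) -> str:
    """Planning stage: ask the LLM to locate items, track subgoals,
    and flag mistakes. The wording is an assumption, not the paper's prompt."""
    return (
        f"Task: {task}\n"
        "History so far:\n" + "\n".join(history) + "\n"
        "Locate the items needed, list the remaining subgoals, "
        "and note any mistake in the history that should be corrected."
    )

def build_grounding_prompt(subgoals: list, valid_actions: list) -> str:
    """Grounding stage: turn the planned subgoals into executable actions
    drawn from the environment's action space."""
    return (
        "Subgoals:\n" + "\n".join(f"- {g}" for g in subgoals) + "\n"
        "Valid action templates: " + ", ".join(valid_actions) + "\n"
        "Output one executable action per line that achieves the subgoals."
    )

def sage_step(task: str, history: list, valid_actions: list, llm) -> list:
    """Run planning, then grounding; `llm` maps a prompt string to text.
    A real system would call GPT-4 here."""
    plan = llm(build_planning_prompt(task, history))
    actions_text = llm(build_grounding_prompt(plan.splitlines(), valid_actions))
    return [a.strip() for a in actions_text.splitlines() if a.strip()]
```

Note that one SAGE invocation returns a whole list of actions rather than a single step, which is what the action buffer in the next paragraph consumes.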
The SWIFT and SAGE modules have been integrated through a heuristic algorithm that determines when to activate or deactivate the SAGE module and how to combine the outputs of both modules using an action buffer mechanism. Unlike previous methods that generate only the immediate next action, SWIFTSAGE engages in longer-term action planning.
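One plausible shape for that control loop is sketched below: a buffer is filled by SAGE's multi-step plan and drained one action per step, with the cheap SWIFT policy as the default. The trigger heuristic here (fall back to SAGE when the last action failed or made no progress) is a simplified assumption; the actual algorithm uses more conditions:

```python
from collections import deque

class SwiftSageController:
    """Toy controller: prefer the fast SWIFT policy, invoke SAGE when the
    heuristic fires, and buffer SAGE's longer-term plan."""

    def __init__(self, swift_policy, sage_planner):
        self.swift_policy = swift_policy    # state -> single action
        self.sage_planner = sage_planner    # state -> list of actions
        self.buffer = deque()               # pending actions from SAGE

    def needs_sage(self, last_reward: float, last_action_failed: bool) -> bool:
        # Simplified trigger: consult SAGE when SWIFT appears stuck
        # (an invalid action or no reward progress).
        return last_action_failed or last_reward <= 0

    def next_action(self, state, last_reward=1.0, last_action_failed=False):
        if self.buffer:                      # drain SAGE's buffered plan first
            return self.buffer.popleft()
        if self.needs_sage(last_reward, last_action_failed):
            plan = self.sage_planner(state)  # longer-term plan from the LLM
            self.buffer.extend(plan)
            return self.buffer.popleft()
        return self.swift_policy(state)      # fast, intuitive default
```

Buffering is what lets the expensive SAGE call amortize over several environment steps instead of re-running LLM inference at every step.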
For evaluating the performance of SWIFTSAGE, experiments have been conducted on 30 tasks from the ScienceWorld benchmark. The results have shown that SWIFTSAGE significantly outperforms other existing methods, such as SayCan, ReAct, and Reflexion. It achieves higher scores and demonstrates superior effectiveness in solving complex real-world tasks.
In conclusion, SWIFTSAGE is a promising framework that combines the strengths of behavior cloning and prompting LLMs, making it a valuable approach for enhancing action planning and improving performance in complex reasoning tasks.