Microsoft and Columbia Researchers Propose LLM-AUGMENTER: An AI System that Augments a Black-Box LLM with a Set of Plug-and-Play Modules

Large language models (LLMs) like GPT-3 are widely recognized for their ability to generate coherent and informative natural language text thanks to the vast amount of world knowledge they absorb during training. However, encoding this knowledge in model parameters is lossy and can lead to memory distortion, producing hallucinations that are detrimental to mission-critical tasks. Moreover, because an LLM's knowledge is fixed at training time, it cannot encode all the information some applications need, which makes it unsuitable for time-sensitive tasks like news question answering. Various methods have been proposed to enhance LLMs with external knowledge, but these typically require fine-tuning LLM parameters, which can be prohibitively expensive. Consequently, there is a need for plug-and-play modules that can be added to a fixed LLM to improve its performance on mission-critical tasks.

The paper proposes LLM-AUGMENTER, a system that addresses the challenges of applying LLMs to mission-critical applications. It augments a black-box LLM with plug-and-play modules that ground its responses in external knowledge stored in task-specific databases, and it iteratively revises prompts using feedback generated by utility functions to improve the factuality score of the LLM's responses. The system's effectiveness is validated empirically in task-oriented dialog and open-domain question-answering scenarios, where it significantly reduces hallucinations without sacrificing the fluency and informativeness of the responses. The source code and models are publicly available.

The LLM-AUGMENTER process involves three main steps. First, given a user query, it retrieves evidence from external knowledge sources such as web search or task-specific databases; it can also link the retrieved raw evidence with relevant context and reason over the concatenation to form "evidence chains." Second, LLM-AUGMENTER prompts a fixed LLM such as ChatGPT with the consolidated evidence, so that the generated response is grounded in that evidence. Finally, LLM-AUGMENTER verifies the candidate response and generates a corresponding feedback message, which is used to revise the prompt; the query to ChatGPT is iterated until the candidate response passes verification, as sketched in the code below.
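Below is a minimal Python sketch of this retrieve-prompt-verify loop. All helper functions (retrieve_evidence, consolidate, call_llm, utility_score) are hypothetical stubs standing in for the paper's retrieval, evidence-consolidation, LLM, and utility modules; the prompt format, score threshold, and iteration cap are illustrative assumptions, not the authors' implementation.

```python
def retrieve_evidence(query: str) -> list[str]:
    """Stub: fetch raw evidence from web search or a task-specific database."""
    return [f"(evidence snippet retrieved for: {query})"]

def consolidate(query: str, evidence: list[str]) -> str:
    """Stub: link raw evidence with relevant context into an 'evidence chain'."""
    return "\n".join(evidence)

def call_llm(prompt: str) -> str:
    """Stub: query a fixed black-box LLM such as ChatGPT."""
    return "(candidate response)"

def utility_score(query: str, evidence: str, response: str) -> tuple[float, str]:
    """Stub: score a candidate response and produce a feedback message."""
    return 1.0, ""

def llm_augmenter(query: str, threshold: float = 0.8, max_iters: int = 3) -> str:
    # Step 1: retrieve raw evidence and consolidate it into an evidence chain.
    evidence = consolidate(query, retrieve_evidence(query))
    # Step 2: prompt the fixed LLM with the consolidated evidence.
    prompt = f"Evidence:\n{evidence}\n\nUser: {query}\nAnswer using only the evidence."
    response = call_llm(prompt)
    # Step 3: verify the candidate response and iterate with feedback.
    for _ in range(max_iters):
        score, feedback = utility_score(query, evidence, response)
        if score >= threshold:  # candidate passes verification
            break
        # Revise the prompt with the feedback message and query the LLM again.
        prompt += f"\n\nFeedback: {feedback}\nPlease revise your answer."
        response = call_llm(prompt)
    return response

print(llm_augmenter("What time does the hotel pool open?"))
```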

The work presented in this study shows that the LLM-Augmenter approach can effectively augment black-box LLMs with external knowledge pertinent to their interactions with users. This augmentation greatly reduces the problem of hallucinations without compromising the fluency and informative quality of the responses generated by the LLMs.

LLM-AUGMENTER’s performance was evaluated on information-seeking dialog tasks using both automatic metrics and human evaluations. Two commonly used metrics were reported: BLEU-4, which measures the overlap between the model’s output and the ground-truth human response, and Knowledge F1 (KF1), which measures the overlap with the knowledge the human used as a reference during dataset collection (a simple sketch of KF1 follows below). These metrics were chosen because they correlate best with human judgment on the DSTC9 and DSTC11 customer support tasks. Other metrics, such as BLEURT, BERTScore, chrF, and BARTScore, were also considered, as they are among the best-performing text-generation metrics for dialog.
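As an illustration, here is a minimal sketch of KF1 as described above: a token-level F1 between the model's response and the reference knowledge. The lowercase/whitespace tokenization is an assumption made for simplicity; implementations in the literature differ in their normalization details.

```python
from collections import Counter

def knowledge_f1(response: str, knowledge: str) -> float:
    """Token-level F1 between a generated response and the reference knowledge."""
    resp_tokens = response.lower().split()
    know_tokens = knowledge.lower().split()
    # Multiset intersection counts how many response tokens appear in the knowledge.
    overlap = sum((Counter(resp_tokens) & Counter(know_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(know_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a response grounded in the knowledge snippet gets a high score (~0.67).
print(knowledge_f1("the hotel has free parking",
                   "free parking is available at the hotel"))
```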




Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT) Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.