A Two-Stage Generation Process Using Large Language Models for Improved Spatial and Common-Sense Reasoning in Text-to-Image Diffusion

Recent advances in text-to-image generation have produced diffusion models that can synthesize highly realistic and diverse images. Despite these impressive capabilities, however, diffusion models like Stable Diffusion often struggle with prompts that require spatial or common-sense reasoning, leading to inaccuracies in the generated images.

To address this challenge, a research team from UC Berkeley and UCSF has proposed LLM-grounded Diffusion (LMD), a novel approach that enhances prompt understanding in text-to-image generation. They identify several scenarios, including negation, numeracy, attribute assignment, and spatial relationships, where Stable Diffusion falls short compared to LMD.

To avoid the costly and time-consuming process of training large language models (LLMs) and diffusion models from scratch, the researchers adopted an efficient alternative: they integrated off-the-shelf frozen LLMs into diffusion models, resulting in a two-stage generation process with enhanced spatial and common-sense reasoning capabilities.

In the first stage, an LLM is adapted to function as a text-guided layout generator through in-context learning. Given an image prompt, the LLM produces a scene layout consisting of bounding boxes and corresponding descriptions. In the second stage, a diffusion model is guided by the generated layout through a novel controller to produce the image. Both stages employ frozen pre-trained models, with no parameter optimization for either the LLM or the diffusion model.
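To make stage 1 concrete, here is a minimal sketch of how an in-context prompt and a layout parser might look. The prompt wording, coordinate convention, and parsing format below are illustrative assumptions, not the authors' exact implementation:

```python
import re

# Illustrative in-context prompt for stage 1: the LLM completes the final
# "Objects:" section with one "description: [x, y, width, height]" line per
# object on a 512x512 canvas (the exemplar and format are assumptions).
LAYOUT_PROMPT = """\
Task: given an image prompt, output object bounding boxes as
description: [x, y, width, height] on a 512x512 canvas.

Prompt: a cat sitting to the left of a dog
Objects:
a cat: [40, 220, 180, 200]
a dog: [300, 210, 180, 210]

Prompt: {prompt}
Objects:
"""

def parse_layout(llm_output: str) -> list[tuple[str, list[int]]]:
    """Parse 'description: [x, y, w, h]' lines into (description, box) pairs."""
    layout = []
    for line in llm_output.strip().splitlines():
        match = re.match(r"(.+?):\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", line)
        if match:
            description = match.group(1).strip()
            box = [int(value) for value in match.groups()[1:]]
            layout.append((description, box))
    return layout

# Stage 2 would then condition the diffusion model on each (description, box)
# pair via the layout-guided controller.
layout = parse_layout("a red apple: [100, 150, 120, 120]\na banana: [260, 160, 140, 90]")
```

The parsed `(description, box)` pairs are what the stage-2 controller consumes; keeping the interface this simple is what lets both models stay frozen.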

LMD offers several advantages beyond improved prompt understanding. It enables dialog-based multi-round scene specification, allowing users to provide additional clarifications and modifications for each prompt. Moreover, LMD can handle prompts in languages unsupported by the underlying diffusion model. By incorporating an LLM that supports multi-round dialog, users can query the LLM after the initial layout generation and receive updated layouts for subsequent image generation, facilitating requests such as adding objects or changing their locations or descriptions.

Additionally, LMD accepts non-English prompts by providing an example of a non-English prompt with an English layout and background description during in-context learning. This allows LMD to generate layouts with English descriptions, even when the underlying diffusion models lack support for the given language.
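A minimal sketch of such an in-context exemplar is shown below; the exact wording, the Chinese example prompt, and the `build_prompt` helper are all illustrative assumptions:

```python
# Illustrative multilingual exemplar: pairing a non-English prompt with an
# English background and English box descriptions teaches the LLM to emit
# English layouts regardless of the input language.
MULTILINGUAL_EXEMPLAR = """\
Prompt: 一只猫和一只狗在草地上
Background: a grassy lawn
Objects:
a cat: [60, 240, 170, 190]
a dog: [300, 230, 180, 200]
"""

def build_prompt(user_prompt: str) -> str:
    """Prepend the multilingual exemplar so the layout comes back in English."""
    return MULTILINGUal_EXEMPLAR if False else MULTILINGUAL_EXEMPLAR + f"\nPrompt: {user_prompt}\nObjects:\n"
```

Since the diffusion model only ever sees the English layout descriptions, the prompt's original language never needs to be supported by the diffusion model itself.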

The researchers validated LMD by comparing it against its base diffusion model, Stable Diffusion 2.1. They invite readers to explore their work for a comprehensive evaluation and further comparisons.

In summary, LMD presents a novel approach to addressing the limitations of diffusion models in accurately following prompts that require spatial or common-sense reasoning. By incorporating frozen LLMs in a two-stage generation process, LMD significantly enhances prompt understanding in text-to-image generation. It also offers additional capabilities, such as dialog-based scene specification and handling prompts in languages the diffusion model does not support. The research team's work opens new possibilities for improving the accuracy and diversity of synthesized images through the integration of off-the-shelf frozen models.