Meet MultiDiffusion: A Unified AI Framework That Enables Versatile And Controllable Image Generation Using A Pre-Trained Text-to-Image Diffusion Model

Diffusion models are now considered the state of the art among text-to-image generative models, and they have emerged as a “disruptive technology” with an unprecedented ability to create high-quality, diverse images from text prompts. This advance holds significant potential for transforming how people create digital content, but giving users intuitive control over the generated content remains a challenge for text-to-image models.

Presently, there are two approaches to controlling diffusion models: (i) Training a model from scratch or fine-tuning an existing diffusion model for the task at hand. Even in the fine-tuning scenario, this strategy frequently requires considerable computation and a lengthy development period because of the ever-increasing size of models and training data. (ii) Reusing a model that has already been trained and adding some controlled generation abilities to it. Previous techniques of this kind have focused on particular tasks and devised a specialized methodology for each. This study introduces MultiDiffusion, a new, unified framework that greatly improves the adaptability of a pre-trained (reference) diffusion model to controlled image generation.

Figure 1: MultiDiffusion enables flexible text-to-image generation by unifying multiple controls over the generated content, such as the desired aspect ratio or simple spatial guiding signals like rough region-based text prompts.

The fundamental idea of MultiDiffusion is to define a new generation process composed of several reference diffusion generation processes that are bound together by a shared set of parameters or constraints. More specifically, the reference diffusion model is applied to different regions of the generated image and predicts a denoising sampling step for each one. MultiDiffusion then takes a global denoising sampling step that reconciles all of these separate steps through a least-squares optimal solution. Consider, for instance, the task of generating an image with an arbitrary aspect ratio using a reference diffusion model trained on square images (see Figure 2 below).

Figure 2: MultiDiffusion: a new generation process, Ψ, is defined over a pre-trained reference model Φ. Starting from a noise image J_T, at each generation step, they solve an optimization task whose objective is that each crop F_i(J_t) will follow as closely as possible its denoised version Φ(F_i(J_t)). Note that while each denoising step Φ(F_i(J_t)) may pull in a different direction, their process fuses these inconsistent directions into a global denoising step Ψ(J_t), resulting in a high-quality seamless image.
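
The optimization in the caption can be written out for the simplest case. The block below is a sketch under stated assumptions, not necessarily the paper's exact formulation: Φ denotes the reference model's denoising step, Ψ the fused step, every crop is weighted equally, and the text conditioning is omitted. The set N(p), the crops that contain pixel p, is notation introduced here for illustration.

```latex
% Fused step: the image J whose crops best match the per-crop predictions.
\Psi(J_t) \;=\; \arg\min_{J} \sum_{i=1}^{n}
    \left\| F_i(J) - \Phi\big(F_i(J_t)\big) \right\|^2

% With equal weights each pixel p decouples, so the minimizer is the mean
% of the predictions made by the crops N(p) that contain p.
\Psi(J_t)[p] \;=\; \frac{1}{|N(p)|} \sum_{i \in N(p)}
    \Phi\big(F_i(J_t)\big)[p]
```

Here Φ(F_i(J_t))[p] is the value that the i-th crop's prediction assigns to pixel p. Because the objective is a sum of squared differences and each crop only selects pixels, setting its gradient to zero gives this per-pixel average directly.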

At each step of the denoising process, MultiDiffusion merges the denoising directions that the reference model provides for all the square crops. It tries to follow them all as closely as possible, subject to the constraint that neighboring crops share common pixels. Although each crop may pull in a different denoising direction, the framework produces a single unified denoising step, which yields high-quality, seamless images. Intuitively, each crop is encouraged to be a true sample of the reference model.
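
The merging step described above can be illustrated with a small NumPy sketch. It is a simplified illustration, not the authors' released code: it assumes every crop is weighted equally, in which case the least-squares reconciliation reduces to averaging the per-crop predictions over the pixels they share. The function names `crop_windows` and `fused_step`, and the `denoise_crop` callback that stands in for the pre-trained model, are hypothetical.

```python
import numpy as np


def crop_windows(height, width, size, stride):
    """Return (top, left) corners of size x size crops that cover the canvas."""
    if size > height or size > width:
        raise ValueError("crop size must fit inside the canvas")
    tops = list(range(0, height - size + 1, stride))
    lefts = list(range(0, width - size + 1, stride))
    # Add a final window so the last rows/columns are always covered.
    if tops[-1] != height - size:
        tops.append(height - size)
    if lefts[-1] != width - size:
        lefts.append(width - size)
    return [(t, l) for t in tops for l in lefts]


def fused_step(canvas, denoise_crop, size, stride):
    """One fused update: average per-crop predictions over shared pixels.

    `denoise_crop` is a placeholder for the reference model's denoising
    step on a single square crop (an assumption of this sketch).
    """
    height, width = canvas.shape[:2]
    total = np.zeros_like(canvas, dtype=np.float64)
    # Per-pixel count of covering crops, broadcastable over extra channels.
    count = np.zeros((height, width) + (1,) * (canvas.ndim - 2))
    for top, left in crop_windows(height, width, size, stride):
        window = (slice(top, top + size), slice(left, left + size))
        total[window] += denoise_crop(canvas[window])
        count[window] += 1.0
    return total / count


def toy_denoiser(crop):
    # Stand-in model: pulls each crop toward its own mean, so overlapping
    # crops disagree on their shared pixels.
    return 0.5 * crop + 0.5 * crop.mean()


rng = np.random.default_rng(0)
canvas = rng.standard_normal((64, 192, 4))  # a wide, non-square canvas
fused = fused_step(canvas, toy_denoiser, size=64, stride=16)
print(fused.shape)
```

In a real sampler this update would run once per diffusion timestep, with `denoise_crop` wrapping the pre-trained model and its noise scheduler. A smaller stride means more overlap between neighboring crops, which gives smoother transitions at the cost of more model calls.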

Using MultiDiffusion, they can apply a pre-trained reference text-to-image model to a variety of tasks, such as generating images with a specific resolution or aspect ratio, or generating images from rough region-based text prompts, as shown in Fig. 1. Significantly, their framework can solve both tasks at the same time by using a shared generation process. By comparing against relevant baselines, they found that their method achieves state-of-the-art controlled generation quality, even when compared to approaches trained specifically for these tasks. Their approach is also reported to work efficiently, without added computational overhead. The complete codebase will soon be released on their Github page. More demos are available on their project page.


Check out the Paper, Github, and Project Page. All credit for this research goes to the researchers on this project.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.