One Diffusion to Rule Diffusion: Modulating Pre-trained Diffusion Models for Multimodal Image Synthesis


Image generation AI models have taken the field by storm over the last couple of months. You have probably heard of Midjourney, DALL-E, ControlNet, or Stable Diffusion. These models are capable of generating photo-realistic images from given prompts, no matter how strange the prompt is. You want to see Pikachu running around on Mars? Go ahead, ask one of these models to do it for you, and you will get it.

Existing diffusion models rely on large-scale training data. When we say large-scale, it is really large. For example, Stable Diffusion itself was trained on more than 2.5 billion image-caption pairs. So, if you planned to train your own diffusion model at home, you might want to reconsider: training these models is extremely expensive in terms of computational resources.

On the other hand, existing models are usually unconditioned or conditioned on an abstract modality such as text prompts. This means they take only a single input into account when generating the image, and it is not possible to pass in external information such as a segmentation map. Combined with their reliance on large-scale datasets, this limits the applicability of large-scale generation models in domains where we do not have a large-scale dataset to train on.

One approach to overcome this limitation is to fine-tune the pre-trained model for a specific domain. However, this requires access to the model parameters and significant computational resources to calculate gradients for the full model. Moreover, fine-tuning a full model limits its applicability and scalability, as a new full-sized model is required for each new domain or combination of modalities. Additionally, due to their large size, these models tend to overfit quickly to the smaller dataset they are fine-tuned on.

It is also possible to train models from scratch, conditioned on the chosen modality. But again, this is limited by the availability of training data, and training from scratch is extremely expensive. Alternatively, researchers have tried to guide a pre-trained model toward the desired output at inference time, using gradients from a pre-trained classifier or a CLIP network, but this approach slows down sampling because it adds many extra computations at every inference step.
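To see where that inference-time cost comes from, here is a minimal sketch of classifier guidance in the usual style (Dhariwal and Nichol): every sampling step needs an extra backward pass through the classifier. The names `diffusion_model.p_mean_variance`, `classifier`, and `guidance_scale` are illustrative assumptions, not an API from the MCM paper.

```python
import torch

def classifier_guided_step(diffusion_model, classifier, x_t, t, y, guidance_scale=1.0):
    # Base model predicts the denoising mean/variance for this timestep (frozen, no grads).
    with torch.no_grad():
        mean, variance = diffusion_model.p_mean_variance(x_t, t)

    # Extra cost: gradients of the classifier's log-probability w.r.t. the noisy image.
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[range(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]

    # Shift the predicted mean toward the target class; this backward pass at
    # every step is what slows sampling down compared to an unguided model.
    guided_mean = mean + guidance_scale * variance * grad
    return guided_mean + variance.sqrt() * torch.randn_like(x_t)
```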

What if we could use any existing model and adapt it to our conditioning without an extremely expensive process? What if we skipped the cumbersome and time-consuming process of altering the diffusion model? Would it still be possible to condition it? The answer is yes, and let me introduce it to you.

Multimodal conditioning modules use case. Source: https://arxiv.org/pdf/2302.12764.pdf

The proposed approach, multimodal conditioning modules (MCM), is a module that can be integrated into existing diffusion networks. It uses a small diffusion-like network that is trained to modulate the original diffusion network’s predictions at each sampling timestep so that the generated image follows the provided conditioning.

MCM does not require the original diffusion model to be retrained in any way. The only training is done for the modulating network, which is small-scale and inexpensive to train. This makes the approach computationally efficient: it requires far fewer resources than training a diffusion net from scratch or fine-tuning an existing one, since gradients never need to be computed for the large diffusion net.

Moreover, MCM generalizes well even when we do not have a large training dataset. It does not slow down the inference process, as no gradients need to be calculated, and the only computational overhead comes from running the small modulation network.
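The following sketch illustrates the modulation idea described above, under my own assumptions rather than the paper's exact architecture: a small trainable network takes the noisy image and the conditioning input (e.g., a segmentation map) and adjusts the frozen diffusion model's prediction at each sampling step. All names here (`ModulationNet`, `predict_noise`, `modulated_prediction`) are hypothetical.

```python
import torch
import torch.nn as nn

class ModulationNet(nn.Module):
    """Small trainable net; the large diffusion model stays frozen."""
    def __init__(self, cond_channels: int, img_channels: int = 3, hidden: int = 64):
        super().__init__()
        # Maps conditioning + noisy image to a per-pixel scale and shift
        # for the base model's prediction.
        self.net = nn.Sequential(
            nn.Conv2d(cond_channels + img_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 2 * img_channels, 3, padding=1),
        )

    def forward(self, x_t, cond):
        scale, shift = self.net(torch.cat([x_t, cond], dim=1)).chunk(2, dim=1)
        return scale, shift

@torch.no_grad()
def modulated_prediction(frozen_diffusion, mcm, x_t, t, cond):
    # Only `mcm` was trained; the pre-trained diffusion model is untouched.
    eps = frozen_diffusion.predict_noise(x_t, t)
    scale, shift = mcm(x_t, cond)
    # Modulate the prediction so the generated sample follows the conditioning.
    return eps * (1 + scale) + shift
```

At inference, this prediction simply replaces the base model's output inside the usual sampling loop, which is why the only added cost is one forward pass through the small network per timestep.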

Overview of the proposed modulation pipeline. Source: https://arxiv.org/pdf/2302.12764.pdf

Incorporating the multimodal conditioning module adds more control to image generation by making it possible to condition on additional modalities such as a segmentation map or a sketch. The main contribution of the work is the introduction of multimodal conditioning modules: a method for adapting pre-trained diffusion models to conditional image synthesis without changing the original model’s parameters, achieving high-quality and diverse results while being cheaper and using less memory than training from scratch or fine-tuning a large model.

Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.