Researchers at Stanford Introduce ControlNet: A Neural Network Structure to Control Pre-Trained Large Diffusion Models to Support Additional Input Conditions

The development of large generative models like ChatGPT and DALL-E has been a topic of great interest in the Artificial Intelligence community. Using advanced deep learning techniques, these models do everything from generating text to producing images. DALL-E, developed by OpenAI, is a text-to-image generation model that produces high-quality images from an entered textual description. Trained on massive datasets of text and images, such models learn to turn a given prompt into a visual representation. Beyond producing a fresh image from a textual description, several current text-to-image systems can also generate a new image from an existing one, building on diffusion models such as Stable Diffusion. The recently introduced neural network structure, ControlNet, significantly improves control over these text-to-image diffusion models.

Developed by Stanford University researchers Lvmin Zhang and Maneesh Agrawala, ControlNet allows images to be generated with precise, fine-grained control over the production process of diffusion models. A diffusion model is a generative model that produces an image from a text prompt by iteratively updating the variables representing the image: with each iteration, noise is removed and more detail is added, gradually shifting toward the target image. Stable Diffusion is one such model; it runs this denoising process in a compressed latent space, which makes producing varied images far more stable and computationally convenient.
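To make that iterative refinement concrete, here is a minimal sketch of a reverse-diffusion sampling loop. The names denoiser, scheduler, and text_embedding are illustrative placeholders, not the actual Stable Diffusion API:

```python
import torch

def sample_image(denoiser, scheduler, text_embedding, num_steps=50,
                 latent_shape=(1, 4, 64, 64)):
    """Simplified reverse-diffusion loop: start from pure noise and
    repeatedly remove a little of it, guided by the text embedding."""
    latents = torch.randn(latent_shape)              # begin with random noise
    for t in scheduler.timesteps[:num_steps]:
        # Predict the noise still present at timestep t, conditioned on the prompt.
        noise_pred = denoiser(latents, t, text_embedding)
        # Remove a fraction of that noise, moving one step closer to a clean image.
        latents = scheduler.step(noise_pred, t, latents)
    return latents  # in Stable Diffusion, a separate VAE decoder turns these latents into pixels
```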

ControlNet works in combination with a previously trained diffusion model to generate images that cover all aspects of the textual description fed as input while also taking additional input conditions into account. It does this by making two variants of each block of Stable Diffusion: a trainable copy and a locked copy. During training, the trainable copy learns the new condition for synthesizing images and how to add its details to the output, even from relatively small datasets. The locked copy, on the other hand, retains the abilities the diffusion model acquired before the new condition was introduced.
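A rough PyTorch sketch of this trainable/locked arrangement is shown below. The paper connects the two copies through "zero convolutions" (1x1 convolutions initialized to zero) so the trainable branch contributes nothing at the start of training; the block layout here is a simplification for illustration, not the authors' implementation:

```python
import copy
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the trainable branch
    initially adds nothing to the frozen pretrained pathway."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """One Stable Diffusion block wrapped ControlNet-style: a locked
    (frozen) copy keeps the pretrained behaviour, while a trainable
    copy learns the new condition (e.g. an edge map)."""
    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.locked = pretrained_block                    # weights frozen
        for p in self.locked.parameters():
            p.requires_grad = False
        self.trainable = copy.deepcopy(pretrained_block)  # weights updated during training
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x, condition):
        # Locked path: exactly what the pretrained model would produce.
        base = self.locked(x)
        # Trainable path: sees the extra condition through a zero convolution,
        # so at initialization its contribution is zero.
        control = self.trainable(x + self.zero_in(condition))
        return base + self.zero_out(control)
```

Because both zero convolutions start at zero, the combined block initially behaves exactly like the pretrained one, which is what protects the locked model's capabilities while the trainable copy gradually learns the new condition.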

One of the best parts of ControlNet is its ability to tell which parts of the conditioning input are significant for generating the target image and which are not. Unlike traditional methods that struggle to follow the structure of an input image closely, ControlNet overcomes the issue of spatial consistency by enabling Stable Diffusion to use the supplementary input conditions to guide the layout of the output. The researchers behind ControlNet also report that it can be trained on a single Graphics Processing Unit (GPU) with as little as eight gigabytes of graphics memory.

ControlNet is definitely a great breakthrough, as it can be trained to follow conditions ranging from edge maps and human key points to segmentation maps. It is a strong addition to the already popular image generation techniques and, trained on suitable datasets and combined with Stable Diffusion, can be used in many applications for better control over image generation.
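For readers who want to experiment, ControlNet checkpoints are distributed through Hugging Face's diffusers library; a minimal example that conditions Stable Diffusion on a Canny edge map might look like the following (the model identifiers and exact pipeline API are assumptions about a recent diffusers release and may differ between versions):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Build a Canny edge map from an existing image to use as the extra condition.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(cv2.cvtColor(image, cv2.COLOR_RGB2GRAY), 100, 200)
condition = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1 channel -> 3 channels

# Attach the Canny ControlNet to a pretrained Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map constrains the layout; the text prompt controls content and style.
result = pipe("a futuristic city at sunset", image=condition,
              num_inference_steps=30).images[0]
result.save("output.png")
```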


Check out the Paper and GitHub. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 14k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.