Text-To-Image Diffusion Generation Using An Invertible Process With Any Existing Diffusion Model


Recent advances in Artificial Intelligence have brought a wave of innovation, from text generation with models such as ChatGPT to image generation from a text prompt. Several text-to-image models can now not only produce a fresh image from a textual description but also edit an existing image. Generating an image is usually easier than editing one, because a great deal of fine detail in the original must be preserved during editing. For accurate text-based image editing, researchers have developed a new algorithm, EDICT (Exact Diffusion Inversion via Coupled Transformations), which performs text-guided image editing with the help of diffusion models.

Text-to-image generation is a task in which a machine learning model is trained to produce an image from a given text description. The model learns to associate text descriptions with pictures and generates new images that match the given description. EDICT performs text-to-image diffusion generation using any existing diffusion model. Diffusion models are generative models that produce new images through a diffusion process: starting from a random noise image, the model iteratively denoises it through a series of transformations until a final image matching the description emerges.
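To make that loop concrete, here is a minimal sketch of deterministic (DDIM-style) sampling. Everything in it is illustrative rather than EDICT's actual code: `eps_model` stands in for a pretrained noise-prediction network, and `alphas_cumprod` for its noise schedule.

```python
import torch

def denoise(eps_model, text_emb, steps, alphas_cumprod, shape=(1, 3, 64, 64)):
    """Minimal DDIM-style sampling: start from pure noise and iteratively
    remove the noise predicted by `eps_model` (a hypothetical network)."""
    x = torch.randn(shape)                                  # random starting image
    for t in reversed(range(steps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, t, text_emb)                     # predicted noise at step t
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current clean-image estimate
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic step to t-1
    return x
```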

Diffusion models are trained to recover a denoised image from a noisy one with the help of a textual description. In the standard editing recipe, noise is added to the original image, and this partially noised image is then denoised again under the given text. EDICT is built instead on the idea of obtaining a noisy image that would exactly reproduce the original image when denoised with the original text prompt, a kind of exact inverse of the noising process. Starting from that noisy image, slightly altering the original text yields an edited image that is mostly unchanged apart from the requested alterations.
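The usual way to approximate such an inversion is to run the deterministic sampler backwards, as sketched below (reusing the hypothetical `eps_model` and schedule from above). Note the approximation in the loop: the noise prediction made on the cleaner image is reused at the noisier step, and the error this accumulates is precisely what EDICT eliminates.

```python
def invert(eps_model, image, text_emb, steps, alphas_cumprod):
    """Naive DDIM inversion: map a real image back to a noisy latent that
    approximately regenerates it under the same text prompt."""
    x = image
    for t in range(steps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        # Approximation: eps is predicted on the cleaner image but treated
        # as if it were the prediction at the noisier step t.
        eps = eps_model(x, t, text_emb)
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps        # step toward more noise
    return x
```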

The team behind EDICT illustrates the algorithm with an example: turning an existing image of a surfing dog into an image of a surfing cat. With the standard approach, where noise is simply added to the original image before regenerating, much of the fine detail is lost, such as the waves and the color of the board. EDICT instead carries out the generation in reverse, finding the noisy image that would exactly generate the original picture of the surfing dog from its textual caption. The prompt is then tweaked by simply replacing the word dog with the word cat, and denoising the recovered noisy image under the new prompt produces a far more detailed edited image of a surfing cat. Under the hood, EDICT makes two identical copies of the image and alternately updates each one with details from the other in a fully reversible manner.
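The sketch below shows what one such coupled step could look like, following the affine update rules described in the EDICT paper; `eps_model`, the DDIM coefficients `a` and `b`, and the mixing weight `p` are our placeholder names, not the authors' actual API. Because each sub-step is an invertible affine map, the inverse step recovers the previous pair exactly, with no approximation.

```python
def edict_step(eps_model, x, y, t, text_emb, a, b, p=0.93):
    """One forward (denoising) step on the coupled pair (x, y): each copy
    is updated using noise predicted from the other, then the two are
    averaged so they stay close to one another."""
    x_i = a * x + b * eps_model(y, t, text_emb)    # update x using y's noise
    y_i = a * y + b * eps_model(x_i, t, text_emb)  # update y using new x's noise
    x_new = p * x_i + (1 - p) * y_i                # mixing keeps the pair coupled
    y_new = p * y_i + (1 - p) * x_new
    return x_new, y_new

def edict_step_inverse(eps_model, x_new, y_new, t, text_emb, a, b, p=0.93):
    """Exact algebraic inverse of edict_step: solves each affine sub-step
    for its input, recovering the previous (x, y) pair exactly (up to
    floating-point precision)."""
    y_i = (y_new - (1 - p) * x_new) / p
    x_i = (x_new - (1 - p) * y_i) / p
    y = (y_i - b * eps_model(x_i, t, text_emb)) / a
    x = (x_i - b * eps_model(y, t, text_emb)) / a
    return x, y
```

In the dog-to-cat example, the inverse steps would be run with the original caption to recover the exact noise pair, and the forward steps then run with the edited caption, which is why details like the waves and the board survive the edit.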

This new approach looks promising: current text-guided editing pipelines are inconsistent and often fail to do justice to the detail of the original image, whereas inverting the generation process allows the important content of the image to be preserved exactly. Given the growing innovation in and demand for image generation models, EDICT appears to be a strong competitor to existing methods.