Text-to-image generation is a challenging task at the intersection of computer vision and natural language processing. Generating high-quality visual content from textual descriptions requires capturing the intricate relationship between language and visual information. If text-to-image is already challenging, text-to-video synthesis adds a further dimension to 2D content generation: time, with its dependencies between video frames.

A common approach to such complex content is to use diffusion models. They have emerged as a powerful technique for this problem, using deep neural networks to generate photo-realistic images that align with a given textual description, or video frames with temporal consistency.

Diffusion models work by iteratively refining the generated content through a sequence of diffusion steps, where the model learns to capture the complex dependencies between the textual and visual domains. These models have shown impressive results in recent years, achieving state-of-the-art text-to-image and text-to-video synthesis performance. 
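As a minimal, dependency-free sketch of the process a diffusion model learns to reverse, the Python below implements the standard forward noising step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear schedule, its endpoint values, and all function names are assumptions chosen for illustration; they are not taken from the Dreamix paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed, illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: product of (1 - beta_s) for all steps s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * value + noise * rng.gauss(0.0, 1.0) for value in x0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

clean = [1.0, -1.0, 0.5, 0.0]                            # toy "image" of four pixels
slightly_noisy = add_noise(clean, alpha_bars[9], rng)     # early step
almost_pure_noise = add_noise(clean, alpha_bars[-1], rng)  # final step
```

At early steps the sample stays close to the clean signal, while at the final step almost nothing of it remains. Generation runs this process in reverse, with a learned network removing a little noise at each step, conditioned on the text prompt.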

Although these models offer new creative processes, they are mostly constrained to creating novel images rather than editing existing ones. Some recent approaches have been developed to fill this gap, focusing on preserving particular image characteristics, such as facial features, background, or foreground, while editing others.

For video editing, the situation is different. To date, only a few models have been applied to this task, and with limited results. A video editing technique can be judged on three criteria: alignment, fidelity, and quality. Alignment refers to the degree of consistency between the input text prompt and the output video. Fidelity accounts for how well the original input content is preserved (or at least the portion not referred to in the text prompt). Quality refers to the visual sharpness of the output, such as the presence of fine-grained details.

The most challenging part of this type of video editing is maintaining temporal consistency between frames. Since applying image-level editing methods frame by frame cannot guarantee such consistency, different solutions are needed.

An interesting approach to the video editing task comes from Dreamix, a novel text-guided video editing artificial intelligence (AI) framework based on diffusion models.

The overview of Dreamix is depicted below.

Source: https://arxiv.org/pdf/2302.01329.pdf

The core of this method is enabling a text-conditioned video diffusion model (VDM) to maintain high fidelity to the given input video. But how?

First, instead of following the classic approach and feeding pure noise as initialization to the model, the authors use a degraded version of the original video. This version has low spatiotemporal information and is obtained through downscaling and noise addition. 
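The degradation step can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the pooling factors, the noise level, the use of frame dropping for temporal downscaling, and the function names `downscale_frame` and `degrade_video` are all hypothetical, not the authors' settings or code.

```python
import random


def downscale_frame(frame, factor):
    """Average-pool a 2D frame (a list of rows) over factor x factor blocks."""
    height, width = len(frame), len(frame[0])
    pooled = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [frame[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def degrade_video(video, spatial_factor=2, temporal_factor=2,
                  noise_std=0.5, seed=0):
    """Drop frames, shrink each kept frame, then add Gaussian noise.

    All default values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    kept_frames = video[::temporal_factor]  # temporal downscaling
    degraded = []
    for frame in kept_frames:
        small = downscale_frame(frame, spatial_factor)  # spatial downscaling
        degraded.append([[value + rng.gauss(0.0, noise_std) for value in row]
                         for row in small])
    return degraded


# Toy grayscale clip: 8 frames of 4x4 pixels, indexed as video[t][y][x].
video = [[[float(t + y + x) for x in range(4)] for y in range(4)]
         for t in range(8)]
degraded = degrade_video(video)  # 4 frames of 2x2 noisy pixels
```

The degraded clip keeps only the coarse layout and motion of the input, which leaves the diffusion model free to synthesize new details that follow the text prompt.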

Second, the generation model is finetuned on the original video to improve the fidelity further. 

Finetuning helps the model capture the finer details of the high-resolution input video. However, if the model is simply finetuned on the input video, it may lack motion editability, since it will tend to reproduce the original motion rather than follow the text prompt.

To address this issue, the authors suggest a new approach called mixed finetuning. In mixed finetuning, the video diffusion model (VDM) is finetuned not only on the input video but also on its individual frames, disregarding their temporal order. This is achieved by masking temporal attention. Mixed finetuning leads to a significant improvement in the quality of motion edits.
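One way to picture masked temporal attention is the toy sketch below. It assumes the mask simply restricts every frame to attend only to itself, so that a batch of frames is treated as a set of independent images; the exact masking scheme, the attention scores, and the function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def attention_weights(scores, mask):
    """Row-wise softmax over the allowed (mask is True) positions only."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def temporal_mask(num_frames, frames_only):
    """Full mask for video batches, identity mask for unordered-frame batches."""
    if frames_only:
        return [[i == j for j in range(num_frames)] for i in range(num_frames)]
    return [[True] * num_frames for _ in range(num_frames)]


num_frames = 4
# Hypothetical frame-to-frame attention scores at one spatial location.
scores = [[0.1 * (i + 1) * (j + 1) for j in range(num_frames)]
          for i in range(num_frames)]

# Video batch: every frame can attend to every other frame.
video_weights = attention_weights(
    scores, temporal_mask(num_frames, frames_only=False))
# Frame batch: each frame attends only to itself, so temporal order is ignored.
frame_weights = attention_weights(
    scores, temporal_mask(num_frames, frames_only=True))
```

With the identity mask, the temporal layers pass each frame through unchanged, so the same network can be trained on shuffled frames to learn the subject's appearance without memorizing the original motion.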

The comparison in the results between Dreamix and state-of-the-art approaches is depicted below.

Source: https://arxiv.org/pdf/2302.01329.pdf

This was the summary of Dreamix, a novel AI framework for text-guided video editing.

If you want to learn more about this framework, see the paper linked above and the accompanying project page.


Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.
