CoDi: A New Cross-Modal Diffusion Model for Any-to-Any Synthesis


In the past few years, powerful cross-modal models have emerged that can generate one type of information from another, for example transforming text into text, images, or audio. A notable example is Stable Diffusion, which can generate stunning images from a text prompt describing the desired outcome.

Despite delivering realistic results, these models face limitations in practical applications where multiple modalities coexist and interact. Suppose we want to generate an image from a text description like “cute puppy sleeping on a leather couch.” The image alone, however, is not enough: after receiving the output from a text-to-image model, we may also want to hear what the scene would sound like, for instance the puppy snoring on the couch. In this case, we would need a second model to transform the text or the resulting image into sound. Although chaining multiple task-specific generative models in a multi-step pipeline is possible, this approach is cumbersome and slow. Moreover, independently generated unimodal streams lack consistency and alignment when combined in post-processing, for example when synchronizing video and audio.

A comprehensive and versatile any-to-any model could simultaneously generate coherent video, audio, and text descriptions, enhancing the overall experience and reducing the required time.

In pursuit of this goal, Composable Diffusion (CoDi) has been developed for simultaneously processing and generating arbitrary combinations of modalities. 

An overview of the architecture is shown below.

Overview of the CoDi architecture (source: https://arxiv.org/abs/2305.11846)

Training a model to handle any mixture of input modalities and flexibly generate various output combinations entails significant computational and data requirements.

This is due to the exponential growth in possible combinations of input and output modalities. Additionally, aligned training data for many combinations of modalities is scarce or nonexistent, making it infeasible to train the model on all possible input-output combinations. To address this challenge, a strategy is proposed to align multiple modalities in both the input conditioning and the generation diffusion step. Furthermore, a “Bridging Alignment” strategy for contrastive learning efficiently models the exponential number of input-output combinations with a linear number of training objectives.
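
To make the “Bridging Alignment” idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss that aligns the embeddings of two modality encoders in a shared space. The encoder outputs, dimensions, and temperature below are illustrative assumptions, not CoDi's actual components; the point is that aligning each new modality against a shared anchor (such as text) keeps the number of contrastive objectives linear rather than exponential.

```python
# Minimal sketch of a CLIP-style contrastive ("Bridging Alignment") objective.
# Embedding sizes and the temperature are illustrative, not CoDi's actual values.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               audio_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that pulls paired (text, audio) embeddings together."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2a = F.cross_entropy(logits, targets)            # text -> audio direction
    loss_a2t = F.cross_entropy(logits.t(), targets)        # audio -> text direction
    return (loss_t2a + loss_a2t) / 2

# Usage: a batch of 8 paired embeddings in a 512-d shared space (dimensions assumed).
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```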

To achieve a model with the ability to generate any-to-any combinations and maintain high-quality generation, a comprehensive model design and training approach is necessary, leveraging diverse data resources. The researchers have adopted an integrative approach to building CoDi. Firstly, they train a latent diffusion model (LDM) for each modality, such as text, image, video, and audio. These LDMs can be trained independently and in parallel, ensuring excellent generation quality for each individual modality using available modality-specific training data. This data consists of inputs with one or more modalities and an output modality.
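
As a rough illustration of this first stage, the sketch below trains a stand-in denoiser with the standard DDPM noise-prediction objective, the way each modality-specific LDM can be trained independently on its own data. All module names and hyperparameters are placeholders, not CoDi's actual architecture.

```python
# Sketch of the stage-1 idea: each modality gets its own latent diffusion model
# trained with the usual noise-prediction objective. Everything here is a
# simplified stand-in, not CoDi's real components.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a modality-specific denoising network (a UNet in practice)."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 256),
                                 nn.SiLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, z_noisy, t):
        # Condition on the (normalized) timestep by simple concatenation.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([z_noisy, t_feat], dim=-1))

def diffusion_loss(denoiser, z0, alphas_cumprod):
    """Predict the added noise eps from a noised latent z_t (DDPM objective)."""
    b = z0.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps
    return torch.mean((denoiser(z_t, t) - eps) ** 2)

# Each modality's LDM can be trained independently and in parallel on its own data.
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = diffusion_loss(TinyDenoiser(), torch.randn(8, 64), alphas_cumprod)
```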

For conditional cross-modality generation, where combinations of modalities are involved, such as generating images using audio and language prompts, the input modalities are projected into a shared feature space. This multimodal conditioning mechanism prepares the diffusion model to condition on any modality or combination of modalities without requiring direct training for specific settings. The output LDM then attends to the combined input features, enabling cross-modality generation. This approach allows CoDi to handle various modality combinations effectively and generate high-quality outputs.
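
A minimal sketch of this conditioning mechanism is shown below: because the prompt encoders are aligned to a shared feature space, their outputs can simply be interpolated and passed to the diffuser's cross-attention. The encoder shapes, the weights, and the `image_unet` call are assumptions for illustration only.

```python
# Hedged sketch of conditioning on a mix of modalities via weighted interpolation
# of aligned prompt embeddings. Shapes and names are illustrative assumptions.
import torch
from typing import List, Optional

def combine_conditions(embeddings: List[torch.Tensor],
                       weights: Optional[List[float]] = None) -> torch.Tensor:
    """Weighted interpolation of per-modality condition embeddings of shape (B, L, D)."""
    if weights is None:
        weights = [1.0 / len(embeddings)] * len(embeddings)
    return sum(w * e for w, e in zip(weights, embeddings))

# Example: condition an image diffuser on both a text prompt and an audio clip.
text_cond = torch.randn(1, 77, 768)    # output of an aligned text encoder (assumed shape)
audio_cond = torch.randn(1, 77, 768)   # output of an aligned audio encoder (assumed shape)
cond = combine_conditions([text_cond, audio_cond], weights=[0.5, 0.5])
# The combined embedding would then serve as the cross-attention context of the
# image LDM, e.g. eps = image_unet(z_t, t, context=cond)  # hypothetical call
```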

The second stage of training in CoDi facilitates the model’s ability to handle many-to-many generation strategies, allowing for the simultaneous generation of diverse combinations of output modalities. To the best of current knowledge, CoDi stands as the first AI model to possess this capability. This achievement is made possible by introducing a cross-attention module to each diffuser and an environment encoder V, which projects the latent variables from different LDMs into a shared latent space.

During this stage, the parameters of the LDM are frozen, and only the cross-attention parameters and V are trained. As the environment encoder aligns the representations of different modalities, an LDM can cross-attend with any set of co-generated modalities by interpolating the output representation using V. This seamless integration enables CoDi to generate arbitrary combinations of modalities without the need to train on every possible generation combination. Consequently, the number of training objectives is reduced from exponential to linear, providing significant efficiency in the training process.
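
The sketch below illustrates this second stage under stated assumptions: the pretrained LDMs stay frozen, while a small cross-attention block and an environment encoder V (here a single linear projection) are trained so that, for example, a video diffuser can attend to the audio diffuser's latents in the shared space. Class names and dimensions are illustrative placeholders, not CoDi's actual modules.

```python
# Loose sketch of stage 2: freeze each pretrained LDM and train only a
# cross-attention block plus an environment encoder V that maps every
# modality's latent into one shared space. All names/shapes are assumptions.
import torch
import torch.nn as nn

class EnvironmentEncoder(nn.Module):
    """Projects a modality-specific latent sequence into the shared latent space."""
    def __init__(self, in_dim: int, shared_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)

    def forward(self, z):
        return self.proj(z)

class JointCrossAttention(nn.Module):
    """Lets one diffuser attend to the shared-space latents of co-generated modalities."""
    def __init__(self, dim: int, shared_dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=shared_dim,
                                          vdim=shared_dim, batch_first=True)

    def forward(self, own_latent, other_shared):
        out, _ = self.attn(own_latent, other_shared, other_shared)
        return own_latent + out   # residual connection

# Example: video latents attend to audio latents through V (only these are trained).
v_audio = EnvironmentEncoder(in_dim=64)          # trainable
joint_attn = JointCrossAttention(dim=128)        # trainable
video_latent = torch.randn(1, 16, 128)           # intermediate latent of the frozen video LDM
audio_latent = torch.randn(1, 32, 64)            # intermediate latent of the frozen audio LDM
video_latent = joint_attn(video_latent, v_audio(audio_latent))
```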

Some sample outputs produced by the model for different generation tasks are shown below.

Sample outputs generated by CoDi (source: https://arxiv.org/abs/2305.11846)

This was a summary of CoDi, an efficient cross-modal model for any-to-any generation with state-of-the-art quality. If you are interested, you can learn more about this technique via the links below.