Unified AI Model for Image, Video, Audio, and Language Tasks


One major step toward generalist models has been the rise of Large Language Models (LLMs). Their remarkable text understanding and generation abilities typically rest on the Transformer architecture and a single next-token prediction objective. However, they remain limited by their inability to access information beyond text, which underscores the need for robust multimodal models capable of performing a wide range of tasks across multiple modalities.

Recent work has moved beyond task- and modality-specific techniques by building more capable multimodal models. A few of these methods support more than two modalities, such as image/video-text, but most efforts remain devoted to image-text tasks.

To address this gap, researchers at Sorbonne University set out to build general-purpose models that can tackle any task. They introduce UnIVAL, an approach that does not rely on any single modality: rather than stopping at two modalities, UnIVAL unifies all four (text, images, video, and audio).

UnIVAL is the first model to tackle image, video, and audio-language tasks with a unified architecture, vocabulary, input/output format, and training objective, without requiring massive amounts of training data or an enormous model. The 0.25-billion-parameter model delivers performance on par with prior work tailored to a specific modality, and the researchers obtained new SoTA results on several tasks against similarly sized models.
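
To make the "unified architecture and vocabulary" idea concrete, here is a minimal, illustrative sketch, not UnIVAL's actual implementation: every task is cast as sequence-to-sequence generation over one shared text vocabulary, with lightweight per-modality encoders projecting inputs into a common embedding space consumed by a single Transformer. All class names, dimensions, and encoder choices below are assumptions for illustration only.

```python
# Hypothetical sketch of a unified multimodal seq2seq model (not UnIVAL's code).
import torch
import torch.nn as nn

VOCAB_SIZE, DIM = 32_000, 512  # illustrative sizes, not the paper's


class ModalityEncoder(nn.Module):
    """Maps raw features of one modality (image patches, video frames,
    audio frames, ...) to a sequence of shared-space embeddings."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (batch, seq_len, DIM)


class UnifiedSeq2Seq(nn.Module):
    """One encoder-decoder Transformer and one output vocabulary,
    regardless of which modality produced the input sequence."""

    def __init__(self):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "image": ModalityEncoder(in_dim=768),
            "video": ModalityEncoder(in_dim=1024),
            "audio": ModalityEncoder(in_dim=128),
        })
        self.text_embed = nn.Embedding(VOCAB_SIZE, DIM)
        self.backbone = nn.Transformer(
            d_model=DIM, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.lm_head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, modality: str, feats: torch.Tensor,
                target_tokens: torch.Tensor) -> torch.Tensor:
        src = self.encoders[modality](feats)          # shared embedding space
        tgt = self.text_embed(target_tokens)          # shared text vocabulary
        hidden = self.backbone(src, tgt)
        return self.lm_head(hidden)                   # next-token logits


# The same model and loss serve an image-captioning batch ...
model = UnifiedSeq2Seq()
logits = model("image", torch.randn(2, 196, 768),
               torch.randint(0, VOCAB_SIZE, (2, 16)))
# ... and an audio-captioning batch, with no task-specific heads.
logits = model("audio", torch.randn(2, 300, 128),
               torch.randint(0, VOCAB_SIZE, (2, 16)))
```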

Their study of the interplay and knowledge transfer between pretraining tasks and modalities demonstrates the value of multitask pretraining over traditional single-task pretraining. They also find that pretraining on additional modalities improves generalization to modalities the model was never trained on: when fine-tuned on audio-text tasks, UnIVAL achieves performance competitive with the SoTA despite having no audio pretraining.

Building on previous studies, the team also presents a new investigation into merging multimodal models via weight interpolation. Starting from the unified pretrained model fine-tuned on various multimodal tasks, they show that interpolating in weight space combines the skills of the individual fine-tuned models, yielding more robust multitask models with no inference overhead. The diversity of multimodal tasks can thus be exploited and reused by averaging fine-tuned weights, complementing multitask pretraining. This is the first work to successfully apply weight interpolation to such multimodal models.
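
The core mechanic of weight interpolation is a linear combination of parameters from models fine-tuned from the same pretrained checkpoint. The sketch below is a hedged illustration of that idea; the function name, coefficients, and checkpoint filenames are hypothetical and not taken from the paper.

```python
# Illustrative weight-space interpolation of fine-tuned checkpoints (hypothetical).
import torch


def interpolate_state_dicts(state_dicts, coeffs):
    """Linearly interpolate parameter dicts: merged = sum_i coeffs[i] * theta_i."""
    assert abs(sum(coeffs) - 1.0) < 1e-6, "coefficients should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(c * sd[name].float()
                           for c, sd in zip(coeffs, state_dicts))
    return merged


# Example usage with two hypothetical fine-tuned checkpoints
# (e.g. one for captioning, one for VQA), both starting from the
# same unified pretrained model:
# caption_sd = torch.load("unival_caption.pt")
# vqa_sd = torch.load("unival_vqa.pt")
# merged_sd = interpolate_state_dicts([caption_sd, vqa_sd], coeffs=[0.5, 0.5])
# model.load_state_dict(merged_sd)
```

Because the merge happens once, offline, the resulting multitask model is a single set of weights and adds no cost at inference time.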

The researchers also mention two significant drawbacks of UnIVAL:

  1. UnIVAL is susceptible to hallucinations. In particular, it may invent new objects in visual descriptions (object bias), giving more weight to consistency than accuracy. 
  2. It has trouble following elaborate directions. They found that the model underperformed when given complex instructions, such as picking out one object from a group of similar ones, finding things that are far away or extremely close, or recognizing numbers.

The researchers hope their findings will motivate other scientists and speed up the process of building new modality-agnostic generalist assistant agents.