Multimodal models are among the most significant advances in artificial intelligence. These models are designed to process and understand data from multiple modalities: visual (images and videos), textual (natural language), and audio (speech and sound). By combining and analyzing data across these modalities, they can carry out complex tasks that require comprehension and inference over several kinds of data. Because large multimodal models are widely used for vision tasks, pre-training them on image-text pairs has been shown to yield strong performance on a range of vision-related tasks.
Researchers have been trying to improve the utility of web data, such as image-text pairs, for training large multimodal models used in vision tasks. However, web-scraped data is frequently noisy or uninformative due to factors such as poorly aligned image-text pairs, faulty data sources, and low-quality content. Existing methods reduce this noise through filtering, but filtering often comes at the cost of data diversity. To address this, a team of researchers has presented an approach that targets caption quality as a major source of noise in web-scraped data.
The primary goal is to explore how generated captions can improve the usefulness of image-text pairs whose text is vague or uninformative. To that end, the team tested several mixing tactics that combine raw web-scraped captions with captions produced by an image captioning model. The approach outperforms the best filtering strategy proposed by the DataComp benchmark by a wide margin: with a candidate pool of 128 million image-text pairs, it improves ImageNet accuracy by 2% and average performance across 38 tasks by 4%. Their best method also surpasses conventional techniques on Flickr and MS-COCO retrieval tasks, demonstrating the viability of the strategy in real-world settings.
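One simple instance of such a mixing tactic can be sketched as follows. This is a hypothetical illustration, not the authors' exact recipe: keep the raw web caption when its image-text alignment score (e.g., a CLIP similarity, assumed precomputed here) clears a threshold, and fall back to the model-generated caption otherwise.

```python
def mix_captions(raw_caption, synthetic_caption, alignment_score, threshold=0.3):
    """Hypothetical mixing rule: keep the raw web caption when its
    image-text alignment score clears a threshold; otherwise fall
    back to the generated caption. The threshold value and the
    scoring function are illustrative assumptions, not the paper's
    exact method."""
    if raw_caption and alignment_score >= threshold:
        return raw_caption
    return synthetic_caption

# A noisy filename-like caption is replaced; an informative one is kept.
print(mix_captions("IMG_1234.jpg", "a red car parked on a street", 0.05))
print(mix_captions("golden retriever playing fetch", "a dog", 0.41))
```

In practice, keeping the raw caption when it is already informative preserves the diversity of web text, while the synthetic fallback rescues pairs that filtering would otherwise discard.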
The team has also examined why synthetic captions are a useful form of text supervision. By testing multiple image captioning models, they show that a model's performance on established image captioning benchmarks, such as NoCaps CIDEr, does not reliably predict how useful its generated captions are for multimodal training. This highlights the need to evaluate generated captions specifically for multimodal tasks rather than relying solely on conventional image captioning benchmarks.
The study uses DataComp's pool of 1.28 billion image-text pairs to investigate the application of generated captions at a larger scale. This experiment reveals the limitations of synthetic text and underscores the growing importance of image curation as training data expands. The insights shared by the team are:
- Selecting a captioning model: Fine-tuning a pretrained network for image captioning on standard benchmarks may not lead to effective captions for multimodal training. Reference-free metrics such as CLIP-S better reflect the training quality of the generated captions.
- Combining captions from multiple sources: Multiple strategies have been explored for filtering and mixing raw and synthetic captions, resulting in performance gains at small and medium scales on the DataComp benchmark.
- Effectiveness of synthetic captions: On an individual level, synthetic captions are less noisy and contain more visual information. However, at the population level, they lack diversity compared to raw captions.
- Scalability of synthetic captions' benefits: The best filtering approach varies across data scales. Experiments at different scales highlight the limitations of synthetic captions, with image quality control and the diversity gap becoming more critical in larger data regimes.
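The CLIP-S metric mentioned above scores a caption without any reference texts: it is the cosine similarity between CLIP's image and text embeddings, clipped at zero and rescaled by a constant w = 2.5. A minimal sketch on precomputed toy embeddings follows; the real metric requires running CLIP's image and text encoders to produce the vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIP-S: w * max(cos(image, text), 0).
    The embeddings here are assumed precomputed; in practice they
    come from CLIP's image and text encoders."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Toy 2-d embeddings: a perfectly aligned pair scores 2.5,
# an orthogonal (unrelated) pair scores 0.
print(clip_score([1.0, 0.0], [1.0, 0.0]))
print(clip_score([1.0, 0.0], [0.0, 1.0]))
```

Because the score needs no ground-truth captions, it can rank web-scale caption pools where human references are unavailable, which is why it correlates better with training utility than reference-based metrics like CIDEr.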