A group of researchers from the University of Washington, Stanford, AI2, UCSB, and Google recently developed the OpenFlamingo project, which aims to build models similar to those of DeepMind's Flamingo team. OpenFlamingo models handle arbitrary interleaved sequences of text and images and produce text as output. This interface, together with the models' ability to condition on in-context examples, makes them useful for tasks such as captioning, visual question answering, and image classification.
Now, the team announces the release of v2, with five trained OpenFlamingo models at the 3B, 4B, and 9B scales. These models are derived from open-source language models with less restrictive licenses than LLaMA, including Mosaic's MPT-1B and 7B and Together.XYZ's RedPajama-3B.
The researchers follow the Flamingo modeling paradigm, injecting visual features into the layers of a frozen, pretrained language model. As in Flamingo, the vision encoder and language model are kept frozen, and only the connecting modules are trained on web-scraped image-text sequences.
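To make the frozen-backbone idea concrete, here is a minimal PyTorch sketch of Flamingo-style wiring: a frozen language-model layer is preceded by a trainable gated cross-attention connector that lets text tokens attend to visual features. All class names, dimensions, and the single-layer setup are illustrative assumptions for this sketch, not OpenFlamingo's actual API or architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable connector (hypothetical sketch): text tokens attend to
    visual features. A tanh gate initialized to zero means the block starts
    as an identity map, so training can blend in vision gradually."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: gate starts closed

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return text + torch.tanh(self.gate) * attended

class FlamingoStyleLayer(nn.Module):
    """A frozen LM layer with a trainable cross-attention connector in front."""
    def __init__(self, lm_layer: nn.Module, dim: int):
        super().__init__()
        self.xattn = GatedCrossAttentionBlock(dim)
        self.lm_layer = lm_layer
        for p in self.lm_layer.parameters():  # keep the language model frozen
            p.requires_grad = False

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        return self.lm_layer(self.xattn(text, visual))

# Toy usage: stand in for one LM layer with a (now frozen) linear map.
dim = 32
layer = FlamingoStyleLayer(nn.Linear(dim, dim), dim)
text = torch.randn(2, 5, dim)    # (batch, text tokens, dim)
visual = torch.randn(2, 7, dim)  # (batch, visual tokens, dim)
out = layer(text, visual)
print(out.shape)

# Only the connector's parameters remain trainable.
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)
```

Because only `xattn` parameters have `requires_grad=True`, an optimizer built from the trainable parameters updates just the connecting module, mirroring the training setup described above.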
The team evaluated their models on vision-language datasets covering captioning, visual question answering (VQA), and classification. Their findings show significant progress between the v1 release and the OpenFlamingo-9B v2 model.
To gauge model efficacy, they aggregate results across seven datasets and five few-shot settings: zero, four, eight, sixteen, and thirty-two in-context examples. Comparing the OpenFlamingo (OF) models against Flamingo-3B and Flamingo-9B, they find that, on average, OpenFlamingo reaches more than 80% of the corresponding Flamingo performance. The researchers also compare their results against the fine-tuned state-of-the-art results published on PapersWithCode: pretrained only on web data, the OpenFlamingo-3B and OpenFlamingo-9B models reach more than 55% of fine-tuned performance with 32 in-context examples. OpenFlamingo's models lag behind DeepMind's by an average of 10% in the 0-shot setting and 15% in the 32-shot setting.
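The "percent of Flamingo performance" aggregation described above can be sketched as follows: for each (dataset, shots) pair, divide the OpenFlamingo score by the matching Flamingo score, then average the ratios. The dataset names and all scores below are made-up placeholders for illustration, not the project's reported numbers.

```python
# Hypothetical scores, keyed as scores[dataset][num_shots] -> metric value.
# These values are illustrative placeholders only.
openflamingo = {
    "coco":  {0: 74.9, 4: 77.3, 32: 81.2},
    "vqav2": {0: 44.6, 4: 45.8, 32: 47.1},
}
flamingo = {
    "coco":  {0: 84.3, 4: 92.8, 32: 103.2},
    "vqav2": {0: 49.2, 4: 53.2, 32: 57.1},
}

# Per-setting ratio of OpenFlamingo to Flamingo, averaged over all settings.
ratios = [
    openflamingo[d][s] / flamingo[d][s]
    for d in openflamingo
    for s in openflamingo[d]
]
avg_pct = 100 * sum(ratios) / len(ratios)
print(f"average % of Flamingo performance: {avg_pct:.1f}")
```

Averaging ratios (rather than raw scores) keeps datasets with large metric scales, such as CIDEr for captioning, from dominating accuracy-style metrics in the aggregate.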
The team is continuously making progress in training and delivering state-of-the-art multimodal models. Next, they aim to enhance the quality of the data used for pre-training.