An important goal in the study of computer vision is to comprehend visual situations. Over the years, several proxy tasks—from picture-level tasks like classification to dense prediction tasks like object recognition, segmentation, and depth prediction—have been developed to measure how effectively models properly comprehend the contents of an image. These standards serve as a useful north star for researchers looking to create better visual understanding systems. However, one drawback of these conventional computer vision benchmarks is that they frequently confine their label sets to a predetermined lexicon of concepts. As a result, there are inherent biases and blind spots in the skills that may be acquired and used to evaluate models.

Designing benchmarks that use natural language to elicit a model’s comprehension of a particular image more nuancedly is one way to loosen up this tight formulation. Image captioning is one of the oldest of these tasks, followed by many others, including Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and Visual Entailment (VE), among others. They are particularly interested in challenges like phrase grounding and reference expression comprehension (REC) that test a model’s fine-grained localization skills. Although they are a logical extension of classical object detection, these tasks are only localization rather than genuine object detection because they presume that the items of interest are visible in the picture. They provide a bridge between these two categories of tasks in their study, which they refer to as contextual phrase detection (CPD).

When used in CPD, models are given one or more phrases that might be a component of a longer textual context. The model must find all occurrences of each word if and only if they fit inside the context established by the whole sentence. For instance, they ask the model to predict boxes for each cat and any table when there is a cat on the table and for no other item given the statement “cat on a table” (including other cats or tables that may exist in the image; see Figure 1d). Importantly, they do not imply a priori that all phrases are groundable, unlike REC and phrase grounding. When this premise is relaxed, the model is tested to see if it can stop predicting boxes when no object fulfills all of the sentence’s restrictions.

Having explicit negative certificates for a word given a picture is crucial for reliably testing the model’s capacity to discern whether the item defined by the phrase is present in the image. Since the ability to accomplish the problem requires knowledge of both localization (where the things are) and classification (is the indicated object present? ), this may be considered a real extension of the object detection task. With CPD, models may now be benchmarked for detecting anything that can be described in the free-form text without being restricted by the vocabulary, giving models’ detection skills a chance to be evaluated flexibly. They publish TRICD, a human-annotated assessment dataset comprising 2672 image-text pairings with 1101 distinct phrases linked to a total of 6058 bounding boxes, to facilitate the evaluation of this innovative job.

Figure 1: Contextual Phrase Detection (1d) develops earlier associated tasks: Similar to object detection (1a), phrase detection (1c), phrase grounding (1b), and phrase detection (1c), phrase grounding assesses both positives and negatives. It also has an unrestricted vocabulary.

They add this new restriction to the earlier attempts at open-ended detection. They chose a federated strategy since it is impossible to produce negative certifications for all the words in all the photos. For each positive phrase, they carefully select a comparable “distractor” image in which the target phrase does not appear. The biggest challenge is finding and verifying these negative examples, particularly those that can test a model’s discriminative skills.

They discover that, depending on their circumstances, models frequently mistakenly identify things when they appear in unexpected situations or hallucinate nonexistent objects. The results of this study are similar to hallucination phenomena in picture captioning systems. For instance, SoTA VQA models like FIBER, OFA, and Flamingo-3B all respond “yes” to the questions “Is there a person rowing a boat in the river?” and “Is there a baseball bat?” regarding Fig. 2a and Fig. 2b, respectively. Predicting bounding boxes requires CPD and enables a more granular insight into VL model failure mechanisms and thought processes.

They discover that, depending on their circumstances, models frequently mistakenly identify things when they appear in unexpected situations or hallucinate nonexistent objects. The results of this study are similar to hallucination phenomena in picture captioning systems. For instance, SoTA VQA models like FIBER, OFA, and Flamingo-3B all respond “yes” to the questions “Is there a person rowing a boat in the river?” and “Is there a baseball bat?” regarding Fig. 2a and Fig. 2b, respectively. Predicting bounding boxes requires CPD and enables a more granular insight into VL model failure mechanisms and thought processes.

Figure 2: Questions with “yes” responses from SOTA VQA models

They show a large performance gap (?10 points) between the evaluated models’ performance on TRICD compared to benchmarks like GQA  and Flickr30k in terms of F1-score on binary questions and phrase grounding recall@1, respectively, indicating that their dataset is challenging. On the CPD task, the best model achieves 21.5 AP on TRICD. They examine failure cases and find substantial room for improvement in SoTA models’ abilities to understand contextual cues. They hope that TRICD serves to better measure progress in building visual understanding models having fine-grained spatial and relational understanding. More examples can be found on their project website.


Check out the Paper, Project and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 14k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.