Finding all of the “objects” in a given image is a foundational problem in computer vision. By fixing a vocabulary of categories and training a model to recognize instances of those categories, one can sidestep the question “What is an object?” That shortcut becomes a liability when such object detectors are used in practical settings such as home agents. When asked to ground referential utterances in 2D or 3D scenes, models typically learn to pick the referenced item from a pool of object proposals supplied by a pre-trained detector. As a result, they can fail on utterances that refer to visual entities at a finer granularity than the detector proposes, such as the chair, the chair leg, or the front tip of the chair leg.
The research team presents the Bottom-Up, Top-Down DEtection TRansformer (BUTD-DETR, pronounced “Beauty-DETER”), a model that conditions directly on an utterance and detects all the objects it mentions. It is trained on image-language pairs annotated with bounding boxes for every object referred to in the utterance, as well as on fixed-vocabulary object detection datasets. When the utterance is simply a list of object categories, BUTD-DETR functions as a standard object detector. With a few modifications, it can ground language phrases in both 3D point clouds and 2D images.
Instead of selecting boxes from a pool of proposals, BUTD-DETR decodes object boxes by attending to both language and visual input. Bottom-up, task-agnostic attention may miss some details when locating an object, but language-directed attention fills in the gaps. The model takes a scene and an utterance as input. A pre-trained detector extracts box proposals. Next, per-modality encoders extract visual, box, and language tokens from the scene, the box proposals, and the utterance. These tokens are contextualized by attending to one another. The refined visual tokens then initialize object queries, which attend to the different token streams and decode boxes and text spans.
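The contextualization step, in which tokens from one stream attend to tokens from the others, can be illustrated with a toy scaled dot-product cross-attention. This is only a minimal sketch under stated assumptions: the function names, the two-dimensional toy tokens, and the single-head design without learned projections, residual connections, or normalization are all illustrative and are not taken from the BUTD-DETR implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


# Hypothetical toy tokens (real models use learned, high-dimensional features).
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
box_tokens = [[0.5, 0.5]]
language_tokens = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

# Visual tokens attend to the box and language streams together.
context = box_tokens + language_tokens
refined_visual = cross_attend(visual_tokens, context, context)
```

In this sketch each refined visual token is a convex combination of the box and language tokens; a real encoder layer would add the result back to the original visual token and repeat the process over several layers.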
Object detection can itself be viewed as grounded referential language in which the utterance is the category label of the object being detected. The researchers cast object detection as the referential grounding of detection prompts: they randomly select object categories from the detector’s vocabulary and concatenate them into a synthetic utterance (for example, “Couch. Person. Chair.”). These detection prompts serve as additional supervision, with the goal of finding every instance of the listed category labels in the scene. The model is trained not to associate boxes with category labels that have no instance in the visual input (such as “person” in the example above, if the scene contains no person). In this way, a single model can both ground language and detect objects, sharing the same training data for both tasks.
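The construction of a detection prompt can be sketched as follows. This is a hedged illustration, not the authors’ code: the function `build_detection_prompt`, its parameters, and the `has_boxes` field are hypothetical names, and the uniform sampling of categories is an assumption based only on the description above.

```python
import random


def build_detection_prompt(vocabulary, present_labels, num_categories=3, rng=None):
    """Sample categories and join them into an utterance like 'Couch. Person. Chair.'

    Returns the prompt and, for each sampled category, its character span in
    the prompt plus a flag saying whether the scene contains that category
    (only those categories should be matched to boxes).
    """
    rng = rng or random.Random()
    sampled = rng.sample(sorted(vocabulary), k=min(num_categories, len(vocabulary)))

    parts, targets, cursor = [], [], 0
    for label in sampled:
        word = label.capitalize() + "."
        start, end = cursor, cursor + len(word) - 1  # span excludes the period
        targets.append(
            {"label": label, "span": (start, end), "has_boxes": label in present_labels}
        )
        parts.append(word)
        cursor += len(word) + 1  # +1 for the separating space
    return " ".join(parts), targets


# Hypothetical vocabulary and scene annotations, for illustration only.
vocabulary = ["couch", "person", "chair", "table"]
present_in_scene = {"couch", "chair"}
prompt, targets = build_detection_prompt(
    vocabulary, present_in_scene, num_categories=3, rng=random.Random(0)
)
```

Categories flagged with `has_boxes=False` play the role of “person” in the example above: they appear in the prompt, but the model should learn to leave them without any box.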
The MDETR-3D equivalent that the researchers developed performs poorly compared with earlier models, whereas BUTD-DETR achieves state-of-the-art performance on 3D language grounding.
BUTD-DETR also works in the 2D domain: with architectural enhancements such as deformable attention, it performs on par with MDETR while converging twice as quickly. Because it can be adapted to either setting with minor adjustments, the approach takes a step toward unifying grounding models for 2D and 3D.
On all 3D language grounding benchmarks (SR3D, NR3D, ScanRefer), BUTD-DETR demonstrates significant performance gains over previous state-of-the-art methods. It was also the best submission to the ReferIt3D competition held at the ECCV workshop on Language for 3D Scenes. On 2D language grounding benchmarks, BUTD-DETR is competitive with the best existing approaches when trained on large-scale data. Specifically, the efficient deformable attention that the researchers add to the 2D model lets it converge twice as fast as the state-of-the-art MDETR.
The video below describes the complete workflow.