Training Models for Object Detection and Instance Segmentation Without Human Annotations

Object detection and image segmentation are fundamental tasks in computer vision and artificial intelligence, with critical applications in areas such as autonomous vehicles, medical imaging, and security systems.

Object detection involves locating instances of objects within an image or a video stream. It consists of identifying both the class of each object and its location within the image. The goal is to produce a bounding box around the object, which can then be used for further analysis or to track the object over time in a video stream. Object detection algorithms can be divided into two categories: one-stage and two-stage. One-stage methods are typically faster but less accurate, while two-stage methods are slower but more accurate.
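
Bounding boxes are commonly represented as `(x_min, y_min, x_max, y_max)` coordinates, and the standard way to compare a predicted box against a ground-truth box is Intersection over Union (IoU). A minimal, self-contained sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x_min, y_min, x_max, y_max) format."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means a perfect match, 0.0 means no overlap; detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.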

On the other hand, image segmentation involves partitioning an image into multiple segments or regions, where each segment corresponds to a different object or part of an object. The goal is to label each pixel in the image with a semantic class, such as “person,” “car,” “sky,” etc. Image segmentation algorithms can be divided into two categories: semantic segmentation and instance segmentation. Semantic segmentation labels each pixel with a class label, while instance segmentation additionally detects and segments each individual object instance within an image.
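
The difference can be made concrete on a toy binary mask: semantic segmentation would label every “car” pixel identically, while instance segmentation must separate the two cars. A simple (illustrative, not how real models do it) way to go from one to the other is to split the class mask into connected components:

```python
def connected_components(mask):
    """Split a binary class mask (list of lists of 0/1) into per-instance masks
    via 4-connected flood fill -- i.e., turn a semantic mask ("these pixels
    are 'car'") into instance masks ("car #1", "car #2")."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    instances = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # New instance: flood-fill all pixels connected to (y, x).
                comp = [[0] * w for _ in range(h)]
                stack = [(y, x)]
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp[cy][cx] = 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                instances.append(comp)
    return instances
```

Note that connected components only work when instances do not touch; real instance segmentation models must also separate overlapping objects, which is what makes the task harder than semantic segmentation.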

Both object detection and image segmentation have advanced significantly in recent years, mainly due to deep learning approaches. Because of their capacity to learn hierarchical representations of image inputs, Convolutional Neural Networks (CNNs) have become the go-to option for these problems. However, training these models requires specialized annotations such as object boxes, masks, and localized points, which are both costly and time-consuming to produce. Without accounting for overhead, manually annotating the 164K images in the COCO dataset with masks for only 80 classes required more than 28K hours.

The authors address these issues with a novel approach termed Cut-and-LEaRn (CutLER), which trains unsupervised object detection and instance segmentation models without any human labels. The method consists of three simple, architecture- and data-agnostic mechanisms. The pipeline of the proposed architecture is depicted below.


The authors of CutLER first introduce MaskCut, a tool that automatically generates several initial coarse masks for each image based on features computed by a self-supervised, pre-trained vision transformer (ViT). MaskCut was developed to address the limitations of existing masking tools such as Normalized Cuts (NCut): NCut can only discover a single object per image, which is heavily limiting. MaskCut extends it to discover multiple objects per image by iteratively applying NCut to a masked similarity matrix.
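
The loop can be sketched as follows: build a patch-wise similarity matrix from ViT features, solve the NCut eigenproblem to bipartition the patches into foreground and background, record the foreground as a mask, then suppress those patches in the similarity matrix and repeat. The code below is an illustrative reconstruction, not the authors' implementation; it uses random vectors in place of real ViT patch features, and a fixed sign convention in place of CutLER's foreground-selection heuristics.

```python
import numpy as np

def ncut_bipartition(W):
    """Solve the Normalized Cut relaxation for a similarity matrix W:
    take the second-smallest eigenvector of the normalized Laplacian
    and threshold it at zero to bipartition the graph."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-8))
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]            # second-smallest eigenvector
    return fiedler > 0              # boolean partition of the patches

def maskcut(features, n_objects=3, tau=0.2):
    """MaskCut-style loop (sketch): thresholded cosine similarities between
    patch features, with repeated NCut on a masked similarity matrix."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = (f @ f.T > tau).astype(float)   # thresholded patch-similarity graph
    masks = []
    for _ in range(n_objects):
        fg = ncut_bipartition(W)
        masks.append(fg)
        # Suppress discovered patches so the next NCut finds a new object.
        W[fg, :] = 1e-6
        W[:, fg] = 1e-6
    return masks
```

In the actual method, `features` are the per-patch key embeddings of a self-supervised ViT, and the resulting patch-level masks are upsampled and post-processed into pixel-level masks.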

Second, the authors implement a straightforward loss-dropping strategy to train detectors on these coarse masks; it makes the detector robust to objects that MaskCut missed. Despite being trained on rough masks, the detectors refine the pseudo ground truth and produce more accurate masks (and boxes). Multiple rounds of self-training on the model's own predictions therefore allow it to evolve from relying on local pixel similarities to capturing overall object geometry, resulting in more precise segmentation masks.
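
The loss-dropping idea can be sketched as follows: a predicted region that barely overlaps any coarse pseudo ground-truth mask may correspond to a real object that MaskCut missed, so its loss is simply dropped rather than penalized as a false positive. The snippet below is a hedged reconstruction with illustrative helper names, treating per-region losses as scalars; it is not the authors' code.

```python
def _iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def drop_loss(pred_boxes, pseudo_gt_boxes, per_box_losses, tau=0.01):
    """Loss-dropping sketch: keep the loss only for predicted regions whose
    maximum overlap with the pseudo ground truth exceeds tau. Predictions
    with little or no overlap may be objects MaskCut missed, so their loss
    is dropped instead of being treated as a false-positive penalty."""
    kept = 0.0
    for box, loss in zip(pred_boxes, per_box_losses):
        max_iou = max((_iou(box, gt) for gt in pseudo_gt_boxes), default=0.0)
        if max_iou > tau:
            kept += loss
    return kept
```

With a very small threshold, the detector is free to "explore" regions outside the coarse masks, which is what enables the self-training rounds to recover objects absent from the initial pseudo labels.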

The figure below offers a comparison between the proposed framework and state-of-the-art approaches.


This concludes our summary of CutLER, a novel framework for accurate object detection and instance segmentation without human annotations.

If you want to learn more about this framework, you can find a link to the paper and the project page.