

Accurate segmentation of multiple objects is essential for a wide range of scene understanding applications, such as image/video processing, robotic perception, and AR/VR. The recently released Segment Anything Model (SAM) is a foundation vision model for general image segmentation, trained on billion-scale mask labels. Given a set of points, a bounding box, or a coarse mask as input, SAM can segment a wide variety of objects, parts, and visual structures across diverse contexts. Its zero-shot segmentation capabilities have sparked a rapid paradigm shift, since they can be applied to many downstream tasks with only a few simple prompts.
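
To make this promptable interface concrete, here is a minimal sketch of how a point or box prompt can be passed to SAM through the official segment-anything package; the checkpoint path, image file, and prompt coordinates are placeholders chosen for illustration.

```python
# Minimal sketch of SAM's promptable interface, assuming the official
# segment-anything package and a locally downloaded ViT-H checkpoint.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# A single foreground point prompt: (x, y) coordinates with label 1 = foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Alternatively, a single box prompt in (x0, y0, x1, y1) format.
box_masks, _, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
```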

Despite its impressive performance, SAM’s segmentation results still leave room for improvement. Two significant issues affect SAM: 1) coarse mask boundaries, often failing to segment thin object structures, as shown in Figure 1; and 2) incorrect predictions, broken masks, or large errors in challenging cases. These failures are often tied to SAM’s tendency to misinterpret thin structures, such as the kite lines in the top right-hand column of the figure. Such errors significantly limit the applicability and effectiveness of foundation segmentation models like SAM, especially for automated annotation and image/video editing tasks, where highly accurate masks are essential.

Figure 1: Comparison of the masks predicted by SAM and by HQ-SAM given input prompts of a single red box or several points on the object. HQ-SAM produces noticeably more detailed results with extremely precise boundaries. In the rightmost column, SAM misinterprets the thin structure of the kite lines and, for the input box prompt, produces a large number of errors with broken holes.

Researchers from ETH Zurich and HKUST propose HQ-SAM, which retains the original SAM’s strong zero-shot capabilities and flexibility while predicting highly accurate segmentation masks, even in very challenging cases (see Figure 1). They propose a minimal adaptation of SAM, adding less than 0.5% extra parameters, to extend its capacity for high-quality segmentation while preserving efficiency and zero-shot performance. Directly fine-tuning the SAM decoder or adding a new decoder module substantially degrades general zero-shot segmentation. Therefore, the HQ-SAM design fully preserves zero-shot efficiency by integrating with and reusing SAM’s existing learned structure.

In addition to the original prompt and output tokens, they introduce a learnable HQ-Output Token that is fed into SAM’s mask decoder. Unlike the original output tokens, the HQ-Output Token and its associated MLP layers are trained to predict a high-quality segmentation mask. Second, rather than relying only on SAM’s mask decoder features, the HQ-Output Token operates on a refined feature set to recover precise mask details. They fuse SAM’s mask decoder features with the early and late feature maps from its ViT encoder to exploit both global semantic context and fine-grained local detail.
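
A rough PyTorch sketch of this idea is given below. The module names, channel dimensions, and additive fusion strategy are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch (not the official HQ-SAM code): a learnable HQ-Output
# Token plus a small fusion of early/late ViT encoder features with the mask
# decoder features, producing a high-quality mask from the token's output.
import torch
import torch.nn as nn

class HQTokenHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Learnable HQ-Output Token; it would be concatenated with SAM's
        # existing output tokens before running the mask decoder (not shown).
        self.hq_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Three-layer MLP applied to the decoder output at the HQ token position.
        self.hq_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, dim // 8),
        )
        # Tiny fusion block: project early/late ViT features and the upscaled
        # decoder features to a common space and sum them (one plausible choice).
        self.fuse_early = nn.Conv2d(dim, dim // 8, 1)
        self.fuse_late = nn.Conv2d(dim, dim // 8, 1)
        self.fuse_decoder = nn.Conv2d(dim // 8, dim // 8, 1)

    def forward(self, hq_token_out, early_feat, late_feat, decoder_feat):
        # hq_token_out: (B, dim) decoder output at the HQ token position
        # early_feat / late_feat: (B, dim, H, W) ViT encoder feature maps
        # decoder_feat: (B, dim // 8, H, W) upscaled mask-decoder features
        fused = (self.fuse_early(early_feat)
                 + self.fuse_late(late_feat)
                 + self.fuse_decoder(decoder_feat))            # (B, dim//8, H, W)
        q = self.hq_mlp(hq_token_out)                          # (B, dim//8)
        # High-quality mask logits via a dot product between the token and each
        # spatial location, mirroring SAM's dynamic mask prediction scheme.
        return torch.einsum("bc,bchw->bhw", q, fused)
```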

The complete set of pre-trained SAM parameters is frozen during training, and only the HQ-Output Token, its associated three-layer MLPs, and a small feature fusion block are updated. Learning accurate segmentation requires a dataset with precise mask annotations of diverse objects with intricate and complex geometries. SAM is trained on the SA-1B dataset, which contains 11M images and 1.1 billion masks generated automatically by a model similar to SAM itself. However, as SAM’s behavior in Figure 1 shows, training on this enormous dataset is costly, and its automatically generated masks do not provide the high-quality supervision targeted in their study.
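
This freeze-and-finetune recipe can be expressed in a few lines of standard PyTorch. The sketch below builds on the illustrative HQTokenHead above; the checkpoint path and learning rate are assumptions.

```python
# Sketch of the training setup: all pre-trained SAM weights stay fixed, and
# only the new lightweight HQ modules receive gradient updates.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
for p in sam.parameters():
    p.requires_grad = False      # freeze the image encoder, prompt encoder, and mask decoder

hq_head = HQTokenHead(dim=256)   # token, three-layer MLP, and fusion block (see sketch above)
optimizer = torch.optim.AdamW(hq_head.parameters(), lr=1e-3, weight_decay=1e-4)

# Rough check that the added parameters are a tiny fraction of the whole model.
added = sum(p.numel() for p in hq_head.parameters())
total = sum(p.numel() for p in sam.parameters()) + added
print(f"added parameters: {added / total:.4%} of the full model")
```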

As a result, they build HQSeg-44K, a new dataset containing 44K extremely fine-grained image mask annotations. HQSeg-44K merges six existing image datasets with highly accurate mask annotations, spanning more than 1,000 distinct semantic classes. Thanks to the compact dataset and their simple integrated design, HQ-SAM can be trained on 8 RTX 3090 GPUs in under 4 hours. They conduct a rigorous quantitative and qualitative experimental study to validate the effectiveness of HQ-SAM.
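
A merged training set of this kind can be assembled with standard PyTorch utilities. The sketch below is a hypothetical illustration with placeholder dataset roots and a made-up MaskDataset wrapper, not the authors' actual data pipeline.

```python
# Sketch of combining several fine-grained segmentation datasets into one
# training set, in the spirit of HQSeg-44K (paths and wrapper are placeholders).
from pathlib import Path
from PIL import Image
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class MaskDataset(Dataset):
    """Hypothetical wrapper pairing each image with its fine-grained mask file."""
    def __init__(self, root):
        self.images = sorted(Path(root, "images").glob("*.jpg"))
        self.masks = sorted(Path(root, "masks").glob("*.png"))
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        img = Image.open(self.images[i]).convert("RGB")
        mask = Image.open(self.masks[i]).convert("L")
        # Resize to a fixed resolution so batches collate cleanly (illustrative choice).
        return TF.to_tensor(TF.resize(img, [1024, 1024])), TF.to_tensor(TF.resize(mask, [1024, 1024]))

# Concatenate the source datasets into one unified fine-grained training set.
sources = [MaskDataset(r) for r in ["dataset_a/", "dataset_b/", "dataset_c/"]]
combined = ConcatDataset(sources)
loader = DataLoader(combined, batch_size=4, shuffle=True, num_workers=2)
```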

They compare HQ-SAM with SAM on a suite of nine distinct segmentation datasets from various downstream tasks, seven of which are evaluated under a zero-shot transfer protocol, including COCO, UVO, LVIS, HQ-YTVIS, BIG, COIFT, and HR-SOD. This thorough analysis shows that HQ-SAM produces masks of noticeably higher quality than SAM while retaining its zero-shot capability. A demo is available on their GitHub page.
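
One way to reproduce this style of comparison is to score both models' predicted masks against ground truth with mask IoU (boundary-aware metrics follow the same pattern). The helper below is a generic illustration, with model inference and dataset loading left out.

```python
# Minimal sketch of the evaluation idea: compare SAM and HQ-SAM predictions
# against ground-truth masks using mask IoU.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def mean_iou(pred_masks, gt_masks) -> float:
    """Average IoU over paired predictions and ground-truth masks."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))

# Usage (hypothetical): run both models with identical box prompts on a
# held-out dataset and compare the resulting mean IoU.
# print("SAM   :", mean_iou(sam_preds, gts))
# print("HQ-SAM:", mean_iou(hqsam_preds, gts))
```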

HQ-SAM is the first high-quality zero-shot segmentation model, achieved by introducing negligible overhead to the original SAM.