Meta AI Releases the Segment Anything Model (SAM): A New AI Model That Can Cut Out Any Object In An Image/Video With A Single Click


Computer vision relies heavily on segmentation, the process of determining which pixels in an image represent a particular object, for uses ranging from analyzing scientific images to editing photographs. However, building an accurate segmentation model for a given task typically requires technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data.

Recent Meta AI research presents the "Segment Anything" project, an effort to "democratize segmentation" by introducing a new task, dataset, and model for image segmentation. The release includes the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), the largest segmentation dataset to date.

Previously, segmentation problems were tackled with two main categories of approaches. The first, interactive segmentation, could segment any class of object but required a human operator to iteratively refine a mask. The second, automatic segmentation, could segment predefined object categories but required large numbers of manually annotated masks, along with substantial compute resources and technical expertise, to train the model. Neither approach offered a general, fully automatic means of segmentation.

SAM encompasses both of these broader categories of methods. It is a unified model that handles interactive and automatic segmentation tasks alike. Thanks to its flexible prompt interface, the model can be used for various segmentation tasks simply by engineering the appropriate prompt. In addition, SAM can generalize to new types of objects and images because it is trained on a diverse, high-quality dataset of more than 1 billion masks. This ability to generalize means practitioners will, by and large, not have to collect their own segmentation data and fine-tune a model for their use case.

These features allow SAM to transfer to new domains and perform different tasks. Some of SAM's capabilities are as follows (a brief usage sketch appears after the list):

  1. SAM facilitates object segmentation with a single mouse click or through the interactive selection of points to include and exclude. A bounding box can also be used as a prompt.
  2. When a prompt is ambiguous about which object is meant, SAM can generate multiple competing valid masks, a crucial feature for practical segmentation problems.
  3. SAM can automatically detect and mask all objects in an image.
  4. After precomputing the image embedding, SAM can instantly generate a segmentation mask for any prompt, enabling real-time interaction with the model.
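
To give a concrete feel for this promptable workflow, here is a minimal sketch using Meta's open-source `segment_anything` package. The checkpoint path, image file, and click coordinates are placeholders, and the snippet assumes the package and a downloaded checkpoint are available locally.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path; weights are downloadable from the official repo.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Load an image (placeholder file name) and compute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single "click" as a foreground point; label 1 = include, 0 = exclude.
point_coords = np.array([[500, 375]])
point_labels = np.array([1])

# multimask_output=True asks SAM for several candidate masks, which is how it
# handles ambiguity (e.g., a click that could mean a shirt or the whole person).
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
print(masks.shape, scores)  # (3, H, W) boolean masks plus a quality score for each
```

A bounding box prompt works the same way through the `box` argument, and the package's `SamAutomaticMaskGenerator` covers the "segment everything in the image" use case.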

The team needed a large and varied dataset to train the model, and SAM itself was used to gather it. In particular, annotators used SAM to interactively annotate images, and the newly annotated data was then used to update SAM. This loop was repeated many times to improve both the model and the dataset.

New segmentation masks can be collected quickly using SAM. The team's annotation tool makes interactive mask annotation fast and easy, taking only about 14 seconds per mask. This process is about 6.5x faster than COCO's fully manual polygon-based mask annotation and 2x faster than the previous largest model-assisted data annotation effort.

A dataset of 1 billion masks could not have been built with interactively annotated masks alone, so the researchers developed a data engine for collecting SA-1B. The data engine has three "gears." In the first, the model assists human annotators. In the second, fully automatic annotation is combined with human assistance to broaden the diversity of collected masks. In the third, fully automatic mask generation allows the dataset to scale.

The final dataset contains over 11 million licensed, privacy-protected images and 1.1 billion segmentation masks. Human evaluation studies confirm that the masks in SA-1B are high quality and diverse, comparable in quality to masks from previous, much smaller, manually annotated datasets. SA-1B has 400 times as many masks as any existing segmentation dataset.
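
For readers who want to inspect SA-1B directly, the dataset release describes per-image JSON annotation files with masks stored in COCO run-length encoding. The sketch below assumes that layout (the file name is a placeholder) and uses `pycocotools` to decode one mask; treat the exact keys as an assumption to verify against the dataset's documentation.

```python
import json
from pycocotools import mask as mask_utils  # pip install pycocotools

# Placeholder file name; SA-1B ships one JSON annotation file per image.
with open("sa_000000.json") as f:
    record = json.load(f)

# Assumed layout: an "image" entry with size info and a list of "annotations".
print(record["image"]["width"], record["image"]["height"])
print(len(record["annotations"]), "masks in this image")

# Each mask's "segmentation" field is COCO RLE; decode it to a binary array.
rle = record["annotations"][0]["segmentation"]
binary_mask = mask_utils.decode(rle)  # (H, W) uint8 array of 0s and 1s
print("foreground pixels:", int(binary_mask.sum()))
```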

The researchers trained SAM to produce an accurate segmentation mask in response to a variety of prompts, including foreground/background points, a rough box or mask, and freeform text. They observed that the pretraining task and interactive data collection imposed particular constraints on the model design: for annotators to use SAM effectively during annotation, the model must run in real time on a CPU in a web browser.

A lightweight prompt encoder instantly transforms any prompt into an embedding vector, while an image encoder produces a one-time embedding for the image. A lightweight decoder then combines these two sources of information to predict the segmentation mask. Once the image embedding has been computed, SAM can respond to any prompt in a web browser with a segmentation mask in under 50 ms.
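
A practical consequence of this split is that the heavy image encoder runs once per image, while each new prompt only pays for the lightweight prompt encoder and mask decoder. The sketch below (same caveats as before: placeholder paths, `segment_anything` package assumed installed) times repeated prompts against a single cached embedding; the ~50 ms figure applies to Meta's browser setup, so local timings will vary with hardware.

```python
import time
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# The smaller ViT-B backbone keeps the example light; checkpoint path is a placeholder.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Expensive step: run the image encoder once and cache the embedding.
predictor.set_image(image)

# Cheap step: every subsequent prompt reuses the cached embedding.
h, w = image.shape[:2]
rng = np.random.default_rng(0)
for _ in range(5):
    point = np.array([[rng.integers(w), rng.integers(h)]], dtype=float)
    start = time.perf_counter()
    masks, scores, _ = predictor.predict(point_coords=point, point_labels=np.array([1]))
    print(f"mask in {(time.perf_counter() - start) * 1000:.1f} ms, best score {scores.max():.2f}")
```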

SAM has the potential to fuel future applications in a wide variety of fields that require locating and segmenting any object in any image. For example, SAM could be integrated into larger AI systems for a general multimodal understanding of the world, such as understanding both the visual and textual content of a webpage.


Check out the Paper, Demo, Blog and Github. All credit for this research goes to the researchers on this project.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the application of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.