The success of prompt-based universal interfaces for LLMs like ChatGPT has highlighted the importance of modern AI models in human-AI interaction and opened up numerous possibilities for further research and development. Visual understanding tasks have received less attention in this context, though new studies are now starting to emerge. One such task is image segmentation, which aims to divide an image into multiple segments or regions that share visual characteristics, such as color, texture, or object class. Interactive image segmentation has a long history, but segmentation models that can interact with humans through interfaces accepting multiple types of prompts (texts, clicks, and images, or a combination of these) remain under-explored. Most segmentation models today handle only spatial hints like clicks or scribbles, or referring segmentation driven by language. Recently, the SAM model introduced support for multiple prompts, but its interaction is limited to boxes and points, and it does not provide semantic labels as output.
This paper, presented by researchers from the University of Wisconsin-Madison, introduces SEEM, a new approach to image segmentation that uses a universal interface and multi-modal prompts. The name stands for Segment Everything Everywhere all at once in an image (in reference to the movie, in case you missed it!). This new, ground-breaking model was built with four main characteristics in mind: versatility, compositionality, interactivity, and semantic awareness. For versatility, the model accepts inputs such as points, masks, text, boxes, and even a referred region of another, seemingly unrelated image. It can handle any combination of these input prompts, leading to strong compositionality. Interactivity comes from the model's ability to use memory prompts that interact with other prompts and retain information from previous segmentation rounds. Finally, semantic awareness refers to the model's ability to recognize and label different objects in an image based on their semantic meaning (for example, distinguishing between different types of cars). SEEM can assign open-set semantics to any output segmentation, meaning the model can recognize and segment objects that were never seen during training. This is really important for real-world applications, where the model may encounter new and previously unseen objects.
The model follows a simple Transformer encoder-decoder architecture with an additional text encoder. All queries are taken as prompts and fed into the decoder. The image encoder encodes spatial queries, such as points, boxes, and scribbles, into visual prompts, while the text encoder converts text queries into textual prompts. Prompts of all five types are then mapped to a joint visual-semantic space, which enables the model to handle unseen user prompts. Different types of prompts can help each other via cross-attention, so composite prompts can be used to obtain better segmentation results. Additionally, the authors note that SEEM is efficient in multi-round human interaction: the model runs the (heavy) feature extractor only once at the beginning and then re-runs the (lightweight) decoder with each new prompt.
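To make this design concrete, here is a toy pure-Python sketch of the idea: heterogeneous prompts (a click, a text query) are encoded into vectors in one shared space, and lightweight attention-based decoding is re-run per interaction round against image features that were computed once. All names, dimensions, and encoders below are illustrative assumptions, not the authors' implementation.

```python
import math
import random

DIM = 8  # toy embedding dimension (assumption; the real model uses far larger hidden sizes)

def encode_point(x, y):
    """Toy 'visual sampler': turn a click position into a prompt vector."""
    return [math.sin(x * (i + 1)) + math.cos(y * (i + 1)) for i in range(DIM)]

def encode_text(text):
    """Toy text encoder: hash characters into a prompt vector in the same space."""
    vec = [0.0] * DIM
    for ch in text:
        vec[ord(ch) % DIM] += 1.0
    return vec

def cross_attention(queries, keys):
    """Single-head dot-product attention of mask queries over key/value vectors."""
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # softmax over keys
        z = sum(weights)
        weights = [w / z for w in weights]
        outputs.append([sum(w * k[i] for w, k in zip(weights, keys)) for i in range(DIM)])
    return outputs

random.seed(0)
# Heavy image features are computed ONCE per image by the backbone...
image_features = [[random.random() for _ in range(DIM)] for _ in range(4)]

# ...while the lightweight decoder is re-run each round with the latest prompts.
queries = [[random.random() for _ in range(DIM)] for _ in range(2)]   # learnable mask queries
prompts = [encode_point(0.3, 0.7), encode_text("the black dog")]      # mixed prompt types, one joint space

refined = cross_attention(queries, prompts)            # prompts steer the mask queries...
mask_embed = cross_attention(refined, image_features)  # ...which then read from the cached features
print(len(mask_embed), len(mask_embed[0]))  # 2 mask embeddings, each DIM-dimensional
```

Because only the last two attention calls depend on the prompts, each new click or text edit is cheap: the expensive `image_features` stand-in never has to be recomputed between rounds.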
The researchers conducted experiments showing that the model performs strongly on many segmentation tasks, including closed-set and open-set segmentation of different types (interactive, referring, panoptic, and segmentation with combined prompts). The model was trained on panoptic and interactive segmentation with COCO2017, with 107K segmentation images in total. For referring segmentation, they used a combination of annotation sources (Ref-COCO, Ref-COCOg, and Ref-COCO+). To evaluate performance, they used standard metrics for each segmentation task, such as Panoptic Quality, Average Precision, and Mean Intersection over Union (mIoU). For interactive segmentation, they used the Number of Clicks (NoC) needed to reach a given IoU.
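To make these metrics concrete, the snippet below computes IoU, mean IoU, and the NoC criterion on toy binary masks. This is a minimal sketch of the standard metric definitions, not the paper's evaluation code, and the example values are made up.

```python
def iou(pred, gt):
    """Intersection over Union for two binary masks (flattened lists of 0/1)."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def mean_iou(pairs):
    """mIoU: average IoU over (prediction, ground-truth) pairs, e.g. one per class."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

def noc(ious_per_click, target=0.9, max_clicks=20):
    """NoC@target: first click count at which IoU reaches the target (max_clicks if never)."""
    for k, v in enumerate(ious_per_click, start=1):
        if v >= target:
            return k
    return max_clicks

pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 1, 1, 0]
print(iou(pred, gt))            # 2 overlapping pixels / 4 in the union = 0.5
print(noc([0.55, 0.78, 0.92]))  # IoU first reaches 0.9 on the 3rd click
```

Lower NoC is better (fewer clicks to reach a good mask), while higher IoU/mIoU is better; real evaluations run these over whole 2-D masks and full datasets rather than six-pixel toy vectors.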
The results are very promising. The model performs well on all three segmentation types: interactive, generic, and referring segmentation. For interactive segmentation, its performance is even comparable to SAM (which is trained with 5× more segmentation data) while additionally allowing a wide range of user input types and providing strong compositional capabilities. The user can click or draw a scribble on an input image, or type a text query, and SEEM produces both masks and semantic labels for the objects in that image. For example, the user might input “the black dog,” and SEEM will draw the contour around the black dog in the picture and add the label “black dog.” The user can also provide a referring image with a river and draw a scribble on the river, and the model can find and label the river in other images. It is worth noting that the model shows powerful generalization to unseen scenarios like cartoons, movies, and games. It can label objects in a zero-shot manner, i.e., classify new examples from previously unseen classes, and it can precisely segment the same object across different frames of a movie, even when the object's appearance changes through blurring or intense deformation.
In conclusion, SEEM is a powerful, state-of-the-art segmentation model that is able to segment everything (all semantics), everywhere (every pixel in the image), all at once (supporting all compositions of prompts). It is a first, preliminary step toward a universal and interactive interface for image segmentation, bringing computer vision closer to the kinds of advancements seen in LLMs. Its performance is currently limited by the amount of training data and will likely improve with larger segmentation datasets, such as the one being developed by the concurrent work SAM. Supporting part-based segmentation is another avenue to explore to enhance the model.
Check out the Paper and GitHub link.
Nathalie Crevoisier holds a Bachelor’s and Master’s degree in Physics from Imperial College London. She spent a year studying Applied Data Science, Machine Learning, and Internet Analytics at the Ecole Polytechnique Federale de Lausanne (EPFL) as part of her degree. During her studies, she developed a keen interest in AI, which led her to join Meta (formerly Facebook) as a Data Scientist after graduating. During her four-year tenure at the company, Nathalie worked on various teams, including Ads, Integrity, and Workplace, applying cutting-edge data science and ML tools to solve complex problems affecting billions of users. Seeking more independence and time to stay up-to-date with the latest AI discoveries, she recently decided to transition to a freelance career.