SUSTech VIP Lab Proposes Track Anything Model (TAM) That Achieves High-Performance Interactive Tracking and Segmentation in Videos


Video Object Tracking (VOT) is a cornerstone of computer vision research because of the importance of tracking an arbitrary object in unconstrained settings. Video Object Segmentation (VOS) is a related task that, like VOT, seeks to identify the region of interest in a video and separate it from the rest of the frame. Today's best video trackers and segmenters are initialized with a segmentation mask or a bounding box and are trained on large-scale, manually annotated datasets. Producing that much labeled data requires a great deal of human labor. In addition, under current initialization schemes, semi-supervised VOS needs a ground-truth mask of the specific object before it can start.

The Segment Anything Model (SAM) was recently introduced as a general-purpose baseline for image segmentation. Thanks to its flexible prompts and real-time mask computation, it supports interactive use. Given user-friendly prompts in the form of points, boxes, or language, SAM can return satisfactory segmentation masks for the specified image areas. However, because it lacks temporal consistency, SAM does not perform impressively when applied directly to videos.

Researchers from SUSTech VIP Lab introduce the Track-Anything project, which provides powerful tools for video object tracking and segmentation. The Track Anything Model (TAM) has a straightforward interface and can track and segment any object in a video with a single round of inference.

TAM combines SAM, a large-scale segmentation model, with XMem, a state-of-the-art VOS model. Users first define a target object by interactively prompting SAM (i.e., clicking on the object). XMem then predicts the object's mask in the next frame based on temporal and spatial correspondence. Finally, SAM refines that prediction into a more precise mask. Users can pause and correct the result as soon as they notice a tracking failure.
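The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow as the article describes it, not the project's actual code: `track_anything`, `segment_from_click`, `propagate`, `refine`, and `corrections` are hypothetical names, and the three callables merely stand in for SAM and XMem rather than calling their real APIs. Passing only the previous frame and mask to `propagate` is also a simplifying assumption.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")
Point = Tuple[int, int]


def track_anything(
    frames: Sequence[Frame],
    click: Point,
    segment_from_click: Callable[[Frame, Point], Mask],  # hypothetical stand-in for SAM
    propagate: Callable[[Frame, Mask, Frame], Mask],     # hypothetical stand-in for XMem
    refine: Callable[[Frame, Mask], Mask],               # hypothetical stand-in for SAM refinement
    corrections: Optional[Dict[int, Point]] = None,      # frame index -> corrective click
) -> List[Mask]:
    """Return one mask per frame, following the click -> propagate -> refine loop."""
    if not frames:
        return []
    corrections = corrections or {}

    # Step 1: the user clicks on the target in the first frame.
    masks = [segment_from_click(frames[0], click)]

    for i in range(1, len(frames)):
        if i in corrections:
            # The user paused here and re-clicked to fix a tracking failure.
            mask = segment_from_click(frames[i], corrections[i])
        else:
            # Step 2: predict a coarse mask from the previous frame and mask.
            coarse = propagate(frames[i - 1], masks[-1], frames[i])
            # Step 3: refine the coarse prediction into a more precise mask.
            mask = refine(frames[i], coarse)
        masks.append(mask)
    return masks
```

Keeping the models behind plain callables keeps the sketch independent of any particular library; a real implementation would pass in wrappers around the actual models.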

TAM was evaluated on the DAVIS-2016 validation set and the DAVIS-2017 test-dev set. Most notably, the findings show that TAM performs well in challenging and complex scenes. With only click initialization and one-round inference, it handles multi-object separation, target deformation, scale change, and camera motion well, which demonstrates its tracking and segmentation abilities.

The proposed Track Anything Model (TAM) supports a range of applications in flexible video tracking and segmentation, including but not limited to the following:

  • Efficient video annotation: TAM can segment regions of interest in videos and lets users choose which objects to track. This makes it useful for video annotation tasks such as video object tracking and video object segmentation.
  • Long-term object tracking: Long-term tracking has many real-world uses and is attracting increasing research attention. TAM is better suited to real-world applications because it can handle the frequent shot changes found in long videos.
  • An easy-to-use video editor: The Track Anything Model segments objects, and its object segmentation masks let users selectively cut out or reposition any object in a video.
  • A toolkit for visualizing and developing video-related tasks: The team also provides visual user interfaces for various video tasks, including VOS, VOT, and video inpainting, to make them easier to use. With the toolkit, users can test their models on real-world videos and see the results in real time.
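To make the editing use case concrete, here is a minimal sketch of how a per-frame segmentation mask can be used to cut an object out of a frame. It is an illustration under stated assumptions, not TAM's editing code: the function name `cut_out`, the nested-list image representation, and the `fill` colour are all hypothetical choices made to keep the example dependency-free. Real pipelines would typically operate on array types instead.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B)
Image = List[List[Pixel]]         # rows of pixels
BinaryMask = List[List[bool]]     # True where the tracked object is


def cut_out(frame: Image, mask: BinaryMask, fill: Pixel = (0, 0, 0)) -> Tuple[Image, Image]:
    """Split a frame into (object_only, background_only) using a binary mask.

    Pixels outside the respective region are replaced with `fill`.
    """
    # The mask must have exactly the same height and row widths as the frame.
    if len(frame) != len(mask) or any(len(r) != len(m) for r, m in zip(frame, mask)):
        raise ValueError("frame and mask must have the same shape")

    object_only = [
        [px if inside else fill for px, inside in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]
    background_only = [
        [fill if inside else px for px, inside in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]
    return object_only, background_only
```

Applying `cut_out` to every frame with that frame's mask yields an object-only clip and a background-only clip; the hole left in the background is the kind of region a video inpainting step would then fill.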



Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.