Ability to Track and Segment Anything in Dynamic Videos with SAM-PT


Numerous applications, such as robotics, autonomous driving, and video editing, benefit from video segmentation. Deep neural networks have made great progress on this task in recent years, but existing approaches struggle with unseen data, especially in zero-shot scenarios: they require task-specific video segmentation data for fine-tuning to maintain consistent performance across diverse scenarios. When transferred in a zero-shot setting to video domains they have not been trained on, containing object categories outside the training distribution, current methods in semi-supervised Video Object Segmentation (VOS) and Video Instance Segmentation (VIS) show clear performance gaps.

Using successful models from the image segmentation domain for video segmentation tasks offers a potential solution to these problems. The Segment Anything Model (SAM) is one such promising candidate. SAM is a strong foundation model for image segmentation, trained on the SA-1B dataset of 11 million images and more than 1 billion masks. This enormous training set underpins SAM's outstanding zero-shot generalization. The model has proven to operate reliably across various downstream tasks under zero-shot transfer protocols, is highly promptable, and can produce high-quality masks from a single foreground point.

SAM thus exhibits strong zero-shot image segmentation skills, but it is not naturally suited to video segmentation. Recent work has adapted SAM to video: TAM, for instance, combines SAM with the state-of-the-art memory-based mask tracker XMem, while SAM-Track pairs SAM with DeAOT. Although these techniques largely recover SAM's performance on in-distribution data, they fall short in more difficult zero-shot conditions. Other methods that do not rely on SAM, such as SegGPT, can solve many segmentation problems through visual prompting, but they still require a mask annotation for the first video frame.

This remains a substantial obstacle to zero-shot video segmentation, especially as researchers pursue simple techniques that generalize to new situations and reliably produce high-quality segmentation across diverse video domains. Researchers from ETH Zurich, HKUST, and EPFL introduce SAM-PT (Segment Anything Meets Point Tracking), the first method to segment videos by combining sparse point tracking with SAM. Instead of relying on mask propagation or object-centric dense feature matching, they propose a point-driven approach that exploits the detailed local structure encoded in videos to track points.

As a result, it only requires sparse points to be annotated in the first frame to indicate the target object, and it generalizes better to unseen objects, a strength demonstrated on the open-world UVO benchmark. This strategy effectively extends SAM's capabilities to video segmentation while preserving its intrinsic flexibility. Leveraging modern point trackers such as PIPS, SAM-PT prompts SAM with the sparse point trajectories these trackers predict. The authors found that initializing the points to track with K-Medoids cluster centers computed from the mask label was the most effective way to prompt SAM.
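To make the K-Medoids initialization concrete, here is a minimal, self-contained sketch (not the authors' code) of picking query points from a first-frame mask. The function name and parameters are illustrative assumptions; the key property of K-Medoids over K-Means is that every chosen center is an actual mask pixel, so no query point can land outside a non-convex object.

```python
import numpy as np

def kmedoids_points_from_mask(mask, k=8, n_iter=10, seed=0):
    """Pick k query points inside a binary mask via a simple K-Medoids loop.

    Medoids are actual mask pixels, so every chosen point is guaranteed to
    lie on the object -- unlike K-Means centroids, which can fall outside
    a non-convex mask. Illustrative sketch, not the SAM-PT implementation.
    """
    rng = np.random.default_rng(seed)
    pts = np.argwhere(mask)  # (N, 2) array of (row, col) foreground pixels
    medoids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each foreground pixel to its nearest medoid.
        d = np.linalg.norm(pts[:, None, :] - medoids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each medoid to the cluster member minimizing total distance.
        for j in range(k):
            members = pts[labels == j]
            if len(members) == 0:
                continue
            intra = np.linalg.norm(
                members[:, None, :] - members[None, :, :], axis=-1)
            medoids[j] = members[intra.sum(axis=1).argmin()]
    return medoids
```

The resulting `(k, 2)` array of pixel coordinates would then serve as the positive points handed to the point tracker for the first frame.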

Tracking both positive and negative points makes it possible to distinguish target objects cleanly from the background. They propose different mask decoding passes that use both kinds of points to further refine the output masks. They also developed a point re-initialization scheme that improves tracking precision over time: points that have become unreliable or occluded are discarded, and points are added from parts of the object that become visible in later frames, such as when the object rotates.
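A single re-initialization step could be sketched as follows. This is an illustrative toy, not the paper's code: the function name, the fixed target count `k`, and the uniform resampling are all assumptions; the point labels follow SAM's prompt convention of 1 for foreground and 0 for background.

```python
import numpy as np

def reinit_points(points, labels, visibility, new_mask, k=8, seed=0):
    """One point re-initialization step (illustrative sketch).

    Drops tracked points the tracker marked as occluded/unreliable, then
    tops the positive set back up to k points by sampling from the mask
    predicted for the current frame (e.g. newly visible object parts).
    Labels use SAM's convention: 1 = foreground, 0 = background.
    """
    rng = np.random.default_rng(seed)
    keep = visibility.astype(bool)
    points, labels = points[keep], labels[keep]
    n_pos = int((labels == 1).sum())
    if n_pos < k:
        # Resample replacement positives from the currently visible mask.
        candidates = np.argwhere(new_mask)
        fresh = candidates[
            rng.choice(len(candidates), size=k - n_pos, replace=False)]
        points = np.vstack([points, fresh])
        labels = np.concatenate(
            [labels, np.ones(k - n_pos, dtype=labels.dtype)])
    return points, labels
```

In a full pipeline this step would run every few frames, using the tracker's visibility estimates and SAM's latest predicted mask as inputs.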

Notably, their experiments show that SAM-PT matches or outperforms existing zero-shot approaches on several video segmentation benchmarks, even though no video segmentation data was required during training, underscoring the method's adaptability and robustness. In zero-shot settings, SAM-PT can accelerate progress on video segmentation tasks. Their website hosts multiple interactive video demos.
