AI Framework for High-Quality Tracking Anything in Videos

Visual object tracking is the backbone of numerous subfields within computer vision, including robot vision and autonomous driving. This job aims to reliably identify the target object in a video sequence. Many state-of-the-art algorithms compete in the Visual Object Tracking (VOT) challenge since it is one of the most important competitions in the tracking field.

The Visual Object Tracking and Segmentation competition (VOTS2023) removes some of the restrictions imposed by previous VOT challenges so that participants can think about object tracking more broadly. As a result, VOTS2023 combines short- and long-term monitoring of a single target and tracking many targets, using target segmentation as the only position specification. This introduces new difficulties, such as precise mask estimate, multi-target trajectory tracking, and recognizing relationships between objects.

A new study by the Dalian University of Technology, China, and DAMO Academy, Alibaba Group, presents a system called HQTrack, which stands for High-Quality Tracking. It comprises primarily a video multi-object segmenter (VMOS) and a mask refiner (MR). To perceive tiny objects in intricate setups, the researchers employ VMOS, an enhanced variation of DeAOT, and cascade a gated propagation module (GPM) at 1/8 scale. In addition, they use Intern-T as their feature extractor to improve the ability to distinguish between different types of objects. In VMOS, the researchers only keep the most recently used frame in the long-term memory, discarding the older ones to make room. However, applying a large segmentation model to improve the tracking masks could be useful. Objects with complicated structures are especially challenging for SAM to predict, and they appear frequently in the VOTS challenge. 

< />

Using an HQ-SAM model that has already been pre-trained, the team may further enhance the quality of the tracking masks. Final tracking results were chosen from VMOS and MR, and they used the outer enclosing boxes of the predicted masks as box prompts to feed into HQ-SAM alongside the original images to obtain the refined masks. HQTrack finishes in second place at the VOTS2023 competition with a quality score of 0.615 on the test set.