OmniMotion: A Revolutionary AI Method for Estimating Dense and Long-Range Motion in Videos


Sparse feature tracking and dense optical flow have historically been the two main methodologies used in motion estimation algorithms. Both types of methods have been successful in their particular applications. However, neither representation completely captures the motion of a video: sparse tracking cannot describe the motion of all pixels, while pairwise optical flow cannot capture motion trajectories over long temporal windows. To bridge this gap, many methods have been proposed to predict dense and long-range pixel trajectories in videos. These range from simply chaining two-frame optical flow fields to more advanced algorithms that directly predict per-pixel trajectories across multiple frames.
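To make the chaining approach (and its drift problem) concrete, here is a minimal NumPy sketch of naive flow chaining. It is not from the paper; the layouts of the hypothetical `flows` and `start_points` inputs are assumptions made purely for illustration:

```python
import numpy as np

def chain_flow(flows, start_points):
    """
    Naively chain pairwise optical flow fields into long-range trajectories.

    flows: list of (H, W, 2) arrays, where flows[t][y, x] is the assumed
           displacement of pixel (x, y) from frame t to frame t+1.
    start_points: (N, 2) array of (x, y) pixel coordinates in frame 0.

    Returns a (T+1, N, 2) array of point positions over time.
    """
    H, W, _ = flows[0].shape
    points = start_points.astype(np.float64)
    trajectory = [points.copy()]
    for flow in flows:
        # Sample the flow at the rounded current positions; a real
        # implementation would use bilinear interpolation here.
        xs = np.clip(np.round(points[:, 0]).astype(int), 0, W - 1)
        ys = np.clip(np.round(points[:, 1]).astype(int), 0, H - 1)
        points = points + flow[ys, xs]
        trajectory.append(points.copy())
    return np.stack(trajectory)
```

Because each step only looks at two adjacent frames, sampling errors and occlusions compound over time, which is exactly the accumulated-error behavior the article describes next.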

However, all of these approaches ignore information outside a limited temporal or spatial context when estimating motion. This locality can cause spatiotemporal inconsistencies and accumulated errors over long trajectories. Even when past techniques did take long-range context into account, they operated in the 2D domain, which leads to loss of tracking during occlusions. Producing dense and long-range trajectories therefore still presents several challenges, including tracking points across occlusions, preserving coherence in space and time, and maintaining accurate tracks over long durations. In this study, researchers from Cornell University, Google Research, and UC Berkeley present a comprehensive method for estimating full-length motion trajectories for every pixel in a video, using all available video information.

Their approach, which they call OmniMotion, uses a quasi-3D representation in which a collection of local-canonical bijections maps a canonical 3D volume to per-frame local volumes. These bijections describe a combination of camera and scene motion as a flexible relaxation of dynamic multi-view geometry. The representation can track all pixels, even occluded ones, and guarantees cycle consistency (“Everything, Everywhere”). They optimize this representation per video to jointly solve for the motion of the entire video “All at Once.” After optimization, the representation can be queried at any continuous coordinate in the video to obtain a motion trajectory that spans the entire video.
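The structure of that representation can be sketched in a few lines. The snippet below is a conceptual illustration only: the paper uses learned invertible neural networks and composites samples along rays in the canonical volume, whereas here a toy per-frame affine bijection (the hypothetical `AffineBijection` class and `map_point` helper) stands in for the local-canonical mapping:

```python
import numpy as np

class AffineBijection:
    """Toy stand-in for a per-frame invertible mapping T_i between a frame's
    local 3D volume and the shared canonical volume. An affine map is used
    only because it is trivially invertible."""
    def __init__(self, A, b):
        self.A = np.asarray(A, dtype=float)   # (3, 3), assumed invertible
        self.b = np.asarray(b, dtype=float)   # (3,)

    def to_canonical(self, x_local):
        return self.A @ x_local + self.b

    def to_local(self, u_canonical):
        return np.linalg.solve(self.A, u_canonical - self.b)

def map_point(bijections, i, j, x_i):
    """Map a 3D point from frame i's local volume to frame j's local volume
    by passing through the canonical volume: x_j = T_j^{-1}(T_i(x_i))."""
    u = bijections[i].to_canonical(x_i)
    return bijections[j].to_local(u)

# Example: two frames whose local volumes are shifted copies of the canonical one.
T = [AffineBijection(np.eye(3), [0.0, 0.0, 0.0]),
     AffineBijection(np.eye(3), [0.1, -0.2, 0.0])]
x0 = np.array([0.3, 0.5, 1.0])
x1 = map_point(T, 0, 1, x0)        # point expressed in frame 1's local volume
x0_back = map_point(T, 1, 0, x1)   # round trip recovers x0 exactly
```

Because every frame-to-frame correspondence is the composition of two bijections through the same canonical volume, a round trip from frame i to frame j and back returns the original point by construction, which is how cycle consistency is guaranteed in this kind of representation.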

In summary, they present a method that:

  1. Generates globally consistent full-length motion trajectories for all points in an entire video.
  2. Can track points through occlusions.
  3. Can handle in-the-wild videos with any combination of camera and scene motion.

They quantitatively demonstrate these strengths on the TAP-Vid tracking benchmark, where they attain state-of-the-art performance and surpass all previous techniques by a large margin. They have released several demo videos on their website and plan to release the code soon.

https://omnimotion.github.io/

As seen in the motion trajectories above, they present a novel technique for computing full-length motion trajectories for every pixel in every frame of a video. For clarity, only sparse trajectories on foreground objects are displayed, even though the method computes motion for all pixels. Their approach produces accurate, coherent long-range motion even for quickly moving objects, and reliably tracks through occlusions, as demonstrated by the dog and swing examples. The moving object is shown in the second row at different points in time to provide context.