Annotation Procedure that Obtains Rich Video Descriptions with Accurate Spatio-Temporal Localizations

Vision-and-language research is evolving rapidly, and much of its recent progress comes from datasets that connect still images with captions. Many of these datasets also associate individual words of the caption with specific regions of the image, using a variety of methodologies. A particularly appealing approach is offered by the recent Localized Narratives on still images (ImLNs): annotators describe an image with their voice while simultaneously moving the mouse cursor over the regions they are talking about. This combination of speech and pointing mirrors natural communication and yields a visual grounding for every word. Still images, however, capture only a single moment in time. Annotating videos is even more attractive, since videos portray complete stories, showing events in which multiple characters and objects dynamically interact.

Since annotating videos in this way is time-consuming and complex, the researchers propose an improved annotation protocol that extends ImLNs to videos.

The pipeline of the proposed annotation protocol is illustrated below.

The new protocol lets annotators tell the video’s story in a controlled setting. Annotators first watch the video carefully, identify the main characters (such as “man” or “ostrich”), and select a few key frames that capture the important moments for each character.

Then, for each character separately, the annotator constructs that character’s narrative: they describe the character’s involvement in the different events with their voice while simultaneously moving the mouse cursor over the key frames to highlight the relevant objects and actions. These spoken descriptions cover the character’s name, its attributes, and especially the actions it performs, including interactions with other characters (e.g., “playing with the ostrich”) and with inanimate objects (e.g., “grabbing the cup of food”). For complete context, the annotators also give a short description of the background in a separate phase.
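To make the structure of such an annotation more concrete, here is a minimal sketch of how a single VidLN-style annotation could be represented. The class and field names (CharacterNarration, key_frames, traces, and so on) are illustrative assumptions for this post, not the dataset’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A time-stamped mouse point: (x, y) in image coordinates plus a timestamp in seconds.
TracePoint = Tuple[float, float, float]


@dataclass
class CharacterNarration:
    """Narration produced for one character of the video (illustrative schema)."""
    name: str                        # e.g. "man" or "ostrich"
    key_frames: List[int]            # indices of the key frames chosen for this character
    narration: str                   # spoken description, transcribed to text
    traces: List[List[TracePoint]]   # one mouse trace per key frame, recorded while speaking


@dataclass
class VideoLocalizedNarrative:
    """One VidLN-style annotation for a whole video (illustrative schema)."""
    video_id: str
    characters: List[CharacterNarration] = field(default_factory=list)
    background: str = ""             # separate brief description of the background


# Toy example with a single character.
example = VideoLocalizedNarrative(
    video_id="demo_video",
    characters=[
        CharacterNarration(
            name="man",
            key_frames=[12, 48, 90],
            narration="The man grabs the cup of food and plays with the ostrich.",
            traces=[[], [], []],     # would hold the mouse points recorded on each key frame
        ),
    ],
    background="A farm enclosure with a wooden fence.",
)
```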

Working on key frames removes the time constraint of narrating while the video plays, while producing a separate narration for each character disentangles complex situations. This disentanglement makes it possible to fully describe events involving multiple characters interacting with each other and with many passive objects. As in ImLN, the protocol uses the mouse trace segments to localize each word. The authors also add several measures to obtain localizations that are more accurate than in the earlier image-based work.
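As a rough illustration of how a trace segment can be attached to each word, the sketch below slices a time-stamped mouse trace using the start and end times of each spoken word, as they would come from a transcription with word-level timestamps. The function name and the exact alignment rule are assumptions made for this example, not the paper’s implementation.

```python
from typing import Dict, List, Tuple

# (x, y, t): cursor position in image coordinates and timestamp in seconds.
TracePoint = Tuple[float, float, float]
# (word, start_time, end_time): one transcribed word with its utterance interval.
TimedWord = Tuple[str, float, float]


def trace_segments_per_word(
    words: List[TimedWord],
    trace: List[TracePoint],
) -> Dict[int, List[TracePoint]]:
    """Assign to each word the mouse points recorded while it was being spoken.

    Returns a mapping from word index to the corresponding slice of the trace.
    Words uttered while the cursor was idle simply get an empty segment.
    """
    segments: Dict[int, List[TracePoint]] = {}
    for i, (_, start, end) in enumerate(words):
        segments[i] = [p for p in trace if start <= p[2] <= end]
    return segments


# Toy example: "the man grabs the cup" with a short synthetic trace.
words = [("the", 0.0, 0.2), ("man", 0.2, 0.5), ("grabs", 0.5, 0.9),
         ("the", 0.9, 1.0), ("cup", 1.0, 1.4)]
trace = [(120.0, 80.0, 0.25), (125.0, 85.0, 0.40),
         (300.0, 200.0, 1.10), (310.0, 210.0, 1.30)]

print(trace_segments_per_word(words, trace))
# The points at t=0.25 and t=0.40 are attached to "man", those after t=1.0 to "cup".
```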

The researchers applied this protocol to videos from several existing datasets, producing the Video Localized Narratives (VidLNs) dataset. The annotated videos depict complex scenarios with interactions between several characters and inanimate objects, which leads to rich stories captured by the detailed annotations. An example is shown below.

The richness of the VidLNs dataset makes it a strong foundation for several tasks, such as Video Narrative Grounding (VNG) and Video Question Answering (VideoQA). The newly introduced VNG task requires a method that localizes the nouns of an input narrative by producing segmentation masks on the video frames. This is challenging because the text often contains several identical nouns, which must be disambiguated using the context provided by the surrounding words. These new benchmarks remain far from solved, but the approaches proposed in the paper make meaningful progress in that direction (refer to the published paper for further details).
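To give an idea of how a VNG prediction could be scored, the sketch below computes the intersection-over-union between a predicted and a ground-truth segmentation mask for each noun on a single frame. This is a simplified illustration under assumed conventions, not the benchmark’s official evaluation code, and the toy score does not capture the disambiguation of repeated nouns mentioned above.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean segmentation masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 0.0


# Toy example: score the masks predicted for two annotated nouns of a narrative.
height, width = 4, 6
gt_masks = {"man": np.zeros((height, width), bool), "cup": np.zeros((height, width), bool)}
gt_masks["man"][1:3, 1:3] = True
gt_masks["cup"][2:4, 4:6] = True

pred_masks = {"man": np.zeros((height, width), bool), "cup": np.zeros((height, width), bool)}
pred_masks["man"][1:3, 1:4] = True    # slightly too wide
pred_masks["cup"][2:4, 4:6] = True    # exact match

for noun, gt in gt_masks.items():
    print(noun, round(mask_iou(pred_masks[noun], gt), 3))
# man 0.667, cup 1.0
```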

This was a summary of Video Localized Narratives, a new form of multimodal video annotation connecting vision and language. If you want to learn more, please refer to the links cited below.