HMN 2025: How a multi-modal AI agent mimics human thinking for long video analysis and reasoning

A novel multi-modal agent to facilitate long video understanding by AI — Credit: GitHub: https://github.com/yeliudev/VideoThoughts

While Artificial Intelligence (AI) technology is evolving rapidly, AI models still struggle to understand long videos. A research team from The Hong Kong Polytechnic University (PolyU) has developed a novel video-language agent, VideoMind, that enables AI models to perform long video reasoning and question-answering tasks by emulating humans’ way of thinking.

The VideoMind framework incorporates an innovative Chain-of-Low-Rank Adaptation (LoRA) strategy to reduce the demand for computational resources and power, advancing the application of generative AI in video analysis. The findings have been submitted to world-leading AI conferences.

Videos, especially those longer than 15 minutes, carry information that unfolds over time, such as the sequence of events, causality, coherence and scene transitions. To understand the video content, AI models therefore need not only to identify the objects present, but also to take into account how they change throughout the video. As visuals in videos occupy a large number of tokens, video understanding requires vast amounts of computing capacity and memory, making it difficult for AI models to process long videos.
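To give a feel for why long videos are costly, the short sketch below estimates how many visual tokens a video produces. The sampling rate (1 frame per second) and the 256 tokens per frame are hypothetical values chosen for illustration; they are not figures reported by the researchers.

```python
def estimate_visual_tokens(duration_min: float, fps: float, tokens_per_frame: int) -> int:
    """Rough count of visual tokens for a video sampled at a fixed frame rate."""
    frames = int(duration_min * 60 * fps)
    return frames * tokens_per_frame


# Hypothetical settings: 1 sampled frame per second, 256 tokens per frame.
one_minute = estimate_visual_tokens(duration_min=1, fps=1.0, tokens_per_frame=256)
fifteen_minutes = estimate_visual_tokens(duration_min=15, fps=1.0, tokens_per_frame=256)

print(one_minute)       # 15360
print(fifteen_minutes)  # 230400
```

Under these assumed settings the token count grows linearly with duration, so a 15-minute video already yields hundreds of thousands of visual tokens before any question is asked.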

Prof. Changwen Chen, Interim Dean of the PolyU Faculty of Computer and Mathematical Sciences and Chair Professor of Visual Computing, and his team have achieved a breakthrough in research on long video reasoning by AI. In designing VideoMind, they drew on a human-like process of video understanding and introduced a role-based workflow. The four roles included in the framework are:

the Planner, to coordinate all other roles for each query;
the Grounder, to localize and retrieve relevant moments;
the Verifier, to validate the information accuracy of the retrieved moments and select the most reliable one;
and the Answerer, to generate the query-aware answer.

This progressive approach to video understanding helps address the challenge of temporal-grounded reasoning that most AI models face.
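The four roles can be pictured as a small pipeline. The sketch below is a toy illustration only, not the team's implementation: every function name, the equal-width windowing in the Grounder and the scoring function are hypothetical stand-ins for what would be model calls in the real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class Candidate:
    moment: Moment
    score: float = 0.0


def planner(query: str) -> List[str]:
    """Planner: choose which roles to call. Toy rule: always run the full chain."""
    return ["grounder", "verifier", "answerer"]


def grounder(query: str, video_len: float, n_windows: int = 3) -> List[Candidate]:
    """Grounder: propose candidate moments. Stub: equal-width windows."""
    step = video_len / n_windows
    return [Candidate((i * step, (i + 1) * step)) for i in range(n_windows)]


def verifier(query: str, candidates: List[Candidate],
             score_fn: Callable[[str, Moment], float]) -> Candidate:
    """Verifier: score each candidate and keep the most reliable one."""
    for cand in candidates:
        cand.score = score_fn(query, cand.moment)
    return max(candidates, key=lambda cand: cand.score)


def answerer(query: str, moment: Moment) -> str:
    """Answerer: produce a reply conditioned on the query and the chosen moment."""
    start, end = moment
    return f"Answer to {query!r} based on {start:.0f}s-{end:.0f}s"


def run_agent(query: str, video_len: float,
              score_fn: Callable[[str, Moment], float]) -> str:
    plan = planner(query)
    candidates = [Candidate((0.0, video_len))]  # fallback: the whole video
    if "grounder" in plan:
        candidates = grounder(query, video_len)
    best = candidates[0]
    if "verifier" in plan:
        best = verifier(query, candidates, score_fn)
    return answerer(query, best.moment)


def toy_score(query: str, moment: Moment) -> float:
    """Hypothetical relevance score: prefer moments centred near 150 seconds."""
    return -abs((moment[0] + moment[1]) / 2 - 150.0)


print(run_agent("What happens?", 300.0, toy_score))
# Answer to 'What happens?' based on 100s-200s
```

The point of the structure is that the answer is produced from a short, verified moment rather than from the whole video.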

Another core innovation of the VideoMind framework lies in its adoption of a Chain-of-LoRA strategy. LoRA is a fine-tuning technique that has emerged in recent years. It adapts AI models for specific uses without performing full-parameter retraining. The innovative chain-of-LoRA strategy pioneered by the team involves applying four lightweight LoRA adapters in a unified model, each of which is designed for calling a specific role.
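As background on why LoRA adapters are lightweight: in the standard LoRA formulation, a frozen weight matrix W is adjusted by a low-rank product, W' = W + (alpha / r) * B * A, where B has shape d_out x r and A has shape r x d_in. The sketch below illustrates that general formula in plain Python; the layer size (4096 x 4096) and rank (16) are assumed values for illustration and are not taken from the VideoMind paper.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Trainable parameters: full fine-tuning of one layer vs. a LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha: float, rank: int):
    """Return W + (alpha / rank) * B @ A; the base weight W itself stays frozen."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Assumed layer size and rank, for illustration only.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(full, lora)    # 16777216 131072
print(full // lora)  # 128
```

With these assumed numbers the adapter trains 128 times fewer parameters than the layer it modifies, which is what makes it practical to keep several adapters alongside one base model.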

Credit: VideoMind

With this strategy, the model can dynamically activate role-specific LoRA adapters during inference via self-calling to seamlessly switch among these roles, eliminating the need and cost of deploying multiple models while enhancing the efficiency and flexibility of the single model.
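The idea of one shared model that swaps adapters per role can be shown with a deliberately tiny stand-in. In the sketch below the "model" is a single number and each adapter is a small offset; the class name, method names and values are all hypothetical and serve only to show the switching pattern, not how VideoMind is built.

```python
class ChainOfAdaptersModel:
    """Toy stand-in: one shared base weight plus several named lightweight adapters."""

    def __init__(self, base_weight: float, adapters: dict):
        self.base_weight = base_weight  # shared by every role, never modified
        self.adapters = dict(adapters)  # role name -> small additive delta
        self.active = None
        self.call_log = []

    def set_role(self, role: str) -> None:
        """Activate one adapter; no second copy of the base model is created."""
        if role not in self.adapters:
            raise KeyError(f"unknown role: {role}")
        self.active = role

    def forward(self, x: float) -> float:
        """Run the base model with the active adapter's delta applied."""
        delta = self.adapters[self.active] if self.active is not None else 0.0
        self.call_log.append(self.active)
        return (self.base_weight + delta) * x


model = ChainOfAdaptersModel(
    base_weight=1.0,
    adapters={"planner": 0.1, "grounder": 0.2, "verifier": 0.3, "answerer": 0.4},
)

# Switch roles in sequence during one inference run.
for role in ["planner", "grounder", "verifier", "answerer"]:
    model.set_role(role)
    model.forward(2.0)

print(model.call_log)  # ['planner', 'grounder', 'verifier', 'answerer']
```

Only the small per-role deltas differ between calls, which mirrors the article's point that a single deployed model can serve all four roles.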

VideoMind is open source on GitHub and Hugging Face, and the related research is available on the arXiv preprint server. Details of the experiments conducted to evaluate its effectiveness in temporal-grounded video understanding across 14 diverse benchmarks are also available. Comparing VideoMind with some state-of-the-art AI models, including GPT-4o and Gemini 1.5 Pro, the researchers found that the grounding accuracy of VideoMind outperformed all competitors in challenging tasks involving videos with an average duration of 27 minutes.

Notably, the team included two versions of VideoMind in the experiments: one with a smaller, 2 billion (2B) parameter model, and another with a bigger, 7 billion (7B) parameter model. The results showed that, even at the 2B size, VideoMind still yielded performance comparable with many of the other 7B size models.

Prof. Chen said, “Humans switch among different thinking modes when understanding videos: breaking down tasks, identifying relevant moments, revisiting these to confirm details and synthesizing their observations into coherent answers. The process is very efficient with the human brain using only about 25 watts of power, which is about a million times lower than that of a supercomputer with equivalent computing power.

“Inspired by this, we designed the role-based workflow that allows AI to understand videos like humans, while leveraging the chain-of-LoRA strategy to minimize the need for computing power and memory in this process.”

AI is at the core of global technological development. The advancement of AI models is nonetheless constrained by insufficient computing power and excessive power consumption. Built upon a unified, open-source model Qwen2-VL and augmented with additional optimization tools, the VideoMind framework has lowered the technological cost and the threshold for deployment, offering a feasible solution to the bottleneck of reducing power consumption in AI models.

Prof. Chen added, “VideoMind not only overcomes the performance limitations of AI models in video processing, but also serves as a modular, scalable and interpretable multimodal reasoning framework. We envision that it will expand the application of generative AI to various areas, such as intelligent surveillance, sports and entertainment video analysis, video search engines and more.”

More information:
Ye Liu et al, VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning, arXiv (2025). DOI: 10.48550/arxiv.2503.13444

Journal information:
arXiv

Provided by
Hong Kong Polytechnic University
