Advanced Multimodal Reasoning and Action Using ChatGPT and Vision Experts

Large Language Models (LLMs) are advancing rapidly and contributing to notable economic and social change. Among the many artificial intelligence (AI) tools released in recent months, ChatGPT has become one of the most popular. ChatGPT is a natural language processing model that lets users generate meaningful, human-like text. It is based on OpenAI’s GPT transformer architecture, with GPT-4 being the latest language model that powers it.

Computer vision has also advanced quickly, driven by improved network architectures and large-scale model training. Researchers have recently introduced MM-REACT, a system paradigm that composes numerous vision experts with ChatGPT for multimodal reasoning and action. MM-REACT combines individual vision models with the language model in a flexible manner to tackle complicated visual understanding challenges.

MM-REACT was developed to handle a wide range of complex visual tasks that existing vision and vision-language models struggle with. To do this, it uses a prompt design that represents several types of information as text: text descriptions, textualized spatial coordinates, and dense visual signals such as images and videos, which are represented by aligned file names. This design lets ChatGPT accept and process these different types of information alongside the visual input, leading to a more accurate and comprehensive understanding.
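
To make the prompt design concrete, here is a minimal Python sketch of how a caption, box coordinates, and an image file name might be flattened into a single text prompt. The `Box` structure, the function names, the file path, and the exact text layout are all assumptions made for illustration; the source does not specify MM-REACT's actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A detected object with pixel coordinates (hypothetical structure)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def textualize_boxes(boxes):
    """Render box coordinates as plain text lines a language model can read."""
    return "\n".join(
        f"{b.label}: (x1={b.x1}, y1={b.y1}, x2={b.x2}, y2={b.y2})" for b in boxes
    )


def build_prompt(question, image_path, caption, boxes):
    """Combine a file-name placeholder, a text description, and textualized coordinates."""
    return (
        f"Image file: {image_path}\n"
        f"Caption: {caption}\n"
        f"Objects:\n{textualize_boxes(boxes)}\n"
        f"Question: {question}"
    )


# Illustrative values only; the path and detections are made up.
prompt = build_prompt(
    "What is on the table?",
    "images/kitchen.jpg",
    "A kitchen with a wooden table.",
    [Box("cup", 10, 20, 50, 80)],
)
print(prompt)
```

The image itself never enters the prompt. Only its file name does, so a text-only model can refer to the image and pass that reference along to a vision model.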

MM-REACT adds multimodal functionality to ChatGPT by pairing it with a pool of vision experts. To let the system accept images as input, the image’s file path is passed to ChatGPT as a placeholder. Whenever the system needs specific information from the image, such as a celebrity’s name or box coordinates, ChatGPT asks the relevant vision expert for help. The expert’s output is then serialized as text and combined with the input to prompt ChatGPT again. If no external expert is needed, the response is returned directly to the user.
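
The loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `fake_llm` stands in for ChatGPT, `face_detector` is a stub expert that returns a fixed string, and the `CALL <expert> <path>` request format is a hypothetical convention chosen to keep the example short.

```python
def face_detector(image_path):
    """Stub vision expert; a real one would run a model on the image file."""
    return "face box: (12, 30, 140, 200)"


def fake_llm(context):
    """Stand-in for ChatGPT: requests an expert once, then answers from its output."""
    if "Observation:" in context:
        return "The face is at (12, 30, 140, 200)."
    return "CALL face_detector images/photo.jpg"


def run(user_input, llm, experts, max_steps=5):
    """Alternate between the language model and vision experts until it answers."""
    context = user_input
    for _ in range(max_steps):
        reply = llm(context)
        if not reply.startswith("CALL "):
            return reply  # no expert needed: return the response to the user
        _, name, path = reply.split(maxsplit=2)
        observation = experts[name](path)  # the expert's output, already text
        # Append the serialized output to the context and prompt the model again.
        context += f"\n{reply}\nObservation: {observation}"
    return "Stopped: step limit reached."


answer = run(
    "Where is the face in images/photo.jpg?",
    fake_llm,
    {"face_detector": face_detector},
)
print(answer)
```

The `max_steps` cap is an added safeguard against a model that keeps requesting experts; the source does not mention one.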

ChatGPT learns how to use the vision experts through instructions added to its prompt. These describe each expert’s capability, input argument type, and output type, and include a few in-context examples per expert. In addition, ChatGPT is instructed to emit a special watchword, which the system detects with regular-expression matching to invoke the corresponding expert.
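
As a rough illustration of this mechanism, the Python sketch below builds an instruction block from per-expert descriptions and uses a regular expression to detect the watchword in a model reply. The watchword string `Assistant,`, the `ExpertSpec` fields, and the invocation syntax are assumptions for the example; the source does not state the exact wording MM-REACT uses.

```python
import re
from dataclasses import dataclass


@dataclass
class ExpertSpec:
    """What the prompt tells the model about one expert (hypothetical fields)."""
    name: str
    capability: str
    input_type: str
    output_type: str
    example: str


# Assumed watchword for illustration; not confirmed by the source.
WATCHWORD = "Assistant,"


def build_instructions(specs):
    """Describe each expert and the invocation format in plain text."""
    lines = [f'To use a tool, write "{WATCHWORD} <tool name>: <input>".']
    for s in specs:
        lines.append(
            f"- {s.name}: {s.capability} "
            f"Input: {s.input_type}. Output: {s.output_type}.\n"
            f"  Example: {s.example}"
        )
    return "\n".join(lines)


# Matches the watchword followed by an expert name and one whitespace-free argument.
INVOCATION = re.compile(re.escape(WATCHWORD) + r"\s*(\w+):\s*(\S+)")


def parse_invocation(reply):
    """Return (expert_name, argument) if the reply invokes an expert, else None."""
    match = INVOCATION.search(reply)
    return match.groups() if match else None


specs = [
    ExpertSpec(
        name="ocr",
        capability="Reads text in an image.",
        input_type="image file path",
        output_type="recognized text",
        example=f"{WATCHWORD} ocr: images/receipt.png",
    )
]
print(build_instructions(specs))
print(parse_invocation("Assistant, ocr: images/receipt.png"))
print(parse_invocation("The total is 12."))
```

A reply without the watchword yields `None`, which corresponds to the case where the response goes straight back to the user.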

Zero-shot experiments show that MM-REACT effectively handles the capabilities it targets, solving a wide range of advanced visual tasks that require complex visual understanding. The authors share several examples, including one in which MM-REACT solves linear equations displayed in an image. It can also perform concept understanding, for example by naming the products in an image and their ingredients. In conclusion, this system paradigm effectively combines language and vision expertise and is capable of achieving advanced visual intelligence.
