Regardless of the industry they are employed in, artificial intelligence (AI) and machine learning (ML) technologies have always attempted to improve the quality of life for people. One of the major applications of AI in recent times is to design and create agents that can accomplish decision-making tasks across various domains. For instance, large language models like GPT-3 and PaLM and vision models like CLIP and Flamingo have proven to be exceptionally good at zero-shot learning in their respective fields. However, there is one prime drawback associated with training such agents. This is because such agents exhibit the inherent property of environmental diversity during training. In simple terms, training for different tasks or environments necessitates the use of various state spaces, which can occasionally impede learning, knowledge transfer, and the generalization ability of models across domains. Moreover, for reinforcement learning (RL) based tasks, creating reward functions for specific tasks across environments becomes difficult.
Working on this problem statement, a team from Google Research investigated whether such tools can be used to construct more all-purpose agents. For their research, the team specifically focused on text-guided image synthesis, wherein the desired goal in the form of text is fed to a planner, which creates a sequence of frames that represent the intended course of action, after which control actions are extracted from the generated video. The Google team, thus, proposed a Universal Policy (UniPi) that addresses challenges in environmental diversity and reward specification in their recent paper titled “Learning Universal Policies via Text-Guided Video Generation.” The UniPi policy uses text as a universal interface for task descriptions and video as a universal interface for communicating action and observation behavior in various situations. Specifically, the team designed a video generator as a planner that accepts the current image frame and a text prompt stating the current goal as input to generate a trajectory in the form of an image sequence or video. The generated video is then fed into an inverse dynamics model that extracts underlying actions executed. This approach stands out as it allows the universal nature of language and video to be leveraged in generalizing to novel goals and tasks across diverse environments.
Over the past few years, significant progress has been achieved in the text-guided image synthesis domain, which has yielded models with an exceptional capability of generating sophisticated images. This further motivated the team to choose this as their decision-making task. The UniPi approach proposed by Google researchers mainly consists of four components: trajectory consistency through tiling, hierarchical planning, flexible behavior modulation, and task-specific action adaptation, which are described in detail as follows:
1. Trajectory consistency through tiling:
Existing text-to-video methods often produce videos with a substantially changing underlying environment state. However, ensuring the environment is constant throughout all timestamps is essential to build an accurate trajectory planner. Thus, to enforce environment consistency in conditional video synthesis, the researchers additionally provide the observed image while denoising each frame in the synthesized video. In order to retain the underlying environment state across time, UniPi directly concatenates each noisy intermediate frame with the conditioned observed image across sampling steps.
2. Hierarchical Planning:
It’s difficult to generate all the necessary actions when making plans in complex and sophisticated environments that require a lot of time and measures. Planning methods overcome this issue by leveraging a natural hierarchy by creating rough plans in a smaller space and refining them into more detailed plans. Similarly, in the video generation process, UniPi first creates videos at a coarse level demonstrating the desired agent behavior and then improves them to make them more realistic by filling in the missing frames and making them smoother. This is done by using a hierarchy of steps, with each step improving the video quality until the desired level of detail is reached.
3. Flexible behavioral modulation:
While planning a sequence of actions for a smaller goal, one can easily include external constraints to modify the generated plan. This can be done by incorporating a probabilistic prior that reflects the desired limitations based on the properties of the plan. The prior can be described using a learned classifier or a Dirac delta distribution on a particular image to guide the plan toward specific states. This approach is also compatible with UniPi. The researchers employed the video diffusion algorithm to train the text-conditioned video generation model. This algorithm consists of encoded pre-trained language features from the Text-To-Text Transfer Transformer (T5).
4. Task-specific action adaptation:
A small inverse dynamics model is trained to translate video frames into low-level control actions using a set of synthesized videos. This model is separate from the planner and can be trained on a separate smaller dataset generated by a simulator. The inverse dynamics model takes input frames and text descriptions of the current goals, synthesizes the image frames, and generates a sequence of actions to predict future steps. An agent then executes these low-level control actions using closed-loop control.
To summarize, the researchers from Google have made an impressive contribution by showcasing the value of using text-based video generation to represent policies capable of enabling combinatorial generalization, multi-task learning, and real-world transfer. The researchers evaluated their approach on a number of novel language-based tasks, and it was concluded that UniPi generalizes well to both seen and unknown combinations of language prompts, compared to other baselines such as Transformer BC, Trajectory Transformer, and Diffuser. These encouraging findings highlight the potential of utilizing generative models and the vast data available as valuable resources for creating versatile decision-making systems.