UC Berkeley researchers have developed Video Prediction Rewards (VIPER), a new algorithm that uses learned video prediction models as action-free reward signals for reinforcement learning

Designing a reward function by hand is time-consuming and can produce unintended behavior. This is a major obstacle to building general-purpose decision-making agents with reinforcement learning (RL).

Previous video-based learning methods reward agents whose current observations most resemble those of experts. Because the reward depends only on the current observation, these methods cannot capture meaningful behavior over time. Generalization is also hindered by adversarial training techniques, which are prone to mode collapse.

Researchers from UC Berkeley have created a new technique, Video Prediction Rewards (VIPER), for extracting rewards from video prediction models. VIPER learns reward functions from raw videos and can generalize to domains not seen during training.

First, VIPER trains a video prediction model on expert videos. An RL agent is then trained to maximize the log-likelihood of its own trajectories under that video model. The goal is to bring the distribution of the agent's trajectories close to the distribution captured by the video model. Using the video model's likelihoods directly as the reward signal trains the agent to follow a trajectory distribution similar to that of the video model. Unlike rewards computed at the level of single observations, video model rewards quantify the temporal consistency of behavior. Because evaluating likelihoods is much faster than performing video model rollouts, this approach also allows faster training and more interactions with the environment.
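The idea of using a video model's likelihoods as a reward can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the researchers' implementation: `ToyVideoModel` is a hypothetical stand-in that treats each frame as a flat feature vector and conditions only on the previous frame, whereas a real video prediction model would be far larger and would typically use more context.

```python
import numpy as np


class ToyVideoModel:
    """Hypothetical stand-in for a learned video prediction model.

    Assumption: each frame is a flat feature vector, and the next frame is
    modeled as the previous frame plus a constant drift with Gaussian noise.
    """

    def __init__(self, sigma=0.5):
        self.sigma = sigma
        self.drift = None

    def fit(self, expert_videos):
        # expert_videos has shape (num_videos, num_frames, frame_dim).
        diffs = expert_videos[:, 1:] - expert_videos[:, :-1]
        self.drift = diffs.mean(axis=(0, 1))

    def log_prob(self, prev_frame, next_frame):
        # Log-density of next_frame under N(prev_frame + drift, sigma^2 I).
        mean = prev_frame + self.drift
        sq_err = np.sum((next_frame - mean) ** 2)
        dim = next_frame.size
        norm = 0.5 * dim * np.log(2 * np.pi * self.sigma**2)
        return -0.5 * sq_err / self.sigma**2 - norm


def likelihood_rewards(model, frames):
    """Return one reward per transition: log p(frames[t+1] | frames[t])."""
    return np.array(
        [model.log_prob(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    )


# Toy expert videos: every feature increases by 1 per frame.
expert_videos = np.cumsum(np.ones((4, 6, 2)), axis=1)
model = ToyVideoModel(sigma=0.5)
model.fit(expert_videos)

# A trajectory that moves like the experts versus one that stays still.
expert_like = likelihood_rewards(model, expert_videos[0])
stationary = likelihood_rewards(model, np.zeros((6, 2)))
print(expert_like.sum() > stationary.sum())  # expert-like motion scores higher
```

Each reward here requires only one likelihood evaluation per transition, with no sampling of future frames, which is the property the article credits for faster training.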

In a thorough study across 15 DMC tasks, 6 RLBench tasks, and 7 Atari tasks, the team demonstrates that VIPER can achieve expert-level control without using task rewards. According to the findings, VIPER-trained RL agents outperform adversarial imitation learning across the board. Because VIPER supplies its reward at the environment level, it is agnostic to which RL agent is used. The video models already generalize to arm/task combinations not encountered during training, even in the small-dataset regime.
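The agent-agnostic property can be pictured as an environment wrapper that swaps the task reward for a likelihood-based one, so any RL agent that consumes `(observation, reward, done)` tuples can be used unchanged. The sketch below is hypothetical: `PointEnv`, `LikelihoodRewardWrapper`, and `toy_log_prob` are illustrative names, and the `reset`/`step` interface is a simplified assumption rather than any specific library's API.

```python
import numpy as np


class PointEnv:
    """Hypothetical toy environment: a 1-D point moved by the action."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.pos = np.zeros(1)
        self.t = 0

    def reset(self):
        self.pos = np.zeros(1)
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        self.pos = self.pos + action
        self.t += 1
        task_reward = 0.0  # the wrapper below never uses this value
        return self.pos.copy(), task_reward, self.t >= self.horizon


def toy_log_prob(prev_obs, obs):
    # Stand-in for a video model likelihood (assumption): experts move +1
    # per step, scored with an unnormalized unit-variance Gaussian.
    return float(-0.5 * np.sum((obs - prev_obs - 1.0) ** 2))


class LikelihoodRewardWrapper:
    """Replaces the task reward with a log-likelihood of the transition."""

    def __init__(self, env, log_prob_fn):
        self.env = env
        self.log_prob_fn = log_prob_fn
        self._prev_obs = None

    def reset(self):
        self._prev_obs = self.env.reset()
        return self._prev_obs

    def step(self, action):
        obs, _task_reward, done = self.env.step(action)
        reward = self.log_prob_fn(self._prev_obs, obs)
        self._prev_obs = obs
        return obs, reward, done


env = LikelihoodRewardWrapper(PointEnv(), toy_log_prob)
env.reset()
_, reward_forward, _ = env.step(np.array([1.0]))    # matches expert motion
_, reward_backward, _ = env.step(np.array([-1.0]))  # moves the wrong way
print(reward_forward, reward_backward)
```

Keeping the reward logic inside the wrapper means the agent's training loop needs no changes, which is one plausible reading of why the choice of agent does not matter.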

The researchers believe that large, pre-trained conditional video models will make more flexible reward functions possible. Building on recent breakthroughs in generative modeling, they see their work as a foundation for scalable reward specification from unlabeled videos.