First-Explore: A Meta-RL Framework with Separate Explore and Exploit Policies

Successful reinforcement learning (RL) applications include difficult tasks like plasma control, molecular design, game playing, and robot control. Despite this potential, traditional RL is extremely sample-inefficient: learning a task that a human could pick up in a few tries can take an agent hundreds of thousands of episodes.

Studies point to the following reasons for this sample inefficiency:

  • Typical RL cannot condition on a complex prior, such as a human’s common sense or broad experience.
  • Conventional RL cannot tailor each exploration to be as informative as possible; instead, it adapts by repeatedly reinforcing previously rewarded behaviors.
  • Both traditional RL and meta-RL employ the same policy to explore (collect data to improve the policy) and exploit (earn high episode reward).

To address these shortcomings, researchers from the University of British Columbia, the Vector Institute, and a Canada CIFAR AI Chair introduce First-Explore, a lightweight meta-RL framework that learns a pair of policies: an intelligent explore policy and an intelligent exploit policy. First-Explore enables meta-RL to achieve human-level, in-context, sample-efficient learning on unknown hard-exploration domains, including adversarial ones that require sacrificing reward to explore effectively.
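The two-policy idea can be sketched as an in-context loop: the explore policy gathers episodes into a shared context, and the exploit policy then conditions on that context to earn reward. The toy bandit environment and hand-written heuristic policies below are illustrative assumptions, not the authors' implementation (First-Explore learns both policies).

```python
import random

class ToyBandit:
    """One-step episodes: pulling arm i yields a noisy reward."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # single dummy observation

    def step(self, arm):
        reward = self.means[arm] + self.rng.gauss(0.0, 0.1)
        return 0, reward, True  # obs, reward, done

def run_episode(policy, env, context):
    """Roll out one episode; the policy conditions on the shared context."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(obs, context)
        obs, reward, done = env.step(action)
        context.append((action, reward))
        total += reward
    return total

N_ARMS = 5

def explore_policy(obs, context):
    """Systematic exploration: try every arm before repeating any."""
    tried = {a for a, _ in context}
    untried = [a for a in range(N_ARMS) if a not in tried]
    return untried[0] if untried else random.randrange(N_ARMS)

def exploit_policy(obs, context):
    """Greedy exploitation: repeat the arm with the best observed reward."""
    best = {}
    for a, r in context:
        best[a] = max(best.get(a, float("-inf")), r)
    return max(best, key=best.get)

env = ToyBandit(means=[-1.0, -0.5, 2.0, -0.8, -0.2])
context = []                      # in-context memory shared across episodes
for _ in range(N_ARMS):           # explore phase: gather information
    run_episode(explore_policy, env, context)
exploit_reward = run_episode(exploit_policy, env, context)  # exploit phase
```

Separating the two roles means the explore phase is free to take low-reward actions purely for information, something a single reward-maximizing policy is penalized for doing.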

Developing algorithms with human-level performance on previously unseen hard-exploration domains is one of the primary obstacles to artificial general intelligence (AGI). The team suggests that combining First-Explore with a curriculum, such as the AdA curriculum, could be a step in that direction. They believe that, provided the genuine and serious safety issues associated with developing AGI are handled appropriately, such progress would help realize AGI’s great potential benefits.

The computational resources devoted to domain randomization during training allow First-Explore to learn intelligent exploration strategies, such as searching thoroughly over the first ten actions and then prioritizing sampling those with high rewards. Once trained, the explore policy can be extremely sample-efficient when adapting to new tasks. Given that standard RL appears successful despite exploring with its exploit policy, one might question how costly this constraint really is. The researchers contend that the gap becomes most noticeable when one wants to explore and exploit intelligently, with human-level adaptation, on complex tasks.
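The role of domain randomization can be sketched as a meta-objective: each meta-training iteration samples a fresh task from a distribution, and the return of the exploit phase, given the context collected by the explore phase, is the quantity being maximized. The Gaussian-bandit task distribution and fixed heuristic policies below are illustrative assumptions, not the authors' training setup, which learns both policies.

```python
import random

rng = random.Random(0)

def sample_task(n_arms=5):
    """Domain randomization: draw a new bandit with random arm means."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_arms)]

def meta_objective(n_tasks=100, n_exploit=10):
    """Average per-step exploit reward across randomized tasks.

    Meta-training would push this number up by improving both the
    explore policy (better context) and the exploit policy (better
    use of context). Here both are fixed heuristics for illustration.
    """
    total = 0.0
    for _ in range(n_tasks):
        means = sample_task()
        # Explore phase: pull each arm once to build context.
        context = {a: means[a] + rng.gauss(0.0, 0.1)
                   for a in range(len(means))}
        best = max(context, key=context.get)
        # Exploit phase: its return is the meta-training signal.
        total += sum(means[best] + rng.gauss(0.0, 0.1)
                     for _ in range(n_exploit))
    return total / (n_tasks * n_exploit)

obj = meta_objective()
```

Because tasks are resampled every iteration, the policies cannot memorize any single environment; they must learn a general explore-then-exploit procedure that works in context on new tasks.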

Even on straightforward domains like the multi-armed Gaussian bandit, First-Explore performs better, and it dramatically increases performance on sacrificial-exploration domains like the Dark Prize Room environment (where the average expected prize value is negative). The findings from both problem domains highlight how optimal exploration and optimal exploitation differ, specifically in how much of the state or action space each covers and whether it directly earns high reward, and show that recognizing this difference is key to effective in-context learning.
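The sacrificial-exploration intuition can be illustrated on a toy bandit where the average prize value is negative, loosely inspired by the Dark Prize Room description above. The arm values and the greedy baseline are assumptions for illustration: a single policy that always maximizes reward never pays the cost of checking bad-looking arms, while a dedicated explore phase sacrifices reward up front and finds the one valuable arm.

```python
import random

rng = random.Random(42)
MEANS = [-1.0, -1.0, -1.0, -1.0, 10.0]  # average prize value is negative

def pull(arm):
    return MEANS[arm] + rng.gauss(0.0, 0.5)

def greedy_single_policy(steps=20):
    """One policy both explores and exploits: after a bad first pull,
    it keeps re-pulling whichever arm has looked least bad so far,
    so it never discovers the hidden positive arm."""
    seen = {0: pull(0)}
    total = seen[0]
    for _ in range(steps - 1):
        arm = max(seen, key=seen.get)   # sticks with the least-bad arm
        r = pull(arm)
        seen[arm] = max(seen[arm], r)
        total += r
    return total

def explore_then_exploit(steps=20, n_explore=5):
    """Explore policy samples every arm (sacrificing reward up front),
    then the exploit policy repeats the best one."""
    seen = {arm: pull(arm) for arm in range(len(MEANS))}
    total = sum(seen.values())          # exploration cost is paid here
    best = max(seen, key=seen.get)
    for _ in range(steps - n_explore):
        total += pull(best)
    return total

greedy_return = greedy_single_policy()
sacrificial_return = explore_then_exploit()
```

Since most pulls lose reward on average, a policy trained only to maximize return learns to avoid exploring at all, which is exactly why decoupling the explore policy from the exploit objective helps in such domains.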