Reinforcement learning is often regarded as one of the main routes towards artificial general intelligence. AI researchers at Google and the University of California, Berkeley, are working on ways to make life easier for researchers who build meta learning and reinforcement learning systems. Researchers Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn and Sergey Levine introduced an approach called Unsupervised Meta Learning, which allows an AI agent to acquire a distribution of tasks on its own. The agent can then go on to do meta learning over these tasks.
Meta learning is similar to multi task learning: an agent learns how to adapt to new tasks quickly. Meta learning can use reinforcement learning (RL) to solve new problems, and RL in turn becomes more efficient when the agent has been meta trained on related tasks. Meta learning algorithms tend to do well when their training data comes from the same distribution as the tasks the algorithm has to generalise to.
This means that the performance of meta learning algorithms depends heavily on the meta training task distribution: generalisation improves when the test tasks are drawn from a distribution similar to that of the meta training tasks. The researchers set out to automate the meta training process by dispensing with the need to hand-design meta training tasks. This goal is particularly difficult because two big problems have to be addressed together:
- Meta reinforcement learning with broad task distributions
- Unsupervised exploration for proposing a wide variety of tasks for meta learning
Unsupervised Meta Reinforcement Learning
As the researchers put it, the aim of unsupervised meta reinforcement learning is to observe an environment and produce a learning algorithm made specifically for that environment. This algorithm learns to maximise reward on any particular task in that environment. The researchers propose a framework that has two components. The first is a task identification procedure, which interacts with a controlled Markov process without a reward function, with the aim of constructing a distribution over tasks. The second performs the actual meta learning: given the reward functions of the acquired tasks, it meta learns a reinforcement learning procedure that can adapt to new tasks.
Here the choice of meta learning algorithm affects how the resulting reinforcement learning procedure behaves. Because of this, some meta reinforcement learning algorithms can adapt to new tasks while others simply cannot. The researchers take a stepwise approach: the algorithm first acquires a task distribution and then meta trains on tasks drawn from it. The paper explores two ways to extract task distributions from an environment.
1. Task acquisition via random discriminators
The researchers say that a simple and effective way to define a task distribution is to use random discriminators on states. Given a uniformly distributed random variable z, they define a random discriminator as a parametric function whose parameters are chosen randomly, for example through the random weight initialisation of a neural network.
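To make the idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a small two-layer network with fixed random weights as the discriminator, and it assumes that the reward for task z in state s is the log-probability the discriminator assigns to z. The function names, layer sizes and this reward definition are illustrative assumptions.

```python
import numpy as np


def make_random_discriminator(state_dim, num_tasks, hidden=32, seed=0):
    """Return a function mapping a state to log-probabilities over tasks z.

    The weights are drawn once at random and never trained (assumption:
    a two-layer tanh network stands in for the discriminator).
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 1.0 / np.sqrt(state_dim), size=(state_dim, hidden))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, num_tasks))

    def log_probs(state):
        h = np.tanh(np.asarray(state, dtype=float) @ w1)
        logits = h @ w2
        logits = logits - logits.max()  # numerically stable log-softmax
        return logits - np.log(np.exp(logits).sum())

    return log_probs


def sample_task(num_tasks, rng):
    """Draw a task index z uniformly at random."""
    return int(rng.integers(num_tasks))


def task_reward(log_probs, z, state):
    """Reward for task z in a state (assumed form: log-probability of z)."""
    return float(log_probs(state)[z])
```

A task is then simply an index z: calling `sample_task` picks one, and `task_reward` supplies the reward signal an RL agent would be trained on for that task.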
2. Task acquisition via diversity-driven exploration
The researchers try to acquire a wider variety of tasks when more unsupervised environment interaction is available. For this they use a technique called Diversity is All You Need (DIAYN). DIAYN tries to learn a set of behaviours that are distinguishable from one another. The researchers note that the method is fully unsupervised: there is no handcrafting of distance metrics or subgoals.
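The core of a DIAYN-style reward can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published implementation: a linear softmax classifier stands in for the learned discriminator q(z | s), the prior over skills is assumed uniform, and the reward is assumed to take the form log q(z | s) - log p(z). The class and function names are hypothetical.

```python
import numpy as np


class SkillDiscriminator:
    """Linear softmax classifier q(z | s); a stand-in for a neural network."""

    def __init__(self, state_dim, num_skills, lr=0.1):
        self.w = np.zeros((state_dim, num_skills))
        self.b = np.zeros(num_skills)
        self.lr = lr

    def log_probs(self, states):
        logits = np.atleast_2d(np.asarray(states, dtype=float)) @ self.w + self.b
        logits = logits - logits.max(axis=1, keepdims=True)
        return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    def update(self, states, skills):
        """One cross-entropy gradient step on (state, skill) pairs."""
        states = np.atleast_2d(np.asarray(states, dtype=float))
        probs = np.exp(self.log_probs(states))
        onehot = np.eye(self.w.shape[1])[np.asarray(skills)]
        grad = (probs - onehot) / len(states)
        self.w -= self.lr * states.T @ grad
        self.b -= self.lr * grad.sum(axis=0)


def diayn_reward(disc, state, skill):
    """Assumed reward: log q(z | s) - log p(z), with p(z) uniform over skills."""
    num_skills = disc.w.shape[1]
    log_q = disc.log_probs(state)[0, skill]
    return float(log_q + np.log(num_skills))
```

In a full system the skill-conditioned policy and the discriminator would be trained together: the policy is rewarded for visiting states from which its skill can be identified, and the discriminator is updated on the states the policy visits. Each learned skill then serves as one task for meta training.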
Meta Reinforcement Learning Using The Acquired Task Distributions
The methods above describe how to obtain a distribution of tasks. The researchers then use a meta learning algorithm to acquire an adaptation procedure from this task distribution. In standard meta reinforcement learning, tasks T are drawn from a task distribution specified manually by the researcher; here that distribution is replaced by the one acquired without supervision. Every task is a different Markov decision process (MDP). The main aim of meta RL is to learn a reinforcement learning procedure that can adapt to new tasks. The meta learning algorithm used here is MAML, that is, model agnostic meta learning. MAML learns an initialisation from which adaptation to a new task is very fast.
The researchers mention that the tasks used in training should be close to the types of tasks that might be seen at meta test time. They found that unsupervised meta training learns the dynamics of the controlled Markov process (CMP). They also found that meta learning helps the policy to modify its behaviour in many ways.
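The mechanics of MAML can be illustrated with a toy Python sketch. To keep it short and runnable, it uses one-parameter linear regression tasks instead of RL tasks, which is a deliberate simplification and not the setting of the paper; the task sampler, step sizes and function names are all assumptions made for illustration. The inner loop adapts the parameter to one task with a single gradient step, and the outer loop updates the initialisation so that the adapted parameter performs well on held-out data from the same task.

```python
import numpy as np


def loss_and_grad(theta, x, y):
    """Mean squared error of the model y_hat = theta * x, and its gradient."""
    err = theta * x - y
    return float(np.mean(err ** 2)), float(2.0 * np.mean(x * err))


def inner_adapt(theta, x, y, alpha):
    """Inner loop: one gradient step on a single task's training data."""
    _, grad = loss_and_grad(theta, x, y)
    return theta - alpha * grad


def maml_meta_gradient(theta, task, alpha):
    """Gradient of the post-adaptation validation loss w.r.t. the initialisation.

    For this quadratic loss the derivative through the inner step is exact:
    d(theta_adapted)/d(theta) = 1 - alpha * (second derivative of train loss).
    """
    (x_tr, y_tr), (x_val, y_val) = task
    theta_adapted = inner_adapt(theta, x_tr, y_tr, alpha)
    _, grad_val = loss_and_grad(theta_adapted, x_val, y_val)
    hessian_tr = 2.0 * float(np.mean(x_tr ** 2))
    return grad_val * (1.0 - alpha * hessian_tr)


def sample_task(rng, n=10):
    """Hypothetical task distribution: y = slope * x, slope uniform in [1, 3]."""
    slope = rng.uniform(1.0, 3.0)
    x = rng.normal(size=2 * n)
    y = slope * x
    return (x[:n], y[:n]), (x[n:], y[n:])


def meta_train(steps=200, alpha=0.05, beta=0.05, batch=8, seed=0):
    """Outer loop: average meta-gradients over a batch of sampled tasks."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        grads = [maml_meta_gradient(theta, sample_task(rng), alpha)
                 for _ in range(batch)]
        theta -= beta * float(np.mean(grads))
    return theta
```

In the unsupervised setting, `sample_task` would be replaced by the acquired task distribution, and the regression loss by an RL objective estimated from rollouts.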
Systems based on unsupervised meta reinforcement learning learn better than standard reinforcement learning on simulated 2D navigation and locomotion tasks. The tasks were of increasing difficulty: 2D point navigation, 2D locomotion using the “HalfCheetah”, and 3D locomotion using the “Ant”. The system is also reported to perform better than approaches that rely on human-designed, tuned reward functions. The results suggest that unsupervised meta learning can explore the problem space and construct useful reward signals on its own.