One of the newer developments in deep learning is curriculum learning, where algorithms are trained on examples presented in a meaningful order of increasing complexity, rather than simply being fed examples in no particular order.
A study by researchers who worked with OpenAI has dug deep into curriculum learning. Their approach, called Teacher-Student Curriculum Learning (TSCL), aims to be a game changer in learning the subtasks associated with major deep learning tasks. This article looks at the technical details of TSCL.
What Is TSCL?
In curriculum learning, a Teacher algorithm assigns subtasks to a Student algorithm in increasing order of complexity, while the Student performs them and returns a score. This is repeated until the Student performs all tasks successfully.
As the Student masters a particular task, the Teacher assigns more probability to the next task and focuses less on the current one, since it has already been learnt.
Repeating tasks in this way makes learning faster. One important point is that the Teacher algorithm also learns simultaneously, alongside the Student algorithm. This forms the basis of the TSCL algorithm.
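The Teacher-Student loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `ToyStudent`, `teacher_probabilities`, the 0.9 mastery threshold and the 0.9/0.1 probability split are all hypothetical choices made for this example.

```python
import random


class ToyStudent:
    """Hypothetical stand-in for a learner: skill on a task grows with practice."""

    def __init__(self, num_tasks):
        self.skill = [0.0] * num_tasks

    def train_and_score(self, task):
        # Assumption: a task only becomes learnable once the previous one is mostly learnt.
        ready = task == 0 or self.skill[task - 1] > 0.8
        if ready:
            self.skill[task] = min(1.0, self.skill[task] + 0.1)
        return self.skill[task]


def teacher_probabilities(scores, mastery=0.9):
    """Put most of the probability on the first task that is not yet mastered."""
    num_tasks = len(scores)
    if num_tasks == 1:
        return [1.0]
    current = next((i for i, s in enumerate(scores) if s < mastery), num_tasks - 1)
    # Keep a small chance of revisiting the other tasks.
    probs = [0.1 / (num_tasks - 1)] * num_tasks
    probs[current] = 0.9
    return probs


def run_curriculum(num_tasks=4, steps=500, seed=0):
    rng = random.Random(seed)
    student = ToyStudent(num_tasks)
    scores = [0.0] * num_tasks
    for _ in range(steps):
        probs = teacher_probabilities(scores)
        task = rng.choices(range(num_tasks), weights=probs)[0]
        # The Student trains on the task and reports its score back to the Teacher.
        scores[task] = student.train_and_score(task)
        if all(s >= 0.9 for s in scores):
            break
    return scores


if __name__ == "__main__":
    print(run_curriculum())
```

Here the Teacher follows a fixed rule. In TSCL the Teacher itself learns which task to pick, which is what distinguishes it from a hand-written schedule like this one.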
In fact, the study specifies TSCL as a Partially Observable Markov Decision Process (POMDP). Two POMDP formulations are laid out: one for reinforcement learning (simple training) and the other for supervised learning (batch training). The POMDP formulation is chosen so that the Teacher algorithm’s rewards can be optimised in line with the Student algorithm’s sub-task performance.
Matiisen et al., the creators of TSCL, say: “While an obvious choice for optimization criteria would have been the performance in the final task, initially the Student might not have any success in the final task and this does not provide any meaningful feedback signal to the Teacher. Therefore we choose to maximize the sum of performances in all tasks. The assumption here is that in curriculum learning the final task includes the elements of all previous tasks, therefore good performance in the intermediate tasks usually leads to good performance in the final task.”
On this front, POMDPs are generally solved using RL, but the training itself takes time and becomes iterative. Thus, the researchers draw on the popular ‘non-stationary multi-armed bandit problem’ and propose the following new algorithms for use in TSCL.
- Online Algorithm
- Naive Algorithm
- Window Algorithm
- Sampling Algorithm
All of these algorithms are built around improvement in the Student’s scores, while also keeping track of how many times each task has been performed in the Teacher-Student setup.
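To make the bandit idea concrete, here is a minimal sketch of a Teacher that treats each task as an arm of a non-stationary bandit. It is written in the spirit of the Online algorithm listed above but is not taken from the study's code: the class name `OnlineBanditTeacher`, the epsilon-greedy choice, the use of the absolute value of the smoothed score change, and the values of `alpha` and `epsilon` are all assumptions of this example.

```python
import random


class OnlineBanditTeacher:
    """Hypothetical Teacher: one bandit arm per task, rewarded by score change."""

    def __init__(self, num_tasks, alpha=0.1, epsilon=0.1, seed=0):
        self.q = [0.0] * num_tasks           # smoothed score change per task
        self.last_score = [0.0] * num_tasks  # last score seen per task
        self.counts = [0] * num_tasks        # how often each task was trained
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose_task(self):
        # Explore a random task with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        # Otherwise pick the task whose score is changing fastest. The absolute
        # value (an assumption here) also brings back tasks whose score is dropping.
        return max(range(len(self.q)), key=lambda i: abs(self.q[i]))

    def update(self, task, score):
        # Reward is the change in the Student's score on this task.
        reward = score - self.last_score[task]
        self.last_score[task] = score
        self.counts[task] += 1
        # Exponentially weighted average, so that old rewards fade out
        # (the rewards are non-stationary because the Student keeps learning).
        self.q[task] = self.alpha * reward + (1 - self.alpha) * self.q[task]
```

A training loop would call `choose_task()`, train the Student on the returned task, and pass the resulting score to `update()`. Once a task is mastered its score stops changing, its value decays towards zero, and the Teacher moves on to other tasks.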
TSCL In Decimal Number Addition And Minecraft
Curriculum learning has many research-oriented applications. Decimal number addition with an LSTM is one notable example, where a sequence-to-sequence model was implemented. Although this technique has found success in supervised learning, it has faced setbacks, either in learning performance or in using too much memory for addition.
Thus, Matiisen et al. take up this problem for analysis. As mentioned earlier, the batch training POMDP is used as the TSCL method here. The addition is studied in two settings: 1-dimensional curriculum teaching and 2-dimensional curriculum teaching. In the former, each task is defined by the maximum number of digits in the numbers involved in the addition, while the latter adds a second criterion: the length of each number is taken separately.
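As a rough illustration of the two settings, the sketch below generates addition problems as text pairs of the kind a sequence-to-sequence model would consume. It reflects one possible reading of the setup and is not the study's data pipeline: the function names, the `"a+b"` string format and the choice of fixing the exact digit count of each operand are assumptions of this example.

```python
import random


def sample_number(num_digits, rng):
    """Return a random non-negative integer with exactly num_digits digits."""
    if num_digits == 1:
        return rng.randint(0, 9)
    return rng.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)


def sample_1d(num_digits, rng):
    """1-dimensional task: a single digit count shared by both operands."""
    a = sample_number(num_digits, rng)
    b = sample_number(num_digits, rng)
    return f"{a}+{b}", str(a + b)


def sample_2d(digits_a, digits_b, rng):
    """2-dimensional task: the digit count of each operand is set separately."""
    a = sample_number(digits_a, rng)
    b = sample_number(digits_b, rng)
    return f"{a}+{b}", str(a + b)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_1d(2, rng))     # an easy task: two 2-digit numbers
    print(sample_2d(5, 1, rng))  # a mixed task: a 5-digit plus a 1-digit number
```

Under this reading, a 1-dimensional curriculum with up to n digits gives the Teacher n tasks to choose from, while the 2-dimensional version gives it n × n.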
The popular video game Minecraft was also used to experiment with reinforcement learning strategies. Using Microsoft’s Project Malmo with OpenAI Gym, Matiisen and team created a 5-step curriculum. It generates random mazes in Minecraft, which the learning agent navigates in order to learn the maze environment. (A detailed account of the Minecraft training can be found here.)
In this case too, the Minecraft agent learns faster with every iteration. If the agent is trained on the five-step task without working through each intermediate step, it fails badly at learning the environment, which supports the researchers’ critique of optimising only for the final task.
While this study has paved the way for using TSCL in a handful of applications, it has yet to stand alongside standard reinforcement learning and supervised learning algorithms. Nonetheless, TSCL could ease complexity in algorithms at every stage (by dividing work into sub-tasks, for example), thus reducing the burden on computing power and similar resources.