In simpler terms, the back-propagation algorithm can be likened to learning how to ride a bicycle. After a few unfortunate falls, one learns how to avoid falling. Every fall teaches the rider to balance better and not lean too far to either side, so as to reach the destination without falling.
Today, back-propagation is part of almost all the neural networks that are deployed in object detection, recommender systems, chatbots and other such applications. It has become part of the de-facto industry standard, and the term doesn’t sound strange even to an AI outsider. However, this invention is not as recent as it appears to be.
It was popularised more than three decades ago in a 1986 paper by David Rumelhart, Ronald Williams and one of the pioneers of modern AI, University of Toronto’s Geoffrey Hinton. Back in 1986, few people grasped the significance of this technique, and now the machine learning community can’t do without it.
Back-propagation is an ingenious idea, but it has its own set of disadvantages, such as vanishing or exploding gradients. So, in an attempt to find an alternative that avoids back-propagation, a team of researchers from New Zealand published their work titled ‘HSIC (Hilbert-Schmidt independence criterion) Bottleneck: Deep Learning without back-propagation’.
Overview Of Back-Propagation
Source: Matt Mazur
Back-propagation is the procedure of repeatedly adjusting the weights of the connections in the network to minimize the difference between the actual output and the desired output. As a result of these weight adjustments, the hidden units of the neural network come to represent key features of the data.
As can be seen in the above illustration of a basic neural network, the error E, the magnitude of the difference between the actual and desired output, is propagated back to the hidden layer and is eventually used to adjust the weights of the network. This is done over and over until the error is acceptably small.
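The loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the setup from the figure: the network size (2 inputs, 2 hidden units, 2 outputs), the sigmoid activations, the absence of bias terms, the learning rate `lr`, and the input and target values are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, w1, w2, lr=0.5):
    """One forward and backward pass for a tiny 2-layer sigmoid network (no biases)."""
    # Forward pass: input -> hidden -> output
    h = sigmoid(x @ w1)
    out = sigmoid(h @ w2)
    error = 0.5 * np.sum((target - out) ** 2)  # squared error E

    # Backward pass: propagate the error from the output to the hidden layer
    delta_out = (out - target) * out * (1.0 - out)
    delta_h = (delta_out @ w2.T) * h * (1.0 - h)

    # Gradient-descent update of both weight matrices
    new_w2 = w2 - lr * (h.T @ delta_out)
    new_w1 = w1 - lr * (x.T @ delta_h)
    return new_w1, new_w2, error

rng = np.random.default_rng(0)
x = np.array([[0.05, 0.10]])       # illustrative input
target = np.array([[0.01, 0.99]])  # illustrative desired output
w1 = rng.normal(scale=0.5, size=(2, 2))
w2 = rng.normal(scale=0.5, size=(2, 2))

errors = []
for _ in range(1000):
    w1, w2, e = train_step(x, target, w1, w2)
    errors.append(e)

print("first error:", errors[0], "last error:", errors[-1])
```

Each call to `train_step` is one "fall and correct" cycle: the error is measured at the output, pushed back through the hidden layer, and used to nudge every weight in the direction that reduces it.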
Back-prop gave neural networks the ability to create new useful features from the same data. Regardless of its drawbacks, its presence will continue to be felt across many domains using AI. However, a look at how neural networks can still be good at what they do without the help of back-propagation is a welcome change. In the next section we shall look at how information bottlenecks were used to achieve state-of-the-art results.
What Does HSIC Bottleneck Do Differently?
The above figure gives an overview of how training is done using HSIC. The HSIC-trained network, Figure (a), is a standard feedforward network trained using the HSIC IB objective, resulting in hidden representations at the last layer that can be trained rapidly.
Figure (b) shows the σ-combined network, where each branch of the network HSIC-net is trained with a specific σ.
The approach here is to train the network by using an approximation of the information bottleneck instead of back-propagation.
In the next step, a surrogate for the mutual information between hidden representations and labels (here, the HSIC) is found and maximised. At the same time, the mutual dependency between hidden representations and the inputs is minimised.
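To make the objective more tangible, here is a minimal NumPy sketch of the standard biased empirical HSIC estimator, tr(K_x H K_y H) / (m - 1)^2, and of a bottleneck-style loss built from it. The Gaussian kernel, the kernel width `sigma`, the weighting factor `beta`, and the function names are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Gaussian (RBF) kernel matrix for the rows of x. Kernel choice is an assumption."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between paired samples x and y (one sample per row)."""
    m = x.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    kx = gaussian_kernel(x, sigma)
    ky = gaussian_kernel(y, sigma)
    return np.trace(kx @ h @ ky @ h) / (m - 1) ** 2

def hsic_bottleneck_loss(z, x, y, beta=100.0, sigma=1.0):
    """Loss for one hidden representation z: low dependence on the inputs x,
    high dependence on the labels y. beta is an illustrative value."""
    return hsic(z, x, sigma) - beta * hsic(z, y, sigma)

# HSIC is larger for dependent variables than for independent ones
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dependent = x + 0.1 * rng.normal(size=(200, 1))
y_independent = rng.normal(size=(200, 1))
print(hsic(x, y_dependent) > hsic(x, y_independent))
```

Because the loss for a layer depends only on that layer's representation, the inputs and the labels, it can in principle be evaluated per layer without sending an error signal back through the whole network.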
Thus, each hidden representation from the HSIC-trained network may contain different information obtained by optimizing the HSIC bottleneck objective at a particular scale. Then the aggregator sums the hidden representations to form an output representation.
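The aggregation step can be pictured with a short sketch. The branches below are hypothetical stand-ins (fixed random ReLU layers) rather than actual HSIC-trained networks, and the σ values and dimensions are illustrative assumptions; the only point is that the branch outputs share a shape and are summed element-wise.

```python
import numpy as np

def make_branch(rng, in_dim, out_dim):
    """Hypothetical stand-in for one HSIC-trained branch: a fixed random ReLU layer."""
    w = rng.normal(size=(in_dim, out_dim))
    return lambda x: np.maximum(x @ w, 0.0)

def aggregate(branches, x):
    """Sum the branch representations element-wise into one output representation."""
    return np.sum([branch(x) for branch in branches], axis=0)

rng = np.random.default_rng(0)
sigmas = [1.0, 2.0, 4.0]  # illustrative scales, one branch per sigma
branches = [make_branch(rng, in_dim=8, out_dim=10) for _ in sigmas]

x = rng.normal(size=(5, 8))  # batch of 5 inputs
combined = aggregate(branches, x)
print(combined.shape)  # (5, 10)
```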
An intuition for the HSIC approach here is provided by the fact that the series expansion of the exponential contains a weighted sum of all moments of the data, and two distributions are equal if and only if their moments are identical.
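As a worked illustration of this intuition, assume a Gaussian kernel of width σ (an assumption made here for concreteness). Expanding the exponential of the inner product as a power series gives:

```latex
k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)
        = \exp\!\left(-\frac{\lVert x \rVert^{2}}{2\sigma^{2}}\right)
          \exp\!\left(-\frac{\lVert y \rVert^{2}}{2\sigma^{2}}\right)
          \sum_{n=0}^{\infty} \frac{(x^{\top} y)^{n}}{\sigma^{2n}\, n!}
```

The n-th term contains degree-n products of the coordinates of x and y, so an expectation of the kernel over the data mixes in moments of every order rather than only means and covariances.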
The authors claim in their paper that this method:
- facilitates parallel processing and requires significantly fewer operations.
- does not suffer from exploding or vanishing gradients.
- is biologically more plausible than back-propagation, as there is no requirement for symmetric feedback.
- provides performance on MNIST/FashionMNIST/CIFAR10 classification comparable to back-propagation.
- reaches state-of-the-art performance when a single layer trained with SGD (without back-propagation) is appended.
The information bottleneck method itself is at least 20 years old. It was introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. The method’s main objective is to find the sweet spot between accuracy and complexity, that is, between how much a representation preserves about the target and how much it compresses the input.
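This trade-off is usually written as a single objective. In the notation assumed here, X is the input, Y the target, T the compressed representation, I(·;·) mutual information, and β a positive weight that sets the balance:

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

A small β favours aggressive compression of X, while a large β favours keeping whatever in T is informative about Y.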
Applications include distributional clustering and dimension reduction, and more recently it has been suggested as a theoretical foundation for deep learning.
The successful demonstration of HSIC as a training method is an indication of the growing research interest in exploring deep learning fundamentals from an information-theoretic perspective.